On 13/11/2023 11:46, Eelco Chaudron wrote:
>
>
> On 12 Nov 2023, at 8:22, Roi Dayan wrote:
>
>> On 10/11/2023 16:23, Aaron Conole wrote:
>>> Roi Dayan <[email protected]> writes:
>>>
>>>> On 08/11/2023 21:04, Aaron Conole wrote:
>>>>> Roi Dayan via dev <[email protected]> writes:
>>>>>
>>>>>> On 02/11/2023 16:11, Eelco Chaudron wrote:
>>>>>>>
>>>>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote:
>>>>>>>
>>>>>>>> Add personal words list as spellcheck.txt and load it
>>>>>>>> into enchant spell checker. This file is generated from
>>>>>>>> codespell dictionary.txt and contains words users use
>>>>>>>> but enchant spell checker failed on like
>>>>>>>> refcount, pthread, enqueuing, etc.
>>>>>>>>
>>>>>>>> Signed-off-by: Roi Dayan <[email protected]>
>>>>>>> Thanks for the patch, but it doesn’t look right to add the full list
>>>>>>> of words to the OVS repository.
>>>>>>>
>>>>>>> Maybe we can update the extra_keywords list with the most common
>>>>>>> missing ones, and add a command line option to include a
>>>>>>> user-defined file for people who want this?
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>
>>>>>> I think it is needed. It's a dictionary of most commonly used words
>>>>>> and the enchant spell check does not seem to be enough.
>>>>>> Some examples enchant fails as I remember are:
>>>>>> lacp, dereferenced, valgrind, priv, syscall,..
>>>>>
>>>>> Well, we do add some of those - BUT I see that codespell probably has a
>>>>> more complete dictionary.
>>>>>
>>>>> The original implementation using enchant was due to there not being a
>>>>> clear winner at the time. Enchant is a spell checking frontend intended
>>>>> for lots of tools, so seemed like a good idea (for example, used by
>>>>> libreoffice, AbiWord, and others).
>>>>>
>>>>> It may be that codepspell is more appropriate since it is targeted at
>>>>> development spell checking. OTOH, codespell will miss lots of
>>>>> misspellings where enchant will have a larger lexicon. I have no real
>>>>> opinion on which framework makes sense - but we shouldn't include a
>>>>> dictionary. After all, even linux checkpatch.pl will find the codespell
>>>>> dictionary and just use that as it exists.
>>>>>
>>>>> However, I will point out that there *is* a difference between the two.
>>>>> Here's a simple example:
>>>>>
>>>>> 02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo
>>>>> vailgrind | codespell -
>>>>> 02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>>>>>
>>>>> vs.
>>>>>
>>>>> 02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>>>>> ./utilities/checkpatch.py -S test.patch
>>>>> == Checking "test.patch" ==
>>>>> WARNING: Possible misspelled word: "vaigrind"
>>>>> Did you mean: ['grinder', 'valgrind']
>>>>>
>>>>> So I guess we can probably do something better - and maybe that
>>>>> something is to find the codespell dictionary cut it up and merge the
>>>>> values with the current session (instead of add(), ie: 2/2 in this
>>>>> series).
>>>>>
>>>>> But I don't think we need to copy the entire codespell dict.
>>>>>
>>>>
>>>> In linux checkpatch find codespell dictonary.txt, loads it into memory and
>>>> cut on '>' to get the word needed and looks like internal perl code does
>>>> the spell check.
>>>>
>>>> I can do the same here to find codespell dictionary.txt and cut on '>'
>>>> and I get the same dictionary I tried to add here but I need to create
>>>> a temporary file to load it with enchant.DictWithPWL()
>>>>
>>>> As doing a loop of calls to spell_check_dict.add_to_session() seems
>>>> to take a lot longer for the library by far as shown.
>>>>
>>>> Will that be ok?
>>>>
>>>> This is the example diff. I can send as an update patch if its ok.
>>>> Looking for the codespell dictionary as it can be different folder between
>>>> fedora and ubuntu for example.
>>>> Then loading it and creating temporary file to load into enchant which
>>>> is being removed when file is closed.
>>>
>>> I think it's a good idea to use the codespell dict, but it turns out we
>>> don't actually need the temp file. I've tested with the add_to_session
>>> patch, and here is my timing output:
>>>
>>> 09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time
>>> ./utilities/checkpatch.py -S -1 test.patch
>>> == Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") ==
>>> WARNING: Possible misspelled word: "readthedocs"
>>> Did you mean: ['headteachers']
>>> WARNING: Possible misspelled word: "dirhtml"
>>> Did you mean: ['dirtily']
>>> Subject: readthedocs: Use dirhtml builder.
>>> WARNING: Possible misspelled word: "ReadTheDocs"
>>> Did you mean: ['Headteachers']
>>> Lines checked: 44, Warnings: 3, Errors: 0
>>>
>>>
>>> real 0m0.369s
>>> user 0m0.344s
>>> sys 0m0.024s
>>>
>>> The difference being that I don't use a tempfile - just your delta above
>>> as:
>>>
>>> if codespell_file != '':
>>> with open(codespell_file) as f:
>>> for line in f.readlines():
>>> words = line.strip().split('>')[1].strip(',
>>> ').split(',')
>>> for word in words:
>>> spell_check_dict.add_to_session(word)
>>>
>>>
>>> Of course, looks like we would want to add other things like readthedocs,
>>> dirhtml, etc. Or possibly skip checking words that are proper nouns
>>> (like ReadTheDocs would be). Codespell ignores words it doesn't have a
>>> correction / suggestion for, while enchant will assume something it
>>> doesn't know is misspelled. This makes it so we will flag more often.
>>>
>>> WDYT?
>>
>>
>> This looks good.
>> We shouldn't lag more often than now as we only add more familiar words
>> to enchant session.
>>
>> Do you want me to create a proper patch and send it or you will do it
>> when taking the 2nd patch?
>
> I guess, you sending a new patch would be preferred as it can be reviewed by
> others.
>
> //Eelco
>
sure. will send v2 the two patches. thanks.
>>>
>>>> --- a/utilities/checkpatch.py
>>>> +++ b/utilities/checkpatch.py
>>>> @@ -37,8 +37,19 @@ spell_check_dict = None
>>>>
>>>>
>>>> def open_spell_check_dict():
>>>> + import tempfile
>>>> import enchant
>>>>
>>>> + try:
>>>> + import codespell_lib
>>>> + codespell_dir = os.path.dirname(codespell_lib.__file__)
>>>> + codespell_file = os.path.join(codespell_dir, 'data',
>>>> 'dictionary.txt')
>>>> + if not os.path.exists(codespell_file):
>>>> + codespell_file = ''
>>>> + except:
>>>> + codespell_file = ''
>>>> +
>>>> +
>>>> try:
>>>> extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd',
>>>> 'netdev', 'selinux', 'ovs-ctl', 'dpctl',
>>>> 'ofctl',
>>>> @@ -93,7 +104,19 @@ def open_spell_check_dict():
>>>> global spell_check_dict
>>>> word_dic =
>>>> os.path.join(os.path.dirname(os.path.abspath(__file__)),
>>>> 'dictionary.txt')
>>>> - spell_check_dict = enchant.DictWithPWL('en_US', word_dic)
>>>> +
>>>> + if codespell_file:
>>>> + with open(codespell_file) as f:
>>>> + with tempfile.NamedTemporaryFile(mode='w') as f2:
>>>> + for line in f.readlines():
>>>> + words = line.strip().split('>')[1].strip(',
>>>> ').split(',')
>>>> + for word in words:
>>>> + f2.write(word.strip()+'\n')
>>>> +
>>>> + spell_check_dict = enchant.DictWithPWL('en_US',
>>>> f2.name)
>>>> + else:
>>>> + spell_check_dict = enchant.Dict("en_US")
>>>> +
>>>> for kw in extra_keywords:
>>>> spell_check_dict.add_to_session(kw)
>>>>
>>>>
>>>>
>>>>
>>>> Example run with time, ignore the warning
>>>>
>>>> $ time python3 utilities/checkpatch.py -1 -S
>>>> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") ==
>>>> ERROR: Committer Roi Dayan <[email protected]> needs to sign off.
>>>> Lines checked: 31, Warnings: 0, Errors: 1
>>>>
>>>>
>>>> real 0m0.517s
>>>> user 0m0.304s
>>>> sys 0m0.067s
>>>>
>>>>
>>>>
>>>>>> Also adding the entire dictionary the script is even faster than
>>>>>> adding word by word as done now.
>>>>>>
>>>>>> I think maybe removing the add word by word part at all but checking and
>>>>>> doing
>>>>>> in steps.
>>>>>>
>>>>>> Just by adding the dictionary and having the words being added through
>>>>>> python
>>>>>> already exists seems to be faster.
>>>>>>
>>>>>> Checking small commit before loading dictionry.txt:
>>>>>>
>>>>>> $ time ./utilities/checkpatch.py -S -1
>>>>>>
>>>>>> real 0m28.379s
>>>>>> user 0m0.272s
>>>>>> sys 0m0.223s
>>>>>>
>>>>>>
>>>>>> and after:
>>>>>>
>>>>>> $ time ./utilities/checkpatch.py -S -1
>>>>>>
>>>>>> real 0m0.238s
>>>>>> user 0m0.138s
>>>>>> sys 0m0.038s
>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Eelco
>>>>>>>
>>>>>>>> ---
>>>>>>>> utilities/automake.mk | 1 +
>>>>>>>> utilities/checkpatch.py | 4 +-
>>>>>>>> utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++
>>>>>>>> 3 files changed, 16165 insertions(+), 1 deletion(-)
>>>>>>>> create mode 100644 utilities/dictionary.txt
>>>>>>
>>>>>> _______________________________________________
>>>>>> dev mailing list
>>>>>> [email protected]
>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>
>>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev