On 12 Nov 2023, at 8:22, Roi Dayan wrote:

> On 10/11/2023 16:23, Aaron Conole wrote:
>> Roi Dayan <[email protected]> writes:
>>
>>> On 08/11/2023 21:04, Aaron Conole wrote:
>>>> Roi Dayan via dev <[email protected]> writes:
>>>>
>>>>> On 02/11/2023 16:11, Eelco Chaudron wrote:
>>>>>>
>>>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote:
>>>>>>
>>>>>>> Add personal words list as spellcheck.txt and load it
>>>>>>> into enchant spell checker. This file is generated from
>>>>>>> codespell dictionary.txt and contains words users use
>>>>>>> but enchant spell checker failed on like
>>>>>>> refcount, pthread, enqueuing, etc.
>>>>>>>
>>>>>>> Signed-off-by: Roi Dayan <[email protected]>
>>>>>> Thanks for the patch, but it doesn’t look right to add the full list
>>>>>> of words to the OVS repository.
>>>>>>
>>>>>> Maybe we can update the extra_keywords list with the most common
>>>>>> missing ones, and add a command line option to include a
>>>>>> user-defined file for people who want this?
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>
>>>>> I think it is needed. It's a dictionary of most commonly used words
>>>>> and the enchant spell check does not seem to be enough.
>>>>> Some examples enchant fails as I remember are:
>>>>> lacp, dereferenced, valgrind, priv, syscall,..
>>>>
>>>> Well, we do add some of those - BUT I see that codespell probably has a
>>>> more complete dictionary.
>>>>
>>>> The original implementation using enchant was due to there not being a
>>>> clear winner at the time.  Enchant is a spell checking frontend intended
>>>> for lots of tools, so seemed like a good idea (for example, used by
>>>> libreoffice, AbiWord, and others).
>>>>
>>>> It may be that codepspell is more appropriate since it is targeted at
>>>> development spell checking.  OTOH, codespell will miss lots of
>>>> misspellings where enchant will have a larger lexicon.  I have no real
>>>> opinion on which framework makes sense - but we shouldn't include a
>>>> dictionary.  After all, even linux checkpatch.pl will find the codespell
>>>> dictionary and just use that as it exists.
>>>>
>>>> However, I will point out that there *is* a difference between the two.
>>>> Here's a simple example:
>>>>
>>>>   02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo vailgrind 
>>>> | codespell -
>>>>   02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>>>>
>>>> vs.
>>>>
>>>>   02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>>>> ./utilities/checkpatch.py -S test.patch
>>>>   == Checking "test.patch" ==
>>>>   WARNING: Possible misspelled word: "vaigrind"
>>>>   Did you mean:  ['grinder', 'valgrind']
>>>>
>>>> So I guess we can probably do something better - and maybe that
>>>> something is to find the codespell dictionary cut it up and merge the
>>>> values with the current session (instead of add(), ie: 2/2 in this
>>>> series).
>>>>
>>>> But I don't think we need to copy the entire codespell dict.
>>>>
>>>
>>> In linux checkpatch find codespell dictonary.txt, loads it into memory and
>>> cut on '>' to get the word needed and looks like internal perl code does
>>> the spell check.
>>>
>>> I can do the same here to find codespell dictionary.txt and cut on '>'
>>> and I get the same dictionary I tried to add here but I need to create
>>> a temporary file to load it with enchant.DictWithPWL()
>>>
>>> As doing a loop of calls to spell_check_dict.add_to_session() seems
>>> to take a lot longer for the library by far as shown.
>>>
>>> Will that be ok?
>>>
>>> This is the example diff. I can send as an update patch if its ok.
>>> Looking for the codespell dictionary as it can be different folder between
>>> fedora and ubuntu for example.
>>> Then loading it and creating temporary file to load into enchant which
>>> is being removed when file is closed.
>>
>> I think it's a good idea to use the codespell dict, but it turns out we
>> don't actually need the temp file.  I've tested with the add_to_session
>> patch, and here is my timing output:
>>
>> 09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time 
>> ./utilities/checkpatch.py -S -1 test.patch
>> == Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") ==
>> WARNING: Possible misspelled word: "readthedocs"
>> Did you mean:  ['headteachers']
>> WARNING: Possible misspelled word: "dirhtml"
>> Did you mean:  ['dirtily']
>> Subject: readthedocs: Use dirhtml builder.
>> WARNING: Possible misspelled word: "ReadTheDocs"
>> Did you mean:  ['Headteachers']
>> Lines checked: 44, Warnings: 3, Errors: 0
>>
>>
>> real 0m0.369s
>> user 0m0.344s
>> sys  0m0.024s
>>
>> The difference being that I don't use a tempfile - just your delta above
>> as:
>>
>>         if codespell_file != '':
>>             with open(codespell_file) as f:
>>                 for line in f.readlines():
>>                     words = line.strip().split('>')[1].strip(', ').split(',')
>>                     for word in words:
>>                         spell_check_dict.add_to_session(word)
>>
>>
>> Of course, looks like we would want to add other things like readthedocs,
>> dirhtml, etc.  Or possibly skip checking words that are proper nouns
>> (like ReadTheDocs would be).  Codespell ignores words it doesn't have a
>> correction / suggestion for, while enchant will assume something it
>> doesn't know is misspelled.  This makes it so we will flag more often.
>>
>> WDYT?
>
>
> This looks good.
> We shouldn't lag more often than now as we only add more familiar words
> to enchant session.
>
> Do you want me to create a proper patch and send it or you will do it
> when taking the 2nd patch?

I guess, you sending a new patch would be preferred as it can be reviewed by 
others.

//Eelco

>>
>>> --- a/utilities/checkpatch.py
>>> +++ b/utilities/checkpatch.py
>>> @@ -37,8 +37,19 @@ spell_check_dict = None
>>>
>>>
>>>  def open_spell_check_dict():
>>> +    import tempfile
>>>      import enchant
>>>
>>> +    try:
>>> +        import codespell_lib
>>> +        codespell_dir = os.path.dirname(codespell_lib.__file__)
>>> +        codespell_file = os.path.join(codespell_dir, 'data', 
>>> 'dictionary.txt')
>>> +        if not os.path.exists(codespell_file):
>>> +            codespell_file = ''
>>> +    except:
>>> +        codespell_file = ''
>>> +
>>> +
>>>      try:
>>>          extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd',
>>>                            'netdev', 'selinux', 'ovs-ctl', 'dpctl', 'ofctl',
>>> @@ -93,7 +104,19 @@ def open_spell_check_dict():
>>>          global spell_check_dict
>>>          word_dic = os.path.join(os.path.dirname(os.path.abspath(__file__)),
>>>                                                  'dictionary.txt')
>>> -        spell_check_dict = enchant.DictWithPWL('en_US', word_dic)
>>> +
>>> +        if codespell_file:
>>> +            with open(codespell_file) as f:
>>> +                with tempfile.NamedTemporaryFile(mode='w') as f2:
>>> +                    for line in f.readlines():
>>> +                        words = line.strip().split('>')[1].strip(', 
>>> ').split(',')
>>> +                        for word in words:
>>> +                            f2.write(word.strip()+'\n')
>>> +
>>> +                    spell_check_dict = enchant.DictWithPWL('en_US', 
>>> f2.name)
>>> +        else:
>>> +            spell_check_dict = enchant.Dict("en_US")
>>> +
>>>          for kw in extra_keywords:
>>>              spell_check_dict.add_to_session(kw)
>>>
>>>
>>>
>>>
>>> Example run with time, ignore the warning
>>>
>>> $ time python3 utilities/checkpatch.py -1 -S
>>> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") ==
>>> ERROR: Committer Roi Dayan <[email protected]> needs to sign off.
>>> Lines checked: 31, Warnings: 0, Errors: 1
>>>
>>>
>>> real    0m0.517s
>>> user    0m0.304s
>>> sys     0m0.067s
>>>
>>>
>>>
>>>>> Also adding the entire dictionary the script is even faster than
>>>>> adding word by word as done now.
>>>>>
>>>>> I think maybe removing the add word by word part at all but checking and 
>>>>> doing
>>>>> in steps.
>>>>>
>>>>> Just by adding the dictionary and having the words being added through 
>>>>> python
>>>>> already exists seems to be faster.
>>>>>
>>>>> Checking small commit before loading dictionry.txt:
>>>>>
>>>>> $ time ./utilities/checkpatch.py  -S -1
>>>>>
>>>>> real    0m28.379s
>>>>> user    0m0.272s
>>>>> sys     0m0.223s
>>>>>
>>>>>
>>>>> and after:
>>>>>
>>>>> $ time ./utilities/checkpatch.py  -S -1
>>>>>
>>>>> real    0m0.238s
>>>>> user    0m0.138s
>>>>> sys     0m0.038s
>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Eelco
>>>>>>
>>>>>>> ---
>>>>>>>  utilities/automake.mk    |     1 +
>>>>>>>  utilities/checkpatch.py  |     4 +-
>>>>>>>  utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++
>>>>>>>  3 files changed, 16165 insertions(+), 1 deletion(-)
>>>>>>>  create mode 100644 utilities/dictionary.txt
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> [email protected]
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to