Roi Dayan <[email protected]> writes:

> On 08/11/2023 21:04, Aaron Conole wrote:
>> Roi Dayan via dev <[email protected]> writes:
>> 
>>> On 02/11/2023 16:11, Eelco Chaudron wrote:
>>>>
>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote:
>>>>
>>>>> Add personal words list as spellcheck.txt and load it
>>>>> into enchant spell checker. This file is generated from
>>>>> codespell dictionary.txt and contains words users use
>>>>> but enchant spell checker failed on like
>>>>> refcount, pthread, enqueuing, etc.
>>>>>
>>>>> Signed-off-by: Roi Dayan <[email protected]>
>>>> Thanks for the patch, but it doesn’t look right to add the full list
>>>> of words to the OVS repository.
>>>>
>>>> Maybe we can update the extra_keywords list with the most common
>>>> missing ones, and add a command line option to include a
>>>> user-defined file for people who want this?
>>>>
>>>> What do you think?
>>>>
>>>
>>> I think it is needed. It's a dictionary of most commonly used words
>>> and the enchant spell check does not seem to be enough.
>>> Some examples enchant fails as I remember are:
>>> lacp, dereferenced, valgrind, priv, syscall,..
>> 
>> Well, we do add some of those - BUT I see that codespell probably has a
>> more complete dictionary.
>> 
>> The original implementation using enchant was due to there not being a
>> clear winner at the time.  Enchant is a spell checking frontend intended
>> for lots of tools, so seemed like a good idea (for example, used by
>> libreoffice, AbiWord, and others).
>> 
>> It may be that codepspell is more appropriate since it is targeted at
>> development spell checking.  OTOH, codespell will miss lots of
>> misspellings where enchant will have a larger lexicon.  I have no real
>> opinion on which framework makes sense - but we shouldn't include a
>> dictionary.  After all, even linux checkpatch.pl will find the codespell
>> dictionary and just use that as it exists.
>> 
>> However, I will point out that there *is* a difference between the two.
>> Here's a simple example:
>> 
>>   02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo vailgrind | 
>> codespell -
>>   02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ 
>> 
>> vs.
>> 
>>   02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>> ./utilities/checkpatch.py -S test.patch
>>   == Checking "test.patch" ==
>>   WARNING: Possible misspelled word: "vaigrind"
>>   Did you mean:  ['grinder', 'valgrind']
>> 
>> So I guess we can probably do something better - and maybe that
>> something is to find the codespell dictionary cut it up and merge the
>> values with the current session (instead of add(), ie: 2/2 in this
>> series).
>> 
>> But I don't think we need to copy the entire codespell dict.
>> 
>
> In linux checkpatch find codespell dictonary.txt, loads it into memory and
> cut on '>' to get the word needed and looks like internal perl code does
> the spell check.
>
> I can do the same here to find codespell dictionary.txt and cut on '>'
> and I get the same dictionary I tried to add here but I need to create
> a temporary file to load it with enchant.DictWithPWL()
>
> As doing a loop of calls to spell_check_dict.add_to_session() seems
> to take a lot longer for the library by far as shown.
>
> Will that be ok? 
>
> This is the example diff. I can send as an update patch if its ok.
> Looking for the codespell dictionary as it can be different folder between
> fedora and ubuntu for example.
> Then loading it and creating temporary file to load into enchant which
> is being removed when file is closed.

I think it's a good idea to use the codespell dict, but it turns out we
don't actually need the temp file.  I've tested with the add_to_session
patch, and here is my timing output:

09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time 
./utilities/checkpatch.py -S -1 test.patch 
== Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") ==
WARNING: Possible misspelled word: "readthedocs"
Did you mean:  ['headteachers']
WARNING: Possible misspelled word: "dirhtml"
Did you mean:  ['dirtily']
Subject: readthedocs: Use dirhtml builder.
WARNING: Possible misspelled word: "ReadTheDocs"
Did you mean:  ['Headteachers']
Lines checked: 44, Warnings: 3, Errors: 0


real    0m0.369s
user    0m0.344s
sys     0m0.024s

The difference being that I don't use a tempfile - just your delta above
as:

        if codespell_file != '':
            with open(codespell_file) as f:
                for line in f.readlines():
                    words = line.strip().split('>')[1].strip(', ').split(',')
                    for word in words:
                        spell_check_dict.add_to_session(word)


Of course, looks like we would want to add other things like readthedocs,
dirhtml, etc.  Or possibly skip checking words that are proper nouns
(like ReadTheDocs would be).  Codespell ignores words it doesn't have a
correction / suggestion for, while enchant will assume something it
doesn't know is misspelled.  This makes it so we will flag more often.

WDYT?

> --- a/utilities/checkpatch.py
> +++ b/utilities/checkpatch.py
> @@ -37,8 +37,19 @@ spell_check_dict = None
>  
>  
>  def open_spell_check_dict():
> +    import tempfile
>      import enchant
>  
> +    try:
> +        import codespell_lib
> +        codespell_dir = os.path.dirname(codespell_lib.__file__)
> +        codespell_file = os.path.join(codespell_dir, 'data', 
> 'dictionary.txt')
> +        if not os.path.exists(codespell_file):
> +            codespell_file = ''
> +    except:
> +        codespell_file = ''
> +
> +
>      try:
>          extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd',
>                            'netdev', 'selinux', 'ovs-ctl', 'dpctl', 'ofctl',
> @@ -93,7 +104,19 @@ def open_spell_check_dict():
>          global spell_check_dict
>          word_dic = os.path.join(os.path.dirname(os.path.abspath(__file__)),
>                                                  'dictionary.txt')
> -        spell_check_dict = enchant.DictWithPWL('en_US', word_dic)
> +
> +        if codespell_file:
> +            with open(codespell_file) as f:
> +                with tempfile.NamedTemporaryFile(mode='w') as f2:
> +                    for line in f.readlines():
> +                        words = line.strip().split('>')[1].strip(', 
> ').split(',')
> +                        for word in words:
> +                            f2.write(word.strip()+'\n')
> +
> +                    spell_check_dict = enchant.DictWithPWL('en_US', f2.name)
> +        else:
> +            spell_check_dict = enchant.Dict("en_US")
> +
>          for kw in extra_keywords:
>              spell_check_dict.add_to_session(kw)
>
>
>
>
> Example run with time, ignore the warning
>
> $ time python3 utilities/checkpatch.py -1 -S
> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") ==
> ERROR: Committer Roi Dayan <[email protected]> needs to sign off.
> Lines checked: 31, Warnings: 0, Errors: 1
>
>
> real    0m0.517s
> user    0m0.304s
> sys     0m0.067s
>
>
>
>>> Also adding the entire dictionary the script is even faster than
>>> adding word by word as done now.
>>>
>>> I think maybe removing the add word by word part at all but checking and 
>>> doing
>>> in steps.
>>>
>>> Just by adding the dictionary and having the words being added through 
>>> python
>>> already exists seems to be faster.
>>>
>>> Checking small commit before loading dictionry.txt:
>>>
>>> $ time ./utilities/checkpatch.py  -S -1
>>>
>>> real    0m28.379s
>>> user    0m0.272s
>>> sys     0m0.223s
>>>
>>>
>>> and after:
>>>
>>> $ time ./utilities/checkpatch.py  -S -1
>>>
>>> real    0m0.238s
>>> user    0m0.138s
>>> sys     0m0.038s
>>>
>>>> Cheers,
>>>>
>>>> Eelco
>>>>
>>>>> ---
>>>>>  utilities/automake.mk    |     1 +
>>>>>  utilities/checkpatch.py  |     4 +-
>>>>>  utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++
>>>>>  3 files changed, 16165 insertions(+), 1 deletion(-)
>>>>>  create mode 100644 utilities/dictionary.txt
>>>
>>> _______________________________________________
>>> dev mailing list
>>> [email protected]
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to