On 10/11/2023 16:23, Aaron Conole wrote:
> Roi Dayan <[email protected]> writes:
>
>> On 08/11/2023 21:04, Aaron Conole wrote:
>>> Roi Dayan via dev <[email protected]> writes:
>>>
>>>> On 02/11/2023 16:11, Eelco Chaudron wrote:
>>>>>
>>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote:
>>>>>
>>>>>> Add personal words list as spellcheck.txt and load it
>>>>>> into enchant spell checker. This file is generated from
>>>>>> codespell dictionary.txt and contains words users use
>>>>>> but enchant spell checker failed on like
>>>>>> refcount, pthread, enqueuing, etc.
>>>>>>
>>>>>> Signed-off-by: Roi Dayan <[email protected]>
>>>>> Thanks for the patch, but it doesn’t look right to add the full list
>>>>> of words to the OVS repository.
>>>>>
>>>>> Maybe we can update the extra_keywords list with the most common
>>>>> missing ones, and add a command line option to include a
>>>>> user-defined file for people who want this?
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> I think it is needed. It's a dictionary of most commonly used words
>>>> and the enchant spell check does not seem to be enough.
>>>> Some examples enchant fails as I remember are:
>>>> lacp, dereferenced, valgrind, priv, syscall,..
>>>
>>> Well, we do add some of those - BUT I see that codespell probably has a
>>> more complete dictionary.
>>>
>>> The original implementation using enchant was due to there not being a
>>> clear winner at the time. Enchant is a spell checking frontend intended
>>> for lots of tools, so seemed like a good idea (for example, used by
>>> libreoffice, AbiWord, and others).
>>>
>>> It may be that codepspell is more appropriate since it is targeted at
>>> development spell checking. OTOH, codespell will miss lots of
>>> misspellings where enchant will have a larger lexicon. I have no real
>>> opinion on which framework makes sense - but we shouldn't include a
>>> dictionary. After all, even linux checkpatch.pl will find the codespell
>>> dictionary and just use that as it exists.
>>>
>>> However, I will point out that there *is* a difference between the two.
>>> Here's a simple example:
>>>
>>> 02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo vailgrind
>>> | codespell -
>>> 02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>>>
>>> vs.
>>>
>>> 02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$
>>> ./utilities/checkpatch.py -S test.patch
>>> == Checking "test.patch" ==
>>> WARNING: Possible misspelled word: "vaigrind"
>>> Did you mean: ['grinder', 'valgrind']
>>>
>>> So I guess we can probably do something better - and maybe that
>>> something is to find the codespell dictionary cut it up and merge the
>>> values with the current session (instead of add(), ie: 2/2 in this
>>> series).
>>>
>>> But I don't think we need to copy the entire codespell dict.
>>>
>>
>> In linux checkpatch find codespell dictonary.txt, loads it into memory and
>> cut on '>' to get the word needed and looks like internal perl code does
>> the spell check.
>>
>> I can do the same here to find codespell dictionary.txt and cut on '>'
>> and I get the same dictionary I tried to add here but I need to create
>> a temporary file to load it with enchant.DictWithPWL()
>>
>> As doing a loop of calls to spell_check_dict.add_to_session() seems
>> to take a lot longer for the library by far as shown.
>>
>> Will that be ok?
>>
>> This is the example diff. I can send as an update patch if its ok.
>> Looking for the codespell dictionary as it can be different folder between
>> fedora and ubuntu for example.
>> Then loading it and creating temporary file to load into enchant which
>> is being removed when file is closed.
>
> I think it's a good idea to use the codespell dict, but it turns out we
> don't actually need the temp file. I've tested with the add_to_session
> patch, and here is my timing output:
>
> 09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time
> ./utilities/checkpatch.py -S -1 test.patch
> == Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") ==
> WARNING: Possible misspelled word: "readthedocs"
> Did you mean: ['headteachers']
> WARNING: Possible misspelled word: "dirhtml"
> Did you mean: ['dirtily']
> Subject: readthedocs: Use dirhtml builder.
> WARNING: Possible misspelled word: "ReadTheDocs"
> Did you mean: ['Headteachers']
> Lines checked: 44, Warnings: 3, Errors: 0
>
>
> real 0m0.369s
> user 0m0.344s
> sys 0m0.024s
>
> The difference being that I don't use a tempfile - just your delta above
> as:
>
> if codespell_file != '':
> with open(codespell_file) as f:
> for line in f.readlines():
> words = line.strip().split('>')[1].strip(', ').split(',')
> for word in words:
> spell_check_dict.add_to_session(word)
>
>
> Of course, looks like we would want to add other things like readthedocs,
> dirhtml, etc. Or possibly skip checking words that are proper nouns
> (like ReadTheDocs would be). Codespell ignores words it doesn't have a
> correction / suggestion for, while enchant will assume something it
> doesn't know is misspelled. This makes it so we will flag more often.
>
> WDYT?
This looks good.
We shouldn't lag more often than now as we only add more familiar words
to enchant session.
Do you want me to create a proper patch and send it or you will do it
when taking the 2nd patch?
>
>> --- a/utilities/checkpatch.py
>> +++ b/utilities/checkpatch.py
>> @@ -37,8 +37,19 @@ spell_check_dict = None
>>
>>
>> def open_spell_check_dict():
>> + import tempfile
>> import enchant
>>
>> + try:
>> + import codespell_lib
>> + codespell_dir = os.path.dirname(codespell_lib.__file__)
>> + codespell_file = os.path.join(codespell_dir, 'data',
>> 'dictionary.txt')
>> + if not os.path.exists(codespell_file):
>> + codespell_file = ''
>> + except:
>> + codespell_file = ''
>> +
>> +
>> try:
>> extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd',
>> 'netdev', 'selinux', 'ovs-ctl', 'dpctl', 'ofctl',
>> @@ -93,7 +104,19 @@ def open_spell_check_dict():
>> global spell_check_dict
>> word_dic = os.path.join(os.path.dirname(os.path.abspath(__file__)),
>> 'dictionary.txt')
>> - spell_check_dict = enchant.DictWithPWL('en_US', word_dic)
>> +
>> + if codespell_file:
>> + with open(codespell_file) as f:
>> + with tempfile.NamedTemporaryFile(mode='w') as f2:
>> + for line in f.readlines():
>> + words = line.strip().split('>')[1].strip(',
>> ').split(',')
>> + for word in words:
>> + f2.write(word.strip()+'\n')
>> +
>> + spell_check_dict = enchant.DictWithPWL('en_US', f2.name)
>> + else:
>> + spell_check_dict = enchant.Dict("en_US")
>> +
>> for kw in extra_keywords:
>> spell_check_dict.add_to_session(kw)
>>
>>
>>
>>
>> Example run with time, ignore the warning
>>
>> $ time python3 utilities/checkpatch.py -1 -S
>> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") ==
>> ERROR: Committer Roi Dayan <[email protected]> needs to sign off.
>> Lines checked: 31, Warnings: 0, Errors: 1
>>
>>
>> real 0m0.517s
>> user 0m0.304s
>> sys 0m0.067s
>>
>>
>>
>>>> Also adding the entire dictionary the script is even faster than
>>>> adding word by word as done now.
>>>>
>>>> I think maybe removing the add word by word part at all but checking and
>>>> doing
>>>> in steps.
>>>>
>>>> Just by adding the dictionary and having the words being added through
>>>> python
>>>> already exists seems to be faster.
>>>>
>>>> Checking small commit before loading dictionry.txt:
>>>>
>>>> $ time ./utilities/checkpatch.py -S -1
>>>>
>>>> real 0m28.379s
>>>> user 0m0.272s
>>>> sys 0m0.223s
>>>>
>>>>
>>>> and after:
>>>>
>>>> $ time ./utilities/checkpatch.py -S -1
>>>>
>>>> real 0m0.238s
>>>> user 0m0.138s
>>>> sys 0m0.038s
>>>>
>>>>> Cheers,
>>>>>
>>>>> Eelco
>>>>>
>>>>>> ---
>>>>>> utilities/automake.mk | 1 +
>>>>>> utilities/checkpatch.py | 4 +-
>>>>>> utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++
>>>>>> 3 files changed, 16165 insertions(+), 1 deletion(-)
>>>>>> create mode 100644 utilities/dictionary.txt
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> [email protected]
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev