On 12 Nov 2023, at 8:22, Roi Dayan wrote:
> On 10/11/2023 16:23, Aaron Conole wrote: >> Roi Dayan <[email protected]> writes: >> >>> On 08/11/2023 21:04, Aaron Conole wrote: >>>> Roi Dayan via dev <[email protected]> writes: >>>> >>>>> On 02/11/2023 16:11, Eelco Chaudron wrote: >>>>>> >>>>>> On 2 Nov 2023, at 14:20, Roi Dayan via dev wrote: >>>>>> >>>>>>> Add personal words list as spellcheck.txt and load it >>>>>>> into enchant spell checker. This file is generated from >>>>>>> codespell dictionary.txt and contains words users use >>>>>>> but enchant spell checker failed on like >>>>>>> refcount, pthread, enqueuing, etc. >>>>>>> >>>>>>> Signed-off-by: Roi Dayan <[email protected]> >>>>>> Thanks for the patch, but it doesn’t look right to add the full list >>>>>> of words to the OVS repository. >>>>>> >>>>>> Maybe we can update the extra_keywords list with the most common >>>>>> missing ones, and add a command line option to include a >>>>>> user-defined file for people who want this? >>>>>> >>>>>> What do you think? >>>>>> >>>>> >>>>> I think it is needed. It's a dictionary of most commonly used words >>>>> and the enchant spell check does not seem to be enough. >>>>> Some examples enchant fails as I remember are: >>>>> lacp, dereferenced, valgrind, priv, syscall,.. >>>> >>>> Well, we do add some of those - BUT I see that codespell probably has a >>>> more complete dictionary. >>>> >>>> The original implementation using enchant was due to there not being a >>>> clear winner at the time. Enchant is a spell checking frontend intended >>>> for lots of tools, so seemed like a good idea (for example, used by >>>> libreoffice, AbiWord, and others). >>>> >>>> It may be that codepspell is more appropriate since it is targeted at >>>> development spell checking. OTOH, codespell will miss lots of >>>> misspellings where enchant will have a larger lexicon. I have no real >>>> opinion on which framework makes sense - but we shouldn't include a >>>> dictionary. After all, even linux checkpatch.pl will find the codespell >>>> dictionary and just use that as it exists. >>>> >>>> However, I will point out that there *is* a difference between the two. >>>> Here's a simple example: >>>> >>>> 02:01:19 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ echo vailgrind >>>> | codespell - >>>> 02:01:24 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ >>>> >>>> vs. >>>> >>>> 02:00:18 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ >>>> ./utilities/checkpatch.py -S test.patch >>>> == Checking "test.patch" == >>>> WARNING: Possible misspelled word: "vaigrind" >>>> Did you mean: ['grinder', 'valgrind'] >>>> >>>> So I guess we can probably do something better - and maybe that >>>> something is to find the codespell dictionary cut it up and merge the >>>> values with the current session (instead of add(), ie: 2/2 in this >>>> series). >>>> >>>> But I don't think we need to copy the entire codespell dict. >>>> >>> >>> In linux checkpatch find codespell dictonary.txt, loads it into memory and >>> cut on '>' to get the word needed and looks like internal perl code does >>> the spell check. >>> >>> I can do the same here to find codespell dictionary.txt and cut on '>' >>> and I get the same dictionary I tried to add here but I need to create >>> a temporary file to load it with enchant.DictWithPWL() >>> >>> As doing a loop of calls to spell_check_dict.add_to_session() seems >>> to take a lot longer for the library by far as shown. >>> >>> Will that be ok? >>> >>> This is the example diff. I can send as an update patch if its ok. >>> Looking for the codespell dictionary as it can be different folder between >>> fedora and ubuntu for example. >>> Then loading it and creating temporary file to load into enchant which >>> is being removed when file is closed. >> >> I think it's a good idea to use the codespell dict, but it turns out we >> don't actually need the temp file. I've tested with the add_to_session >> patch, and here is my timing output: >> >> 09:19:43 aconole@RHTPC1VM0NT {(594d145410...)} ~/git/ovs$ time >> ./utilities/checkpatch.py -S -1 test.patch >> == Checking 594d145410c5 ("readthedocs: Use dirhtml builder.") == >> WARNING: Possible misspelled word: "readthedocs" >> Did you mean: ['headteachers'] >> WARNING: Possible misspelled word: "dirhtml" >> Did you mean: ['dirtily'] >> Subject: readthedocs: Use dirhtml builder. >> WARNING: Possible misspelled word: "ReadTheDocs" >> Did you mean: ['Headteachers'] >> Lines checked: 44, Warnings: 3, Errors: 0 >> >> >> real 0m0.369s >> user 0m0.344s >> sys 0m0.024s >> >> The difference being that I don't use a tempfile - just your delta above >> as: >> >> if codespell_file != '': >> with open(codespell_file) as f: >> for line in f.readlines(): >> words = line.strip().split('>')[1].strip(', ').split(',') >> for word in words: >> spell_check_dict.add_to_session(word) >> >> >> Of course, looks like we would want to add other things like readthedocs, >> dirhtml, etc. Or possibly skip checking words that are proper nouns >> (like ReadTheDocs would be). Codespell ignores words it doesn't have a >> correction / suggestion for, while enchant will assume something it >> doesn't know is misspelled. This makes it so we will flag more often. >> >> WDYT? > > > This looks good. > We shouldn't lag more often than now as we only add more familiar words > to enchant session. > > Do you want me to create a proper patch and send it or you will do it > when taking the 2nd patch? I guess, you sending a new patch would be preferred as it can be reviewed by others. //Eelco >> >>> --- a/utilities/checkpatch.py >>> +++ b/utilities/checkpatch.py >>> @@ -37,8 +37,19 @@ spell_check_dict = None >>> >>> >>> def open_spell_check_dict(): >>> + import tempfile >>> import enchant >>> >>> + try: >>> + import codespell_lib >>> + codespell_dir = os.path.dirname(codespell_lib.__file__) >>> + codespell_file = os.path.join(codespell_dir, 'data', >>> 'dictionary.txt') >>> + if not os.path.exists(codespell_file): >>> + codespell_file = '' >>> + except: >>> + codespell_file = '' >>> + >>> + >>> try: >>> extra_keywords = ['ovs', 'vswitch', 'vswitchd', 'ovs-vswitchd', >>> 'netdev', 'selinux', 'ovs-ctl', 'dpctl', 'ofctl', >>> @@ -93,7 +104,19 @@ def open_spell_check_dict(): >>> global spell_check_dict >>> word_dic = os.path.join(os.path.dirname(os.path.abspath(__file__)), >>> 'dictionary.txt') >>> - spell_check_dict = enchant.DictWithPWL('en_US', word_dic) >>> + >>> + if codespell_file: >>> + with open(codespell_file) as f: >>> + with tempfile.NamedTemporaryFile(mode='w') as f2: >>> + for line in f.readlines(): >>> + words = line.strip().split('>')[1].strip(', >>> ').split(',') >>> + for word in words: >>> + f2.write(word.strip()+'\n') >>> + >>> + spell_check_dict = enchant.DictWithPWL('en_US', >>> f2.name) >>> + else: >>> + spell_check_dict = enchant.Dict("en_US") >>> + >>> for kw in extra_keywords: >>> spell_check_dict.add_to_session(kw) >>> >>> >>> >>> >>> Example run with time, ignore the warning >>> >>> $ time python3 utilities/checkpatch.py -1 -S >>> == Checking dcfe1402d26e ("checkpatch: Fix personal word list storage.") == >>> ERROR: Committer Roi Dayan <[email protected]> needs to sign off. >>> Lines checked: 31, Warnings: 0, Errors: 1 >>> >>> >>> real 0m0.517s >>> user 0m0.304s >>> sys 0m0.067s >>> >>> >>> >>>>> Also adding the entire dictionary the script is even faster than >>>>> adding word by word as done now. >>>>> >>>>> I think maybe removing the add word by word part at all but checking and >>>>> doing >>>>> in steps. >>>>> >>>>> Just by adding the dictionary and having the words being added through >>>>> python >>>>> already exists seems to be faster. >>>>> >>>>> Checking small commit before loading dictionry.txt: >>>>> >>>>> $ time ./utilities/checkpatch.py -S -1 >>>>> >>>>> real 0m28.379s >>>>> user 0m0.272s >>>>> sys 0m0.223s >>>>> >>>>> >>>>> and after: >>>>> >>>>> $ time ./utilities/checkpatch.py -S -1 >>>>> >>>>> real 0m0.238s >>>>> user 0m0.138s >>>>> sys 0m0.038s >>>>> >>>>>> Cheers, >>>>>> >>>>>> Eelco >>>>>> >>>>>>> --- >>>>>>> utilities/automake.mk | 1 + >>>>>>> utilities/checkpatch.py | 4 +- >>>>>>> utilities/dictionary.txt | 16161 +++++++++++++++++++++++++++++++++++++ >>>>>>> 3 files changed, 16165 insertions(+), 1 deletion(-) >>>>>>> create mode 100644 utilities/dictionary.txt >>>>> >>>>> _______________________________________________ >>>>> dev mailing list >>>>> [email protected] >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>> >> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
