Splendid! Let's get hacking!

On Tuesday, July 08, 2014 17:13:14 Martin Sandsmark wrote:
> On Tuesday 8. July 2014 04.29.57 Karthik Periagaram wrote:
> > I reported this over the weekend, having encountered it all through my grad
> > years when having automatic spell check in katepart would highlight
> > "spelling errors" all over data files. So, the goal is to make sonnet
> > smarter and avoid spell checking numbers, generally speaking.
>
> Well, I can't really think of any "correct" words with numbers in them, so
> why not just check if the word contains any digits? Or maybe just make the
> tokenizer split on anything that QChar::isLetter() returns false for?

I see a problem with that approach. Splitting on any non-letter character
would mean something like "cat,1" will pass spell check just because "cat" is
spelled correctly. I think it should fail the spell check, don't you?

So, I think I have a working fix. To play with my ideas, I've been using this
test code:

http://paste.kde.org/p5vt7oqht

As you can see, the main function here uses a test string containing some
alphanumeric text, which it is supposed to split into pieces and then decide
whether each of those pieces is a valid word or not. A valid word is one that
will be sent to the spell checker. An invalid word will not be sent and
effectively passes the spell check. The intent is that numbers should be
treated as invalid words. Everything else gets spell checked.

The first big change I would like to propose is to use Line instead of Word
as the boundary type for the QTextBoundaryFinder. Using Word breaks 1.0e+1
into three words: 1.0e, + and 1. Line, on the other hand, finds boundaries
only at spaces and hyphenation points (per Qt Assistant, places where text can
break into multiple lines). You can run this program with both Line and Word
and see the difference in the output. I think you'll agree that Line is very
much the behavior we want.

Quick CMakeLists.txt file to build the file above (I named the cpp file
test.cpp):

http://paste.kde.org/pecdtlxcu

The findNextWord() function is simple enough. It just finds the next word and
returns true if it found one, or false when there are no more words.

The next change I want to propose is in the isValidWord() function. Here, I
first check whether the word can be converted to a double. This catches
exponential notation such as 1.0e+1, which would otherwise slip through the
next test because it contains the letter e. Otherwise, I check for the
presence of at least one letter in the word. If there is one, the word will be
sent to the spell checker. I'm not testing for an empty string in this test
code, but that check should probably be retained in the sonnet code to handle
the empty buffer case.
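
For readers without access to the paste, here is a minimal sketch of the isValidWord() logic described above. It is not the code from the paste: it uses only the C++ standard library, with std::strtod standing in for a Qt string-to-double conversion and std::isalpha standing in for QChar::isLetter(), and it only handles single-byte characters. The helper name isNumber() is made up for this illustration.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if the *entire* word parses as a double,
// e.g. "42", "-3.14" or "1.0e+1".
// Caveat: std::strtod also accepts spellings such as "inf", "nan" and
// hex floats, which real code may want to treat differently.
bool isNumber(const std::string &word)
{
    if (word.empty()) {
        return false;
    }
    char *end = nullptr;
    std::strtod(word.c_str(), &end);
    // A number only counts if strtod consumed every character.
    return end == word.c_str() + word.size();
}

// A "valid" word is one that should be sent to the spell checker.
bool isValidWord(const std::string &word)
{
    // Empty-string check, as suggested for the real sonnet code.
    if (word.empty()) {
        return false;
    }
    // Numbers, including exponential notation, are never spell checked.
    if (isNumber(word)) {
        return false;
    }
    // Anything else needs at least one letter to be worth checking.
    for (unsigned char c : word) {
        if (std::isalpha(c)) {
            return true;
        }
    }
    return false;
}
```

With this sketch, "cat,1" counts as a valid word and so would be handed to the spell checker (and fail there), while "1.0e+1", "42" and a lone "+" are skipped. Note that the fragment "1.0e" produced by Word boundaries would be treated as a word here, which is one more reason to prefer Line.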

The output of the test program is basically each word identified by
QTextBoundaryFinder, followed by whether it is a valid word or not.

So, this is my proposed algorithm for the Filter class to parse the text and
send words to the spell checker. What do you think? Do you see any obvious
flaws in my assumptions?

Longer term, I think a test string like the one I used makes for an excellent
unit test (or tests) for sonnet to check its behavior against numbers. Shorter
term, I'll prep a patch for review, if you are okay with this approach.

> > If either are actively maintaining sonnet, I'd love to pepper you with more
> > questions!
>
> Feel free!
>
> And please do feel free to pepper me on IRC as well, I'm "sandsmark" there.

Pepper inbound! But seriously, I think I'll need to put this off till the
weekend, when I'll have the time to study it carefully. I mostly wanted to
clarify the execution path and confirm I understand how the classes interact
with each other (and maybe discuss the merits/shortcomings of the design).
Also, would you suggest we do that off the mailing list, or would it be
preferable to have it officially documented (maybe for the benefit of other
potential contributors)?

> And awesome that someone else wants to contribute to Sonnet, this warms my
> heart. :-)

K.