Re: [Sonnet] Introduction and request for guidance

Karthik Periagaram Tue, 08 Jul 2014 20:37:27 -0700

Splendid! Let's get hacking!

On Tuesday, July 08, 2014 17:13:14 Martin Sandsmark wrote:
> On Tuesday 8. July 2014 04.29.57 Karthik Periagaram wrote:
> > I reported this over the weekend, having encountered it all through my grad
> > years when having automatic spell check in katepart would highlight
> > "spelling errors" all over data files. So, the goal is to make sonnet
> > smarter and avoid spell checking numbers, generally speaking.
> 
> Well, I can't really think of any "correct" words with numbers in them, so
> why not just check if the word contains any digits? Or maybe just make the
> tokenizer split on anything that QChar::isLetter() returns false for?


I see a problem with that approach. Splitting on any non-letter character would 
mean something like "cat,1" will pass spell check just because cat is spelled 
correctly. I think it should fail the spell check, don't you?

So, I think I have a working fix. To play with my ideas, I've been using this 
test code:
http://paste.kde.org/p5vt7oqht

As you can see, the main function here uses a test string containing some 
alphanumeric text which it is supposed to split into pieces, then decide if 
each of those pieces is a valid word or not. A valid word is one that will be 
sent to the spell checker. An invalid word will not be sent and effectively 
passes the spell check. The intent is that numbers should be treated as invalid 
words. Everything else gets spell checked.

The first big change I would like to propose is to use Line instead of Word
as the boundary type for the QTextBoundaryFinder. Using Word breaks 1.0e+1
into three words: 1.0e, + and 1. Line, on the other hand, finds boundaries
only at spaces and hyphenation points (Qt Assistant describes these as places
where text can break into multiple lines). You can run this program with both
Line and Word and see the difference in the output. I think you'll agree that
Line is very much the behavior we want.
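
To illustrate the kind of tokens Line produces for the example above, here is
a minimal sketch that uses only the C++ standard library. It is not the
QTextBoundaryFinder code from the paste: splitOnWhitespace() is a hypothetical
helper that splits on whitespace only, so it approximates the Line behavior
for tokens like 1.0e+1 but does not model hyphenation points or any other
line-break opportunity.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text on runs of whitespace only.
// This is an approximation of Line boundaries, not a replacement for
// QTextBoundaryFinder; hyphenation points are deliberately ignored.
std::vector<std::string> splitOnWhitespace(const std::string &text)
{
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    // operator>> skips leading whitespace and reads up to the next whitespace.
    while (stream >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```

With this splitting, 1.0e+1 and cat,1 each stay in one piece, so the decision
about whether to spell check them can be made on the whole token.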

Quick CMakeLists.txt file to build the file above (I named the cpp file as 
test.cpp):
http://paste.kde.org/pecdtlxcu

The findNextWord() function is simple enough. It finds the next word and
returns true, or returns false when there are no more words. The next change
I want to propose is in the isValidWord() function. Here, I first check if
the word can convert to a double. This catches exponential notation, which
the subsequent letter test would otherwise treat as a word because it
contains an "e". If the conversion fails, I check for the presence of at
least one letter in the word. If there is one, the word will be sent to the
spell checker. I'm not testing for an empty string in this test code, but that
check should probably be retained in the sonnet code to handle the empty
buffer case.
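
The two-step check described above can be sketched roughly as follows. This is
a standard-library stand-in, not the code from the paste: it uses std::string
and std::strtod() where the Qt version would convert a QString to a double,
and std::isalpha() for the letter test, so it only handles letters in the
current C locale. Note that strtod() also accepts forms such as "inf", "nan"
and hexadecimal floats, which this sketch therefore treats as numbers.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Sketch of the isValidWord() idea: returns true if the token should be
// sent to the spell checker, false if it should be skipped.
bool isValidWord(const std::string &word)
{
    if (word.empty()) {
        return false; // empty buffer case: nothing to check
    }

    // Step 1: does the whole token parse as a floating-point number?
    // strtod() sets 'end' to the first character it could not consume.
    const char *begin = word.c_str();
    char *end = nullptr;
    std::strtod(begin, &end);
    if (end == begin + word.size()) {
        return false; // a number such as 1.0e+1: skip spell checking
    }

    // Step 2: otherwise the token is a word only if it has at least one letter.
    for (unsigned char c : word) {
        if (std::isalpha(c)) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, 1.0e+1 and 42 are skipped, cat and cat,1 are sent to
the spell checker, and a fragment like 1.0e (what the Word boundary type would
produce) is also sent, because it is not a complete number and contains a
letter.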

The output of the program is basically each word identified by 
QTextBoundaryFinder followed by whether it is a valid word or not.

So, this is my proposed algorithm for the Filter class to parse the text and 
send words to the spell checker. What do you think? Do you see any obvious 
flaws in my assumptions?

Longer term, I think a test string like the one I used makes for an excellent 
unit test (or tests) for sonnet to check its behavior against numbers. Shorter 
term, I'll prep a patch for review, if you are okay with this approach.

> > If either are actively maintaining sonnet, I'd love to pepper you with more
> > questions!
> 
> Feel free!
> 
> And please do feel free to pepper me on IRC as well, I'm "sandsmark" there. 

Pepper inbound! But seriously, I think I'll need to put this off till the 
weekend when I'll have the time to study it carefully. I mostly wanted to 
clarify the execution path and confirm I understand how the classes interact
with each other (and maybe discuss the merits/shortcomings of the design).
Also, would you suggest we do that off the mailing list, or would it be 
preferable to have it officially documented (maybe for the benefit of other 
potential contributors)?

> And awesome that someone else wants to contribute to Sonnet, this warms my 
> heart. :-)

K.
