----- Original Message -----
From: "Michael Bunk" <[EMAIL PROTECTED]>
To: "Gary Setter" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Sunday, August 14, 2005 10:06 PM
Subject: Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens

> Hi Gary,
>
> thanks for your reply, though you didn't make it too easy for me :)
>
> > Why are you using such an early version? I had issues with that
> > aspect of aspell some time ago. I thought they were resolved.
>
> I have tried 0.60.3 now, but the behaviour is the same.
>
> > You can see what I did in this project:
> > http://sourceforge.net/projects/descdatadiary/
> > It is a windows project. It does not try to do a Unix install,
> > but you may find what I did in speller_impl.cpp interesting.
>
> I have seen that you modified
> aspell-0.60.1-win32/modules/speller/default/speller_impl.cpp
> by implementing 3 new functions:
>
> int aspell_speller_word_seperator_length(speller, char *)
> It returns the number of bytes till the next word character, using the aspell
> internal function !lang_->is_alpha().
>
> int aspell_speller_word_length(speller, char *)
> It returns the number of bytes till the next non-word character, using
> lang_->is_alpha() as well.
>
> aspell_speller_add_lower_to_personal()
> This adds a lowercased version of the given string to the personal word list.
> I guess you implemented this for capitalized words at sentence starts?
>
> The problem I see with this approach is that you modified aspell internal
> functions. But since I want to use aspell as a library, such modifications
> are ruled out.
>
> While looking through the code I found that aspell implements a Tokenizer
> class, which seems to be designed to do the same. It is not exported, but
> it is used by the DocumentChecker class. Maybe I should try to use that?
>
> But its documentation in aspell.h is confusing (besides being misspelled :):
>
> /* process a string
>  * The string passed in should only be split on white space
>  * characters. Furthermore, between calles to reset, each string
>  * should be passed in exactly once and in the order they appeared
>  * in the document. Passing in stings out of order, skipping
>  * strings or passing them in more than once may lead to undefined
>  * results. */
> void aspell_document_checker_process(struct AspellDocumentChecker * ths, const
> char * str, int size)
>
> Does it mean I have to split my string to be checked at white space before
> passing in the pieces to this function? Or does it mean that this function
> usually only splits at white space?
>
> Kindest regards,
> Michael

Hello Michael,

I'm sorry that you had difficulty with my code. Maybe we can both learn
from this.

The new functions that you mentioned (aspell_speller_word_seperator_length
et al.) are for different problems. Also, I'm not sure whether I will submit
them to the aspell project.

I haven't been using the DocumentChecker class; maybe I should be. What I do
want is a word tokenizer that is aware of the character set and uses the
Language class's classification functions (e.g. is_alpha).

What I suspect is that there are two stages to tokenizing. First, something
needs to break the text into words so that abbreviations are checked as one
word, as Kevin pointed out. Second, within the check functions in
speller_impl.cpp, there is some further breakup going on so that hyphenated
words are accepted (web-site). It was at that second level that I was having
difficulty some time ago, and which I thought was fixed.

Kevin pointed you toward some examples. Are your problems now solved? Do you
agree that there are two levels of tokenization going on, and if so, at which
level are you having difficulty?

Best regards,
Gary
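
To make the two levels of tokenization concrete, here is a minimal sketch in plain C. It is not the code from speller_impl.cpp and it is not part of aspell's API. All function names (is_word_char, separator_length, word_length, count_chunks, count_word_parts) are made up for illustration, and the standard isalpha() stands in for aspell's internal lang_->is_alpha(), so it only handles ASCII in the default C locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for aspell's internal lang_->is_alpha().
 * Assumption: plain ASCII text in the default "C" locale. */
static int is_word_char(unsigned char c)
{
    return isalpha(c) != 0;
}

/* Number of bytes from s up to the next word character
 * (the length of the leading run of separators). */
size_t separator_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && !is_word_char((unsigned char)s[n]))
        n++;
    return n;
}

/* Number of bytes from s up to the next non-word character
 * (the length of the leading word). */
size_t word_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && is_word_char((unsigned char)s[n]))
        n++;
    return n;
}

/* Level 1: count whitespace-delimited chunks.
 * A hyphenated word such as "web-site" stays in one chunk. */
size_t count_chunks(const char *s)
{
    size_t count = 0;
    while (*s != '\0') {
        while (*s != '\0' && isspace((unsigned char)*s))
            s++;                      /* skip leading white space */
        if (*s == '\0')
            break;
        count++;
        while (*s != '\0' && !isspace((unsigned char)*s))
            s++;                      /* consume the chunk */
    }
    return count;
}

/* Level 2: count alphabetic runs.
 * Here "web-site" is broken into "web" and "site". */
size_t count_word_parts(const char *s)
{
    size_t count = 0;
    for (;;) {
        s += separator_length(s);     /* skip hyphens, spaces, punctuation */
        size_t len = word_length(s);
        if (len == 0)
            break;                    /* reached the end of the string */
        count++;
        s += len;
    }
    return count;
}
```

For the input "web-site is up", count_chunks() returns 3 (the first level keeps "web-site" together), while count_word_parts() returns 4 (the second level splits it at the hyphen). A real implementation would have to use the character classification of the active language instead of isalpha().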
