On Oct 11, 2005, at 7:52 AM, Hugo Lafayette wrote:

Why do not include that in the FrenchStemFilter "next()" method itself ?
It will be a bad design ?

I agree with your assessment. Conceptually, this is a stemming problem. By extension, it's not a tokenizing problem, and the behavior of the StandardTokenizer is fine.

The most recent Perl/CPAN version of the Snowball stemming library added an option to strip leading l', d', etc in French. I know this because until the most recent version, it didn't strip trailing 's in English, either, and I had to write some workarounds, just like you'll have to.

http://search.cpan.org/search?query=snowball

I'm curious: are there any cases in French where a string with an apostrophe in it ought to be split into two searchable tokens? I know of no such cases in English: you never want to search for the ll in you'll, or the O in O'Reilly, etc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to