Hi,

I'm not sure if this has been brought up before: I found a bug in
StandardTokenizer which misinterprets non-ASCII characters as word
boundaries. This happens only with words that also contain
non-alphanumeric characters.

Consider this example:

The text 'Gerd Schröder Straße' is properly tokenized to:

   ["Gerd", "Schröder", "Straße"]

as well as 'Gerd-Schroeder-Strasse':

["Gerd-Schroeder-Strasse"]

but 'Gerd-Schröder-Straße' yields:

   ["Gerd-Schr", "öder-Stra", "ße"]


So apparently, multibyte and non-word characters don't mix: it looks
as if the rule that keeps hyphenated compounds together only accepts
ASCII alphanumerics, so each multibyte character forces a token break.
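For reference, here's a minimal script that should reproduce it. This
is a sketch that assumes StandardTokenizer takes the input string in
its constructor and yields tokens via #next (as the 0.10/0.11 API
does); adjust if your version differs.

require 'rubygems'
require 'ferret'
include Ferret::Analysis

# Collect the token texts the tokenizer produces for a given string.
# Assumes StandardTokenizer.new(text) and stream.next, per the
# 0.10/0.11 TokenStream API.
def tokens(text)
  stream = StandardTokenizer.new(text)
  result = []
  while token = stream.next
    result << token.text
  end
  result
end

p tokens("Gerd Schröder Straße")    # ["Gerd", "Schröder", "Straße"]
p tokens("Gerd-Schroeder-Strasse")  # ["Gerd-Schroeder-Strasse"]
p tokens("Gerd-Schröder-Straße")    # ["Gerd-Schr", "öder-Stra", "ße"] -- the bug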

Cheers,
Andy
