normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and will work...
On Fri, Feb 20, 2009 at 11:16 PM, Philip Puffinburger < ppuffinbur...@tlcdelivers.com> wrote: > >some changes were made to the StandardTokenizer.jflex grammer (you can svn > diff the two URLs fairly trivially) to better deal with correctly > >identifying word characters, but from what i can tell that should have > reduced the number of splits, not increased them. > > > >it's hard to tell from your email (because it was sent in the windows-1252 > >charset) but what exactly are the unicode characters you are putting > through the tokenizer (ie: "\u0030") ? knowing where it's splitting would > >help figure out what's happening. > > These are the characters that of going through: > > \u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o > > It's splitting at the \u0301. > > >worst case scenerio, you could probably use the StandardTokenizer from > >2.3.2 with the rest of the 2.4 code. > > We've thought of that, but would be the last thing we did to get it back to > working. > > >this will show you exactly what changed... > >svn diff > > http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex> > http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex > > Thanks for the links. I've never dealt with JFlex, so I'll have to do > some reading to know what is going on in those files. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com