On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <trej...@trypticon.org> wrote:
> So I guess I have two questions:
> 1. Is there some way to do filtering to the text before
> tokenisation without upsetting the offsets reported by the tokeniser?
> 2. Is there some more general solution to this problem, such as an
> existing tokeniser similar to StandardTokeniser but with better
> Unicode awareness?
Hi,

I think you want to try the StandardTokenizer in 3.1 (make sure you pass Version.LUCENE_31 to get the new behavior). It implements the UAX#29 algorithm, which respects canonical equivalence... it sounds like that's what you want.

http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
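For anyone following along: in 3.1 you construct the tokenizer as `new StandardTokenizer(Version.LUCENE_31, reader)`, and the "canonical equivalence" point is a plain Unicode property, independent of Lucene. A minimal JDK-only sketch of what it means (the strings here are illustrative examples, not from the original thread):

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    public static void main(String[] args) {
        // "é" as a single precomposed code point (U+00E9)
        String composed = "caf\u00E9";
        // "é" as 'e' (U+0065) followed by a combining acute accent (U+0301)
        String decomposed = "cafe\u0301";

        // The two strings differ code point by code point...
        System.out.println(composed.equals(decomposed)); // false

        // ...but they are canonically equivalent: normalizing both to the
        // same form (NFC here) makes them identical, so a tokenizer that
        // honors canonical equivalence treats them as the same text.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}
```

Because the tokenizer itself handles this, you avoid the original problem of pre-filtering the text and corrupting the reported offsets.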