On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir <rcm...@gmail.com> wrote:
> On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz <trej...@trypticon.org> wrote:
>> So I guess I have two questions:
>> 1. Is there some way to do filtering to the text before
>> tokenisation without upsetting the offsets reported by the tokeniser?
>> 2. Is there some more general solution to this problem, such as an
>> existing tokeniser similar to StandardTokeniser but with better
>> Unicode awareness?
>>
>
> Hi, I think you want to try the StandardTokenizer in 3.1 (make sure
> you pass Version.LUCENE_31 to get the new behavior)
> It implements UAX#29 algorithm which respects canonical equivalence...
> it sounds like thats what you want.
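(If I'm reading the 3.1 API right, the suggested usage would be
something like the following. This is just a sketch based on the
javadocs; I haven't actually run it yet, so the details may be off.)

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class Uax29Check {
        public static void main(String[] args) throws Exception {
            // Passing Version.LUCENE_31 selects the new UAX#29-based
            // word-break rules instead of the old StandardTokenizer
            // grammar.
            StandardTokenizer tokenizer = new StandardTokenizer(
                    Version.LUCENE_31, new StringReader("some test text"));
            CharTermAttribute term =
                    tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset =
                    tokenizer.addAttribute(OffsetAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // The offsets should point back into the original text,
                // which is the part we care about.
                System.out.println(term + " [" + offset.startOffset()
                        + ".." + offset.endOffset() + "]");
            }
            tokenizer.end();
            tokenizer.close();
        }
    }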
This does sound like what we want, although it may take some time to
verify that UAX#29 actually breaks the text the way we want it to.
(There aren't any solid examples in the standard itself of how the
algorithm behaves on different kinds of text, which is a bit
unfortunate.)

The other problem is that we're still stuck on 2.9, because our
codebase still uses deprecated features and we have very little time
to do anything about it. Moving to the new API is taking a while, as
some of those API changes are quite tricky to refactor for.
TokenStream in particular can make fixing a single class take half a
day, once you add the time to verify that it's still working
correctly.

TX
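P.S. For anyone else facing the same 2.9 migration, the shape of the
change is roughly the following (a sketch from memory against the 2.9
API, so treat the details with suspicion; "process" is just a stand-in
for whatever the real class does with each token). The old deprecated
style:

    // Old (deprecated) API: the stream hands back Token objects.
    Token token = stream.next(new Token());
    while (token != null) {
        process(token.term(), token.startOffset(), token.endOffset());
        token = stream.next(token);
    }

turns into something like:

    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    // New attribute-based API: fetch the attributes once up front;
    // incrementToken() then updates them in place for each token.
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
    while (stream.incrementToken()) {
        process(term.term(), offset.startOffset(), offset.endOffset());
    }

The mechanical rewrite isn't too bad; it's verifying that each
converted class still behaves identically that eats the time.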