Hi there,
I have been recently trying to build a lucene index out of ngrams and
seem to have stumbled on to a number of issues. I first tried to use the
NGramTokenizer, but that thing apparently only takes the first 1024
characters to tokenize. Having searched around the web, I came upon this
issue being discussed a couple of years ago and the proposed solution
there has been using the NGramTokenFilter. Now that filter certainly
works, but it needs an underlying tokenizer to work with, and I'm just
wondering if there is a tokenizer that would return me the whole text.
The reason I can't use something like the StandardTokenizer is that
ngrams should really include spaces and pretty much every tokenizer gets
rid of them.
Thank you very much in advance for any suggestions.
Regards,
Martin
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]