On Jul 17, 2010, at 22:23, Martin <mar...@webscio.net> wrote:
Hi there,
I'm trying to extend the PythonTokenizer class to build my own
custom tokenizer, but I get stuck almost immediately.
I know that I'm supposed to override the incrementToken() method, but
what exactly am I dealing with in there, and what should it return?
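For context, incrementToken() follows a cursor-style contract: each call either advances the stream to the next token (populating attributes such as the term text) and returns true, or returns false once the stream is exhausted. A rough pure-Python analogy of that contract (an illustration only, not the Lucene or PyLucene API):

```python
class ToyTokenStream:
    """Illustrates the incrementToken() contract: advance and return
    True while tokens remain, return False at end of stream.
    The 'term' field plays the role of Lucene's term attribute."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self.term = None  # populated on each successful advance

    def increment_token(self):
        try:
            self.term = next(self._tokens)
            return True
        except StopIteration:
            return False


stream = ToyTokenStream(["hello", "world"])
while stream.increment_token():
    print(stream.term)
```

In real Lucene code the token text is not a plain field but is written into the stream's term attribute before returning true; the loop shape on the consumer side is the same.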
My goal is to construct a tokenizer that returns fairly large
tokens, maybe whole sentences or even the entire content. The reason I
need this is that NGramTokenFilter needs a TokenStream to run on, but
every other tokenizer strips the whitespace from the text, and I need
n-grams that span across spaces :(
Thanks in advance for any hints!
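To illustrate why a single whole-content token would help here: NGramTokenFilter slides a character window over whatever token it receives, so a token that still contains spaces yields n-grams crossing word boundaries. A plain-Python sketch of that windowing (an illustration of the idea, not the Lucene implementation):

```python
def char_ngrams(text, min_n=2, max_n=3):
    """Generate character n-grams the way an n-gram filter would if
    handed the whole text as one token: every substring of length
    min_n..max_n, including substrings that cross spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


print(char_ngrams("to be", 2, 2))  # ['to', 'o ', ' b', 'be']
```

Note that 'o ' and ' b' span the whitespace, which is exactly what is lost when an upstream tokenizer splits on spaces first.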
Check out the Java Lucene javadocs, and ask again on java-u...@lucene.apache.org,
where many more Lucene expert users hang out. Subscribe first by
sending mail to java-user-subscribe and following the instructions in
the response.
Andi..