On Jul 17, 2010, at 22:23, Martin <mar...@webscio.net> wrote:
Hi there,
I'm trying to extend the PythonTokenizer class to build my own
custom tokenizer, but I get stuck almost immediately.
I know that I'm supposed to override the incrementToken() method, but
what exactly am I dealing with in there, and what should it return?
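For context, incrementToken() follows a cursor-style contract: each call either advances the stream to the next token (populating attributes such as the term text) and returns true, or returns false once the stream is exhausted. A rough pure-Python analogy of that contract (an illustration only, not the Lucene or PyLucene API):

```python
class ToyTokenStream:
    """Illustrates the incrementToken() contract: advance and return
    True while tokens remain, return False at end of stream.
    The 'term' field plays the role of Lucene's term attribute."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self.term = None  # populated on each successful advance

    def increment_token(self):
        try:
            self.term = next(self._tokens)
            return True
        except StopIteration:
            return False


stream = ToyTokenStream(["hello", "world"])
while stream.increment_token():
    print(stream.term)
```

In real Lucene code the token text is not a plain field but is written into the stream's term attribute before returning true; the loop shape on the consumer side is the same.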
My goal is to construct a tokenizer that returns fairly large
tokens, maybe whole sentences or even the entire content. The reason I
need this is that NGramTokenFilter needs a TokenStream to run on, but
every other tokenizer strips the whitespace from the text, and I need
n-grams that span across spaces :(
Thanks in advance for any hints!
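To illustrate why a single whole-content token would help here: NGramTokenFilter slides a character window over whatever token it receives, so a token that still contains spaces yields n-grams crossing word boundaries. A plain-Python sketch of that windowing (an illustration of the idea, not the Lucene implementation):

```python
def char_ngrams(text, min_n=2, max_n=3):
    """Generate character n-grams the way an n-gram filter would if
    handed the whole text as one token: every substring of length
    min_n..max_n, including substrings that cross spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


print(char_ngrams("to be", 2, 2))  # ['to', 'o ', ' b', 'be']
```

Note that 'o ' and ' b' span the whitespace, which is exactly what is lost when an upstream tokenizer splits on spaces first.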
Check out the Java Lucene javadocs, and ask again on java-u...@lucene.apache.org,
where many more Lucene expert users hang out. Subscribe first by
sending mail to java-user-subscribe and following the instructions in
the response.
Andi..