Hello Again:

I'm trying to figure out what filters do to terms in Lucene, specifically the StandardTokenizer and StandardFilter. While these are usually 'enough' for my work, I need to know exactly what happens to the tokens in this chain, how they are split, and so on, in order to make sure my indexes match my queries, which are being parsed/modified very specifically. I was tempted to write my own filter (something like MyCrazyFilter), but I hesitate to throw away the 'standards' for no reason.

I have also had a hard time finding information about writing your own Tokenizers and TokenFilters, beyond the fact that you can do this. Most of what I want to do is fairly simple, but I can't find much detail on how Lucene does it. Specifically, I want to ensure the following:

- tokens are broken at whitespace only, not at any other kind of mark
- tokens have no accents (I use a normalizer for this)
- tokens do not consist only of punctuation (I use a simple function for this)
- tokens have no 'oddball' leftovers (such as a word at the end of a sentence retaining its punctuation... I truncate this)

Thanks,
drago
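To be concrete about the four rules above, here is a rough plain-Java sketch of the behavior I'm after, written outside Lucene (the class name `TokenCleanupSketch` and the exact regexes are just my own illustration, not anything from Lucene itself): split on whitespace only, fold accents with `java.text.Normalizer`, drop punctuation-only tokens, and truncate trailing sentence punctuation.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the token rules described above; not a real Lucene TokenFilter.
public class TokenCleanupSketch {

    // Rule 1: break tokens at whitespace only, never at other marks.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String t = clean(raw);
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static String clean(String token) {
        // Rule 2: remove accents -- decompose to NFD, then strip combining marks.
        String folded = Normalizer.normalize(token, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}", "");
        // Rule 3: reject tokens that consist only of punctuation.
        if (folded.matches("\\p{Punct}+")) {
            return "";
        }
        // Rule 4: truncate trailing punctuation (the end-of-sentence case).
        return folded.replaceAll("\\p{Punct}+$", "");
    }
}
```

For example, `tokenize("Voilà! works.")` yields `[Voila, works]`: the accent is folded away, the trailing `!` and `.` are truncated, but nothing is split except at whitespace. If a chain of standard Lucene filters already guarantees this, that would be ideal; otherwise this is roughly what a custom filter would need to do.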
