Hello Again:

I'm trying to figure out what filters do to terms in Lucene, specifically the StandardTokenizer and StandardFilter. While these are usually 'enough' for my work, I need to know exactly what happens to the tokens in this chain, how they are split, and so on, in order to make sure my indexes match my queries, which are being parsed/modified very specifically. I was tempted to write my own filter (something like MyCrazyFilter), but I hesitate to throw away the 'standards' for no reason.

I have also had a hard time finding information about writing your own Tokenizers and TokenFilters, beyond the fact that you can do this. Most of what I want to do is fairly simple, but I can't find much detail on how Lucene does it. Specifically, I want to ensure the following:

- tokens are broken at whitespace only, not at any other kind of mark
- tokens have no accents (I use a normalizer for this)
- tokens do not consist only of punctuation (I use a simple function for this)
- tokens have no 'oddball' leftovers (such as a word at the end of a sentence retaining its punctuation... I truncate this)

Thanks,
drago
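To be concrete about the four rules above, here is a rough plain-Java sketch of the behavior I'm after, written outside Lucene (the class name `TokenCleanupSketch` and the exact regexes are just my own illustration, not anything from Lucene itself): split on whitespace only, fold accents with `java.text.Normalizer`, drop punctuation-only tokens, and truncate trailing sentence punctuation.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the token rules described above; not a real Lucene TokenFilter.
public class TokenCleanupSketch {

    // Rule 1: break tokens at whitespace only, never at other marks.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String t = clean(raw);
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static String clean(String token) {
        // Rule 2: remove accents -- decompose to NFD, then strip combining marks.
        String folded = Normalizer.normalize(token, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}", "");
        // Rule 3: reject tokens that consist only of punctuation.
        if (folded.matches("\\p{Punct}+")) {
            return "";
        }
        // Rule 4: truncate trailing punctuation (the end-of-sentence case).
        return folded.replaceAll("\\p{Punct}+$", "");
    }
}
```

For example, `tokenize("Voilà! works.")` yields `[Voila, works]`: the accent is folded away, the trailing `!` and `.` are truncated, but nothing is split except at whitespace. If a chain of standard Lucene filters already guarantees this, that would be ideal; otherwise this is roughly what a custom filter would need to do.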
