> I suspect that my issue is getting the Field constructor to use a
> different tokenizer. Can anyone help?

You basically need to write your own Tokenizer. (You can always write a corresponding JavaCC grammar; compiling it gives you the Tokenizer.) Then extend the org.apache.lucene.analysis.Analyzer class and override the tokenStream() method. Wherever you index or search, use an instance of this custom Analyzer.

    public class MyAnalyzer extends Analyzer {
        public TokenStream tokenStream(....) {
            TokenStream ts = null;
            ts = new MyTokenizer(reader);
            /* Pass this token stream through any other filters you are interested in */
            return ts;
        }
    }

Krovi.

-----Original Message-----
From: Bill Taylor [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 29, 2006 8:10 PM
To: java-user@lucene.apache.org
Subject: Installing a custom tokenizer

I am indexing documents which are filled with government jargon. As one would expect, the standard tokenizer has problems with governmentese. In particular, the documents use words such as 310N-P-Q as references to other documents. The standard tokenizer breaks this "word" at the dashes, so I can find P or Q but not the entire token.

I know how to write a new tokenizer. I would like hints on how to install it and get my indexing system to use it. I don't want to modify the standard .jar file.

What I think I want to do is set up my indexing operation to use the WhitespaceTokenizer instead of the normal one, but I am unsure how to do this. I know that the IndexTask has a setAnalyzer method.

The document formats are rather complicated, and I need special code to isolate the text strings which should be indexed. My file analyzer isolates the string I want to index, then does

    doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
                      Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a different tokenizer. Can anyone help?

Thanks.

Bill Taylor

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
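
For anyone reading this thread later, here is a rough sketch of the token-splitting rule discussed above, written in plain Java with no Lucene dependency so it can be tried on its own. The class name JargonTokenizerSketch and its tokenize() method are made up for illustration and are not part of Lucene. The punctuation trimming and lowercasing are extra assumptions, not something either poster asked for. In a real setup this logic would presumably live in the custom Tokenizer that the Analyzer's tokenStream() method returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a custom tokenizer; not a Lucene class.
class JargonTokenizerSketch {

    // Split on whitespace only, so hyphenated references such as
    // "310N-P-Q" survive as a single token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Assumption: strip punctuation from both ends of a token,
            // but leave interior dashes untouched.
            String token = raw.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            if (!token.isEmpty()) {
                // Assumption: lowercase so searches are case-insensitive.
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [see, 310n-p-q, section, 4]
        System.out.println(tokenize("See 310N-P-Q, section 4."));
    }
}
```

Whatever rule is chosen, the same Analyzer has to be used at index time and at query time. Otherwise a query for 310N-P-Q would be split differently from the indexed text and would not match.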