Re: Installing a custom tokenizer

Erick Erickson Tue, 29 Aug 2006 10:46:34 -0700

I'm in a real rush here, so pardon my brevity, but one of the
constructors for IndexWriter takes an Analyzer as a parameter, and that
Analyzer can be a PerFieldAnalyzerWrapper. If I understand your issue,
that should fix you right up.


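Same kind of thing for a Query: pass the same Analyzer to the QueryParser
so that search terms are tokenized the same way as the indexed text.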
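
To make the per-field idea concrete, here is a minimal sketch in plain Java.
It is not Lucene code and does not use any Lucene classes: it only models
what a per-field wrapper does, which is to pick a tokenizing strategy by
field name and fall back to a default for every other field. The class name
PerFieldTokenizing, both tokenizer methods, and the field names "content"
and "title" are hypothetical. The alphanumeric splitter is a simplified
stand-in for the dash-splitting behaviour described in the question, not a
reimplementation of the standard tokenizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of per-field analysis; not part of Lucene.
class PerFieldTokenizing {

    // Splits only on runs of whitespace, so "310N-P-Q" stays one token.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Simplified stand-in for a tokenizer that breaks at punctuation:
    // splits on anything that is not an ASCII letter or digit, then lowercases.
    static List<String> alnumTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase());
        }
        return out;
    }

    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    // The fallback tokenizer is used for any field without its own entry.
    PerFieldTokenizing(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Registers a tokenizer for one named field.
    void addTokenizer(String field, Function<String, List<String>> tokenizer) {
        perField.put(field, tokenizer);
    }

    // Dispatches on the field name, as a per-field wrapper would.
    List<String> tokenize(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    public static void main(String[] args) {
        PerFieldTokenizing analyzer = new PerFieldTokenizing(PerFieldTokenizing::alnumTokens);
        analyzer.addTokenizer("content", PerFieldTokenizing::whitespaceTokens);

        // prints [See, 310N-P-Q, for, details]
        System.out.println(analyzer.tokenize("content", "See 310N-P-Q for details"));
        // prints [see, 310n, p, q, for, details]
        System.out.println(analyzer.tokenize("title", "See 310N-P-Q for details"));
    }
}
```

In Lucene itself the corresponding pieces would be a PerFieldAnalyzerWrapper
built around a default Analyzer, with a whitespace-based Analyzer registered
for the content field, and the wrapper handed to the IndexWriter. Nothing
inside the standard .jar has to change.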

Erick

On 8/29/06, Bill Taylor <[EMAIL PROTECTED]> wrote:


I am indexing documents which are filled with government jargon.  As
one would expect, the standard tokenizer has problems with
governmentese.

In particular, the documents use words such as 310N-P-Q as references
to other documents.  The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.

I know how to write a new tokenizer.  I would like hints on how to
install it and get my indexing system to use it.  I don't want to
modify the standard .jar file.  What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method.  The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed.   My file analyzer isolates the
string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer.  Can anyone help?

Thanks.

Bill Taylor

