Re: Installing a custom tokenizer

Mark Miller Tue, 29 Aug 2006 12:04:41 -0700

Bill Taylor wrote:

On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but... one of the constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you right up.
That almost worked. I can't use a per-field analyzer because I have to process the content fields of all documents. I built a custom analyzer which extended the StandardAnalyzer and replaced the tokenStream method with a new one which used WhitespaceTokenizer instead of StandardTokenizer. This meant that my document IDs were not split, but I lost the conversion of acronyms such as w.o. to wo and the like.
So what I need to do is to make a new Tokenizer based on the StandardTokenizer, except that NUM on line 83 of StandardTokenizer.jj should be
| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >
so that a serial number need not have a digit in every other segment, and a series of letters and digits without special characters such as a dash will be treated as a single word.
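
As a rough illustration of what that rule would accept, here is a minimal sketch using java.util.regex rather than JavaCC. The definitions of ALPHANUM (ASCII letters and digits) and P (one of _ - / . ,) are simplified assumptions, not copied from StandardTokenizer.jj, and the class name NumRuleSketch is hypothetical. It looks at the one rule in isolation and ignores how JavaCC picks among competing token rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumRuleSketch {
    // Assumed stand-ins for the grammar's building blocks (not Lucene code):
    // ALPHANUM = one or more ASCII letters or digits; P = one of _ - / . ,
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String P = "[_\\-/.,]";

    // NUM: ALPHANUM (P ALPHANUM)+ | ALPHANUM
    // The two alternatives collapse to ALPHANUM (P ALPHANUM)*.
    static final Pattern NUM =
            Pattern.compile(ALPHANUM + "(?:" + P + ALPHANUM + ")*");

    // Returns every maximal match of the NUM pattern in the text, in order.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = NUM.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [part, oldman-83, rev, AB-CD/12, done]
        System.out.println(tokens("part oldman-83 rev AB-CD/12 done."));
    }
}
```

Under these assumptions a serial number such as AB-CD/12 stays in one piece even though not every segment contains a digit, and a trailing period is not absorbed into the token.
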
Questions:
1) If I change the .jj file in this way, how do I run JavaCC to make a new tokenizer? The JavaCC documentation says that JavaCC generates a number of output files; I think that I only need the tokenizer code.
2) I suppose I have to tell the query parser to parse queries in the same way, is that right?
The reason I think so is that Luke says I have words such as w.o. in the index which the query parser can't find. I suspect I have to use the same Analyzer on both, right?

Get JavaCC and run it on StandardTokenizer.jj. This should be as simple as typing 'javacc StandardTokenizer.jj'... I believe that with no output folder specified, all of the files will be built in the current directory. Don't worry about trying to skip the files you do not need; JavaCC will handle everything for you. If you use Eclipse, I recommend the JavaCC plug-in. I find it very handy.

Generally you must run the same analyzer that you indexed with on your search strings. If the standard analyzer parses oldman-83 to oldman while indexing and you use the whitespace analyzer while searching, then you will attempt to find oldman-83 in the index instead of oldman (which was what the standard analyzer stored).
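
To make the mismatch concrete, here is a small self-contained sketch in plain Java. The two "analyzers" are toy stand-ins written for this example (one keeps only runs of letters, one splits on whitespace); they are assumptions for illustration and are not Lucene's StandardAnalyzer or WhitespaceAnalyzer, and the "index" is just a set of terms.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {
    // Toy analyzer A (hypothetical): lowercases and keeps only runs of letters,
    // so "oldman-83" becomes the single term "oldman".
    static List<String> lettersOnly(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Toy analyzer B (hypothetical): splits on whitespace and changes nothing else,
    // so "oldman-83" stays "oldman-83".
    static List<String> whitespaceOnly(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // True if the query produced at least one term and all of them are in the index.
    static boolean matches(Set<String> index, List<String> queryTerms) {
        return !queryTerms.isEmpty() && index.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Index the document with analyzer A: the stored terms are serial, oldman, shipped.
        Set<String> index = new HashSet<>(lettersOnly("Serial oldman-83 shipped"));

        // Query analyzed with B looks for "oldman-83", which was never stored.
        System.out.println(matches(index, whitespaceOnly("oldman-83"))); // prints false

        // Query analyzed with A looks for "oldman", which is in the index.
        System.out.println(matches(index, lettersOnly("oldman-83")));    // prints true
    }
}
```

The same reasoning applies in the other direction: terms like w.o. that were stored unchanged at index time will not be found if the query side rewrites them to wo.
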


- Mark
