Bill Taylor wrote:

On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:

I'm in a real rush here, so pardon my brevity, but... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.

That almost worked. I can't use a per-field analyzer because I have to process the content fields of all documents. I built a custom analyzer which extended StandardAnalyzer and replaced the tokenStream method with a new one which used WhitespaceTokenizer instead of StandardTokenizer. This meant that my document IDs were not split, but I lost the conversion of acronyms such as w.o. to wo and the like.

So what I need to do is to make a new Tokenizer based on the StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj should be

| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >

so that a serial number need not have a digit in every other segment and a series of letters and digits without special characters such as a dash will be treated as a single word.
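To see roughly what the proposed rule would accept, here is a minimal sketch that emulates it with java.util.regex instead of JavaCC. It is an approximation only: the class and method names are made up for illustration, ALPHANUM is assumed to mean a run of letters or digits, P is assumed to be one of _ - / . , and the other token rules in StandardTokenizer.jj (acronyms, hosts, e-mail addresses and so on) are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; this is not Lucene code.
class SerialTokenSketch {
    // Assumed ALPHANUM: one or more letters or digits.
    private static final String ALPHANUM = "[\\p{L}\\p{Nd}]+";
    // Assumed P: the punctuation allowed between segments.
    private static final String P = "[_\\-/.,]";
    // ALPHANUM (P ALPHANUM)+ | ALPHANUM, folded into a single expression.
    private static final Pattern NUM =
        Pattern.compile(ALPHANUM + "(?:" + P + ALPHANUM + ")*");

    // Returns every substring of the text that the emulated rule matches.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = NUM.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [part, AB-123-XY, shipped, ok]
        System.out.println(tokenize("part AB-123-XY shipped, ok"));
    }
}
```

The serial number stays in one piece even though some segments contain no digit, and the trailing comma after "shipped" is dropped because no ALPHANUM follows it.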

Questions:

1) If I change the .jj file in this way, how do I run JavaCC to make a new tokenizer? The JavaCC documentation says that JavaCC generates a number of output files; I think that I only need the tokenizer code.

2) I suppose I have to tell the query parser to parse queries in the same way. Is that right?

The reason I think so is that Luke says I have words such as w.o. in the index which the query parser can't find. I suspect I have to use the same Analyzer on both, right?

Get JavaCC and run it on StandardTokenizer.jj. This should be as simple as typing 'javacc StandardTokenizer.jj'. I believe that with no output folder specified, all of the files will be built in the current directory. Don't worry about not generating the ones you do not need--JavaCC will handle everything for you. If you use Eclipse I recommend the JavaCC plug-in. I find it very handy.

Generally you must run the same analyzer on your search strings that you indexed with. If the standard analyzer parses oldman-83 to oldman while indexing and you use the whitespace analyzer while searching, then you will attempt to find oldman-83 in the index instead of oldman (which is what the standard analyzer stored).
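As a rough illustration of that mismatch, here is a toy sketch in plain Java with no Lucene involved. Everything in it is a stand-in invented for the example: strippingAnalyzer mimics an analyzer that would reduce oldman-83 to oldman, whitespaceAnalyzer only splits on whitespace, and the "index" is just a set of terms.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model of index-time vs. query-time analysis.
class AnalyzerMismatchSketch {
    // Splits on whitespace and leaves each term untouched.
    static List<String> whitespaceAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Keeps only the leading letters of each term, lowercased,
    // so "oldman-83" becomes "oldman".
    static List<String> strippingAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : whitespaceAnalyzer(text)) {
            String stripped = t.replaceFirst("[^\\p{L}].*$", "").toLowerCase(Locale.ROOT);
            if (!stripped.isEmpty()) {
                terms.add(stripped);
            }
        }
        return terms;
    }

    // A query matches only if every analyzed query term is in the index.
    static boolean matches(Set<String> index, List<String> queryTerms) {
        return !queryTerms.isEmpty() && index.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>(strippingAnalyzer("report for oldman-83"));
        // Different analyzer at query time: looks for "oldman-83", prints false.
        System.out.println(matches(index, whitespaceAnalyzer("oldman-83")));
        // Same analyzer at query time: looks for "oldman", prints true.
        System.out.println(matches(index, strippingAnalyzer("oldman-83")));
    }
}
```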

- Mark
