Bill Taylor wrote:
On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
I'm in a real rush here, so pardon my brevity, but..... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which
can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should
fix you
right up.
that almost worked. I can't use a per Field analyzer because I have
to process the content fields of all documents. I built a custom
analyzer which extended the Standard Analyzer and replaced the
tokenStream method with a new one which used WhitespaceTokenizer
instead of StandardTokenizer. This meant that my document IDs were
not split, but I lost the conversion of acronyms such as w.o. to wo
and the like
So what I need to do is to make a new Tokenizer based on the
StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
should be
| NUM: (<ALPHANUM> (<P> <ALPHANUM>) + | <ALPHANUM>) >
so that a serial number need not have a digit in every other segment
and a series of letters and digits without special characters such as
a dash will be treated as a single word.
Questions:
1) If I change the .jj file in this way, how to I run javaCC to make a
new tokenizer? The JavaCC documentation says that JavaCC generates a
number of output files; I think that I only need the tokenizer code.
2) I suppose i have to tell the query parser to parse queries in the
same way, is that right?
The reason I think so is that Luke says I have words such as w.o. in
the index which the query parser can't find. I suspect I have to use
the same Analyzer on both, right?
Get JavaCC and run it on StandardTokenizer.jj. This should be as simple
as typing 'JavaCC StandardTokenizer.jj'...I believe with no output
folder specified all of the files will be built in the current
directory. Don't worry about not generating the ones you do not
need--JavaCC will handle everything for you. If you use Eclipse I
recommend the JavaCC plug-in. I find it very handy.
Generally you must run the same analyzer that you indexed with on your
search strings...if the standard analyzer parses oldman-83 to oldman
while indexing and you use whitespace analyzer while searching then you
will attempt to find oldman-83 in the index instead of oldman (which was
what standard analyzer stored).
- Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]