Re: Modify the StandardAnalyzer

Clas Rydergren Sat, 06 Sep 2003 04:17:20 -0700

Hi again,

incze, I am not sure I follow you. Do you mean that I should index with the SimpleAnalyzer and search with the StandardAnalyzer? That is not an option for me, since the SimpleAnalyzer does not index numbers/digits the way I would like (the StandardAnalyzer, however, does!).
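
To make the behaviour I am after concrete, here is a minimal sketch in plain Java. It uses no Lucene classes; the class name DotSplitSketch and the method tokenize are hypothetical names made up for this illustration. It assumes that a token is any run of letters or digits, lowercased, so that "." and other punctuation act as separators while numbers are kept:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is NOT Lucene code and not how
// StandardTokenizer.jj is written. It just shows the tokens I would like.
public class DotSplitSketch {

    // Splits text into lowercase tokens made of letters and digits.
    // Every other character (".", ",", spaces, ...) ends the current token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Separator reached: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Emit the last token if the text did not end with a separator.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [mail, www, hotmail, com, room, 42]
        System.out.println(tokenize("Mail www.hotmail.com, room 42"));
    }
}
```

With this splitting, a search for "hotmail" would match the address, and "42" would still be indexed as a token of its own.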

I found that modifying StandardTokenizer.jj should make it possible to change the StandardAnalyzer to my preferred behaviour. After modifying the .jj file, I ran it through JavaCC to get the .java files. However, those .java files cannot be compiled together with the current distribution because of problems with the exception handling (see http://www.mail-archive.com/[EMAIL PROTECTED]/msg01403.html).

My next try was to compile Lucene (nightly snapshot) from scratch using Ant. That failed with "class collision" errors:

[javac] Compiling 128 source files to /home/clryd/lucene/jakarta-lucene/bin/classes
[javac] /home/clryd/lucene/jakarta-lucene/bin/src/org/apache/lucene/analysis/standard/StandardTokenizer.java:14: duplicate class: org.apache.lucene.analysis.standard.StandardTokenizer
[javac] public class StandardTokenizer extends org.apache.lucene.analysis.Tokenizer implements StandardTokenizerConstants {
[javac] ^
[javac] /home/clryd/lucene/jakarta-lucene/bin/src/org/apache/lucene/analysis/standard/StandardTokenizerTokenManager.java:5: duplicate class: org.apache.lucene.analysis.standard.StandardTokenizerTokenManager
[javac] public class StandardTokenizerTokenManager implements StandardTokenizerConstants
[javac] ^

So now my question is: where do I find a Lucene version that can be compiled? Where do I find the source CVS for Lucene 1.2 r3, or something similar?

cheers
/Clas

> Hi,
>
> I have been experimenting with Lucene for a few hours, and now I'm looking
> for a solution to this:
>
> When using the SimpleAnalyzer for indexing text, data like www.hotmail.com
> seems to be indexed as www, hotmail and com, which means that a search for
> "hotmail" will return a record. This is the behavior I am looking for!
> However, since the SimpleAnalyzer does not index numbers by default, I would
> like to use the StandardAnalyzer. But the StandardAnalyzer does not split the
> input stream at ".".
>
> Ideally I should probably make my own analyzer, but that seems to be a bit
> complicated to me :(. What is the simplest possible modification that I
> need to make to the Lucene source to make the StandardAnalyzer split, for
> example, web addresses at "." into separately indexed words?
>
> Can this be done by modifying StandardTokenizer.jj? How? What is
> the easiest way of getting such a modification into the "compiled" Lucene? Is
> there a need to recompile everything?
>
> Appreciate all help!
>
> regards
> clas

You can stack up the two analyzers: first run the simple, then the standard
on the output.
incze
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
