Re: Phrase search using quotes -- special Tokenizer

Mark Miller Fri, 01 Sep 2006 04:53:11 -0700

Philip Brown wrote:

Hi,


After running some tests using the StandardAnalyzer, and getting 0 results
from the search, I believe I need a special Tokenizer/Analyzer.  Does
anybody have something that parses like the following:

- doesn't parse apart phrases (in quotes)
- doesn't parse/separate hyphenated or underscored words
other normal stuff like
- parses on whitespace
- removes periods in acronyms
- lowercases everything (even in quotes? -- maybe)

I basically have a set of terms, some of which are multi-worded phrases, but
none should ever be broken apart -- not when adding the documents, not when
querying the search results, etc.  I'm creating the field in the documents
as UN_TOKENIZED and using a StandardAnalyzer and basic Query object to get
the results.  Any suggestions and/or existing code that I could re-use to
fit this purpose?

Thanks.

Here is what I would do. Pull the StandardAnalyzer out of Lucene and
modify StandardTokenizer.jj. This is a JavaCC file. In it, there are
some regexes that define the tokens for parsing. Now try some steps
similar to these: add '_' and '-' to the definition of a letter. Add a
new token type that eats quoted phrases... look at QueryParser.jj for an
example, probably about halfway down the file, at <QUOTED>. Now run
JavaCC on StandardTokenizer.jj. Search the mailing list when you find
out that a ParseException is screwing up compilation (I really wish
someone would update that for the latest JavaCC, if indeed that is the
problem. It's really annoying, and excluding it from compilation doesn't
seem to fix it anymore).
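
Before touching the grammar, it may help to pin down the token rules in
plain Java so you know exactly what output you expect. The following is
only a rough sketch, not Lucene code: the class name PhraseAwareTokenizer
and its tokenize method are made up for illustration, the acronym rule is
a simple guess (two or more letter-plus-period pairs), and it would still
have to be wrapped in a real Tokenizer/Analyzer before Lucene could use it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: a plain-Java model of the
// token rules asked for in the original question.
final class PhraseAwareTokenizer {

    // Splits on whitespace, except inside double quotes, where the whole
    // phrase becomes a single token. '-' and '_' are never split on.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // A quote opens or closes a phrase; flush what came before it.
                flush(current, tokens, inQuotes);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, tokens, false);
            } else {
                current.append(c);
            }
        }
        // An unterminated quote keeps the rest of the input as one phrase.
        flush(current, tokens, inQuotes);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens, boolean phrase) {
        if (current.length() == 0) {
            return;
        }
        // Lowercase everything, quoted or not.
        String text = current.toString().toLowerCase(Locale.ROOT);
        current.setLength(0);
        if (phrase) {
            // Normalize whitespace inside the phrase but keep it as one token.
            text = text.trim().replaceAll("\\s+", " ");
        } else if (text.matches("(?:[a-z]\\.){2,}")) {
            // Assumed acronym shape such as "i.b.m.": drop the periods.
            text = text.replace(".", "");
        }
        if (!text.isEmpty()) {
            tokens.add(text);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I.B.M. makes \"Hard Disk\" state-of-the-art under_score"));
    }
}
```

Whatever you end up generating from the .jj file should produce the same
tokens at index time and at query time, otherwise the terms will not match.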


- Mark

