Philip Brown wrote:
Hi,
After running some tests using the StandardAnalyzer, and getting 0 results
from the search, I believe I need a special Tokenizer/Analyzer. Does
anybody have something that parses like the following:
- doesn't parse apart phrases (in quotes)
- doesn't parse/separate hyphenated or underscored words
other normal stuff like
- parses on whitespace
- removes periods in acronyms
- lowercases everything (even in quotes? -- maybe)
I basically have a set of terms, some of which are multi-worded phrases, but
none should ever be broken apart -- not when adding the documents, not when
querying the search results, etc. I'm creating the field in the documents
as UN_TOKENIZED and using a StandardAnalyzer and basic Query object to get
the results. Any suggestions and/or existing code that I could re-use to
fit this purpose?
Thanks.
Here is what I would do. Pull the StandardAnalyzer source out of Lucene
and modify the JavaCC grammar behind it, StandardTokenizer.jj. That file
contains the regular expressions that define the tokens. Then try steps
similar to these: add '_' and '-' to the definition of a letter. Add a
new token type that eats quoted phrases; look at QueryParser.jj for an
example (the <QUOTED> token, probably about halfway down the file). Now
run JavaCC on the modified grammar. Search the mailing list when you find
that a ParseException is breaking compilation. (I really wish someone
would update that for the latest JavaCC, if that is indeed the problem.
It's really annoying, and excluding it from compilation doesn't seem to
fix it anymore.)
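
Before touching the grammar, it may help to pin down the token rules in
plain Java. The following is only a rough sketch with no Lucene
dependency. The class PhraseTokenizerSketch and its tokenize method are
made-up names for illustration, not Lucene API, and the acronym rule
(single letters each followed by a period) is my assumption about what
"removes periods in acronyms" should mean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: it only illustrates the token rules.
final class PhraseTokenizerSketch {

    // Splits on whitespace, keeps quoted phrases as one token, leaves '-' and '_'
    // inside words, lowercases everything, and strips periods from acronyms.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // A quote opens or closes a phrase; either way it ends the current token.
                flush(current, tokens);
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                // Whitespace separates tokens only outside a quoted phrase.
                flush(current, tokens);
            } else {
                current.append(c);
            }
        }
        flush(current, tokens); // also emits an unterminated quoted phrase
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() == 0) {
            return;
        }
        String token = current.toString().toLowerCase(Locale.ROOT);
        // Assumed acronym shape: two or more single letters, each followed by a period.
        if (token.matches("(?:[a-z]\\.){2,}")) {
            token = token.replace(".", "");
        }
        tokens.add(token);
        current.setLength(0);
    }
}
```

Calling PhraseTokenizerSketch.tokenize on a line of text returns the
tokens in order, with a quoted phrase coming back as a single token that
still contains its inner spaces. The same rules would still have to be
carried over into the grammar, or into a custom Tokenizer, before Lucene
could use them at index and query time.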
- Mark