Philip Brown wrote:
Do you mean StandardTokenizer.jj (org.apache.lucene.analysis.standard)?  I'm
not seeing StandardAnalyzer.jj in the Lucene source download.
Mark Miller-5 wrote:
Philip Brown wrote:
Hi,

After running some tests using the StandardAnalyzer, and getting 0 results
from the search, I believe I need a special Tokenizer/Analyzer.  Does
anybody have something that parses like the following:

- doesn't parse apart phrases (in quotes)
- doesn't parse/separate hyphenated or underscored words

plus the other normal stuff, like:
- parses on whitespace
- removes periods in acronyms
- lowercases everything (even in quotes? -- maybe)

I basically have a set of terms, some of which are multi-worded phrases, but
none should ever be broken apart -- not when adding the documents, not when
querying the search results, etc.  I'm creating the field in the documents
as UN_TOKENIZED and using a StandardAnalyzer and basic Query object to get
the results.  Any suggestions and/or existing code that I could re-use to
fit this purpose?

Thanks.
Here is what I would do. Pull the Standard Analyzer out of Lucene. Modify StandardAnalyzer.jj. This is a JavaCC file. In it, there is some regex that defines tokens for parsing. Now try some steps similar to this: add '_' and '-' to the definition of a letter. Add a new token type that eats quoted phrases...look at queryparser.jj for an example, probably about halfway down the file (the <QUOTED> token). Now run JavaCC on the StandardAnalyzer.jj.

Search the mailing list when you find out that a ParseException is screwing up compilation. (I really wish someone would update that for the latest JavaCC, if indeed that is the problem. It's really annoying, and excluding it from compilation doesn't seem to fix it anymore.)
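
For illustration only, the token rules being asked for can be sketched in plain Java with no Lucene dependency. This is not the JavaCC grammar change itself, and PhraseTokenizerSketch, TOKEN and ACRONYM are made-up names, not Lucene classes. It assumes that phrases are delimited by double quotes and that an "acronym" means single letters each followed by a period. It only shows what the resulting terms might look like once the quoted-phrase token and the wider letter definition are in place:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: it only illustrates the token rules.
class PhraseTokenizerSketch {

    // Alternative 1: a double-quoted phrase (group 1 is the text inside the quotes).
    // Alternative 2: any run of non-whitespace, so '-' and '_' stay inside a word.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Assumption: an acronym is two or more single letters, each followed by a period.
    private static final Pattern ACRONYM = Pattern.compile("(?:\\p{L}\\.){2,}");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Prefer the quoted-phrase group; otherwise take the plain word.
            String token = m.group(1) != null ? m.group(1) : m.group(2);
            if (ACRONYM.matcher(token).matches()) {
                token = token.replace(".", ""); // "U.S.A." becomes "USA"
            }
            token = token.toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [new york, usa, well-known, snake_case, plain]
        System.out.println(tokenize("\"New York\" U.S.A. well-known snake_case Plain"));
    }
}
```

Note that this sketch leaves other punctuation attached to words and treats an unclosed quote as part of an ordinary word, so it is a starting point for checking expectations rather than a replacement for the modified tokenizer.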

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Yes. Standard Tokenizer. Sorry about that...my brain is schizo. StandardTokenizer.jj in the StandardAnalyzer package.

- Mark
