Well, I tried that, and it still doesn't seem to work. I would be happy to
zip up the new files so you can see what I'm using; maybe you can get it
to work. The first time, I built the documents without quotes
around each phrase. Then I retried, enclosing every phrase in
double quotes. Neither worked. When constructing the query string
for the search, I always added the double quotes (otherwise it would treat
the phrase as multiple terms). I haven't tested the underscore and
hyphenated terms yet. I thought Lucene was set up, more or less by default,
to search quoted phrases. From
http://lucene.apache.org/java/docs/api/index.html: A
Phrase is a group of words surrounded by double quotes such as "hello
dolly". So this should be easy, right? I must be missing something
stupid.
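
To make the tokenizer side concrete, here is a minimal sketch of what the QUOTED rule in the grammar quoted below would emit. It is not Lucene code: the class name QuotedTokenSketch and the regular expression are hypothetical stand-ins that only approximate two rules (QUOTED, and ALPHANUM with '-' and '_' treated as letters) and ignore the rest (ACRONYM, EMAIL, HOST, NUM, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the modified tokenizer; NOT the Lucene implementation.
class QuotedTokenSketch {

    // First alternative approximates QUOTED: a double quote, one or more
    // non-quote characters, then a closing double quote.
    // Second alternative approximates ALPHANUM with '-' and '_' added to LETTER.
    private static final Pattern TOKEN =
        Pattern.compile("\"[^\"]+\"|[\\p{L}\\p{Nd}_-]+");

    // Returns the matched tokens in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Quoted phrase: emitted as one token, surrounding quotes included.
        System.out.println(tokenize("say \"hello dolly\" to-day"));
        // Same words without quotes: two separate tokens.
        System.out.println(tokenize("hello dolly"));
    }
}
```

Under this approximation the quoted phrase becomes a single token that still contains the quote characters, while the unquoted words become two tokens. If the query parser removes the quotes before handing the text to the analyzer (an assumption worth checking, not something confirmed in this thread), the query would look for the terms hello and dolly while the index holds the single term "hello dolly", which could explain why neither attempt matched.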
Thanks,
Philip
Mark Miller-5 wrote:
>
> So this will recognize anything in quotes as a single token and '_' and
> '-' will not break up words. There may be some repercussions for the NUM
> token, but nothing I'd worry about. Maybe you want to use Unicode for '-'
> and '_' as well...I wouldn't worry about it myself.
>
> - Mark
>
>
> TOKEN : { // token patterns
>
> // basic word: a sequence of digits & letters
> <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
>
> | <QUOTED: "\"" (~["\""])+ "\"">
>
> // internal apostrophes: O'Reilly, you're, O'Reilly's
> // use a post-filter to remove possessives
> | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>
> // company names like AT&T and Excite@Home
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>
> // email addresses
> | <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
> (("."|"-") <ALPHANUM>)+ >
>
> // hostname
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>
> // floating point, serial, model numbers, ip addresses, etc.
> // every other segment must have at least one digit
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
> | <HAS_DIGIT> <P> <ALPHANUM>
> | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
> | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
> )
> >
> | <#P: ("_"|"-"|"/"|"."|",") >
> | <#HAS_DIGIT: // at least one digit
> (<LETTER>|<DIGIT>)*
> <DIGIT>
> (<LETTER>|<DIGIT>)*
> >
>
> | < #ALPHA: (<LETTER>)+>
> | < #LETTER: // unicode letters
> [
> "\u0041"-"\u005a",
> "\u0061"-"\u007a",
> "\u00c0"-"\u00d6",
> "\u00d8"-"\u00f6",
> "\u00f8"-"\u00ff",
> "\u0100"-"\u1fff",
> "-", "_"
> ]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
--
View this message in context:
http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
Sent from the Lucene - Java Users forum at Nabble.com.