RE: Wildcard query with untokenized punctuation (again)

Renaud Waldura Thu, 14 Jun 2007 10:29:55 -0700

Thanks guys, I like it! I'm already applying some regexps before query
parsing anyway, so it's just another pass.


Now, I'm not sure how to do that without breaking another QP feature that I
kind of like: the query <<smith,ann>> is parsed to PhraseQuery("smith ann").
And that seems right, from a user standpoint.

In fact, considering this, I realize <<smith,ann*>> should be parsed to
MultiPhraseQuery("smith", "ann*"), not <<+smith +ann*>> as I said earlier.

Brrrr. Getting hairy. Any hope?

--Renaud



-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 14, 2007 6:43 AM
To: [email protected]
Subject: Re: Wildcard query with untokenized punctuation (again)

Gotto agree with Erick here...best idea is just to preprocess the query
before sending it to the QueryParser.

My first thought is always to get out the sledgehammer...

- Mark

Erick Erickson wrote:
> Well, perhaps the simplest thing would be to pre-process the query and 
> make the comma into a whitespace before sending anything to the query 
> parser. I don't know how generalizable that sort of solution is in 
> your problem space though....
>
> Best
> Erick
>
> On 6/13/07, Renaud Waldura <[EMAIL PROTECTED]> wrote:
>>
>> My very simple analyzer produces tokens made of digits and/or letters 
>> only.
>> Anything else is discarded. E.g. the input "smith,anna" gets 
>> tokenized as
>> 2
>> tokens, first "smith" then "anna".
>>
>> Say I have indexed documents that contained both "smith,anna" and 
>> "smith,annanicole". To find them, I enter the query <<smith,ann*>>. 
>> The stock Lucene 2.0 query parser produces a PrefixQuery for the 
>> single token "smith,ann". This token doesn't exist in my index, and I 
>> don't get a match.
>>
>> I have found some references to this:
>>
>> http://www.nabble.com/Wildcard-query-with-untokenized-punctuation-tf3
>> 378386
>>
>> .
>> html
>> but I don't understand how I can fix it. Comma-separated terms like 
>> this can appear in any field; I don't think I can create an 
>> untokenized field.
>>
>> Really what I would like in this case is for the comma to be 
>> considered whitespace, and the query to be parsed to <<+smith 
>> +ann*>>. Any way I can do that?
>>
>> --Renaud
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Wildcard query with untokenized punctuation (again)

Reply via email to