Re: Wildcard query with untokenized punctuation (again)

Mark Miller Thu, 14 Jun 2007 12:06:57 -0700

All depends on what you are looking for. I'll try to give a hint as to what
is going on now:


When the QueryParser parses <<smith,ann>> it will shove that whole piece to
the analyzer. Your analyzer returns two tokens: smith and ann. When the
QueryParser sees that more than one token is returned from a piece that was
fed to the analyzer, it makes a PhraseQuery with each of the returned
tokens. Remember that the QueryParser feeds the analyzer in pieces, and then
creates queries based on the number of tokens produced from each piece (if
the piece even goes to the analyzer).
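
To make the piece-by-piece behavior concrete, here is a toy model in plain
Java. It is not Lucene code: ToyQueryParser, analyze and parse are
hypothetical names, the analyzer is a stand-in that keeps only runs of
letters and digits, and "a piece ending in * skips the analyzer" is a
simplification of what is described above.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyQueryParser {
    // Stand-in analyzer (assumption): keep runs of letters/digits, drop the rest.
    public static List<String> analyze(String piece) {
        List<String> tokens = new ArrayList<>();
        for (String t : piece.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Splits the query on whitespace and decides, per piece, what kind of
    // clause it becomes. Clauses are returned as descriptive strings.
    public static List<String> parse(String query) {
        List<String> clauses = new ArrayList<>();
        for (String piece : query.trim().split("\\s+")) {
            if (piece.isEmpty()) continue;
            if (piece.endsWith("*")) {
                // Wildcard pieces never reach the analyzer, punctuation and all.
                clauses.add("Prefix(" + piece.substring(0, piece.length() - 1) + ")");
                continue;
            }
            List<String> tokens = analyze(piece);
            if (tokens.size() == 1) {
                clauses.add("Term(" + tokens.get(0) + ")");
            } else if (tokens.size() > 1) {
                // More than one token from a single piece: phrase.
                clauses.add("Phrase(" + String.join(" ", tokens) + ")");
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("smith,ann"));   // [Phrase(smith ann)]
        System.out.println(parse("smith,ann*"));  // [Prefix(smith,ann)]
        System.out.println(parse("smith ann*"));  // [Term(smith), Prefix(ann)]
    }
}
```

The three calls in main line up with the three cases discussed in this
thread: the phrase, the prefix on the untokenized "smith,ann", and the
two separate clauses you get after preprocessing.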

Since you will be preprocessing the query, the query parser is going to be
parsing <<smith ann*>> which causes it to feed the analyzer smith and then
ann*...neither of these pieces produces more than one token (ann* doesn't
even go to the analyzer), so no PhraseQuery is produced. Instead you will
produce a BooleanQuery with the term smith and the wildcard query ann*, both
with an occur of whatever your default operator is.
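
For reference, the preprocessing pass could look something like the sketch
below. QueryPreprocessor and commasToSpaces are hypothetical names, and the
regex assumes every comma in the raw query should be treated as whitespace,
including commas inside quoted phrases, which may not be what you want.

```java
import java.util.regex.Pattern;

public class QueryPreprocessor {
    // A run of whitespace and commas containing at least one comma.
    private static final Pattern COMMAS = Pattern.compile("[\\s,]*,[\\s,]*");

    // Replaces each comma run with a single space before the query
    // string is handed to the query parser.
    public static String commasToSpaces(String rawQuery) {
        return COMMAS.matcher(rawQuery).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(commasToSpaces("smith,ann*")); // smith ann*
    }
}
```

The returned string is what you would then pass to the query parser.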

One thing I am wondering is if you even really want the query to be a
PhraseQuery or if you're just accepting the behavior you're getting from the
QueryParser. Right now, PhraseQueries do not support wildcards (nor do
MultiPhraseQueries). I don't think the support would be that difficult (use
a wildcard term enumerator to correctly fill out a MultiPhraseQuery), but it
might take some thought to get the QueryParser to act as you want (generate
a PhraseQuery or MultiPhraseQuery when it sees <<smith ann*>>).
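
A rough sketch of the expansion step, in plain Java rather than against the
Lucene API: a sorted set stands in for the index's term dictionary, and
PrefixExpansion and expand are hypothetical names. In Lucene the matching
terms would come from a term enumerator over the field, and the result
would be the set of alternatives at the second position of the
MultiPhraseQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansion {
    // Returns every term in the sorted dictionary that starts with the prefix.
    // tailSet(prefix) jumps to the first term >= prefix; since the set is
    // sorted, the scan can stop at the first term that no longer matches.
    public static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) break;
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up term dictionary for illustration.
        SortedSet<String> terms =
            new TreeSet<>(Arrays.asList("anna", "annanicole", "bob", "smith"));
        System.out.println(expand(terms, "ann")); // [anna, annanicole]
    }
}
```

With the expansion in hand, <<smith ann*>> would become "smith" at
position 0 followed by any of the expanded terms at position 1.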

Are you sure you need a PhraseQuery and not a Boolean query of Should
clauses?

- Mark

On 6/14/07, Renaud Waldura <[EMAIL PROTECTED]> wrote:


Thanks guys, I like it! I'm already applying some regexps before query
parsing anyway, so it's just another pass.

Now, I'm not sure how to do that without breaking another QP feature that I
kind of like: the query <<smith,ann>> is parsed to PhraseQuery("smith ann").
And that seems right, from a user standpoint.

In fact, considering this, I realize <<smith,ann*>> should be parsed to
MultiPhraseQuery("smith", "ann*"), not <<+smith +ann*>> as I said earlier.

Brrrr. Getting hairy. Any hope?

--Renaud



-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 14, 2007 6:43 AM
To: [email protected]
Subject: Re: Wildcard query with untokenized punctuation (again)

Gotta agree with Erick here...best idea is just to preprocess the query
before sending it to the QueryParser.

My first thought is always to get out the sledgehammer...

- Mark

Erick Erickson wrote:
> Well, perhaps the simplest thing would be to pre-process the query and
> make the comma into a whitespace before sending anything to the query
> parser. I don't know how generalizable that sort of solution is in
> your problem space though....
>
> Best
> Erick
>
> On 6/13/07, Renaud Waldura <[EMAIL PROTECTED]> wrote:
>>
>> My very simple analyzer produces tokens made of digits and/or letters
>> only. Anything else is discarded. E.g. the input "smith,anna" gets
>> tokenized as 2 tokens, first "smith" then "anna".
>>
>> Say I have indexed documents that contained both "smith,anna" and
>> "smith,annanicole". To find them, I enter the query <<smith,ann*>>.
>> The stock Lucene 2.0 query parser produces a PrefixQuery for the
>> single token "smith,ann". This token doesn't exist in my index, and I
>> don't get a match.
>>
>> I have found some references to this:
>>
>> http://www.nabble.com/Wildcard-query-with-untokenized-punctuation-tf3378386.html
>>
>> but I don't understand how I can fix it. Comma-separated terms like
>> this can appear in any field; I don't think I can create an
>> untokenized field.
>>
>> Really what I would like in this case is for the comma to be
>> considered whitespace, and the query to be parsed to <<+smith
>> +ann*>>. Any way I can do that?
>>
>> --Renaud
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


