[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Hoss Man (JIRA) Tue, 11 May 2010 14:56:06 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866363#action_12866363 ]


Hoss Man commented on LUCENE-2458:
----------------------------------

bq. a Boolean Query formed with the default operator.

That seems like equally bad default behavior -- lots of existing TokenFilters 
produce chains of tokens for situations where the user creating the query 
string clearly intended to be searching for a single "word" and has no idea 
that as an implementation detail multiple tokens were produced under the covers 
(e.g. WordDelimiterFilter, Ngrams, etc...)

I haven't thought this through very well, but perhaps this is an area where 
(the new) Token Attributes could be used to instruct QueryParser as to the 
intent behind a stream of multiple tokens?  A new Attribute could be used on 
each token to convey when that token should be combined with the previous 
token, and in what way: as a phrase, as a conjunction or as a disjunction.  
(this could still be orthogonal to the position, which would indicate slop/span 
type information like it does currently)

Stock Analysis components that produce multiple tokens could be modified to add 
this attribute fairly easily (it should be a relatively static value for any 
component that currently "splits" tokens) and QueryParser could have an option 
controlling what to do if it encounters a token without this attribute (perhaps 
even two options: one for quoted input chunks and one for unquoted input 
chunks).
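
To make the idea concrete, here is a rough sketch in plain Java of how such a per-token hint and a parser-side fallback might interact. It does not use any Lucene API: the names JoinHintSketch, Join, Tok and describe are all hypothetical, and a real version would presumably be a Token Attribute set by the analysis components rather than a field on a stand-in class.

```java
import java.util.Arrays;
import java.util.List;

public class JoinHintSketch {
    /** Hypothetical hint: how a token combines with the token before it. */
    enum Join { PHRASE, AND, OR, UNSPECIFIED }

    /** Stand-in for a term plus the proposed attribute; not a Lucene class. */
    static final class Tok {
        final String term;
        final Join join;
        Tok(String term, Join join) { this.term = term; this.join = join; }
    }

    /**
     * Describes how the tokens of one analyzed chunk would be combined.
     * The fallback stands in for the proposed parser option and applies
     * to tokens whose analyzer did not set a hint.
     */
    static String describe(List<Tok> toks, Join fallback) {
        if (fallback == Join.UNSPECIFIED) {
            throw new IllegalArgumentException("fallback must be a concrete join");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < toks.size(); i++) {
            Tok t = toks.get(i);
            if (i > 0) {
                // The first token has no predecessor, so its hint is ignored.
                Join j = (t.join == Join.UNSPECIFIED) ? fallback : t.join;
                sb.append(' ').append(j).append(' ');
            }
            sb.append(t.term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Tok> hinted = Arrays.asList(
            new Tok("wi", Join.UNSPECIFIED), new Tok("fi", Join.PHRASE));
        List<Tok> unhinted = Arrays.asList(
            new Tok("wi", Join.UNSPECIFIED), new Tok("fi", Join.UNSPECIFIED));
        System.out.println(describe(hinted, Join.OR));   // hint wins
        System.out.println(describe(unhinted, Join.OR)); // fallback applies
    }
}
```

In this sketch a splitting filter would mark its second token as PHRASE, while tokens from an older analyzer carry no hint and fall through to whatever the parser option says. A real parser might keep two such fallbacks, one for quoted chunks and one for unquoted chunks.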

That way the default could still work in a backward-compatible way, but people 
using languages that don't use whitespace separation *and* are using older (or 
custom) analyzers that don't know about this attribute could set a simple query 
parser property to force this behavior.

Would that make sense? (asks the man who only vaguely understands Token 
Attributes at this point)

> queryparser shouldn't generate phrasequeries based on term count
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>            Reporter: Robert Muir
>            Priority: Critical
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation 
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello 
> dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term 
> count is used as some sort of "heuristic" to determine if it's a phrase query 
> or not.
> This assumption is a disaster for languages that don't use whitespace 
> separation: CJK, compounding European languages like German, Finnish, etc. It 
> also makes it difficult for people to use n-gram analysis techniques. In these 
> cases you get bad relevance (MAP improves nearly *10x* if you use a 
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases 
> it's being abused as some heuristic to "second guess" the tokenizer and piece 
> back things it shouldn't have split, but for large collections, doing things 
> like generating phrasequeries because StandardTokenizer split a compound on a 
> dash can cause serious performance problems. Instead people should analyze 
> their text with the appropriate methods, and QueryParser should only generate 
> phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty 
> obscure and people are not familiar with it. The result is we have bad 
> out-of-box behavior for many languages, and bad performance for others on 
> some inputs.
> I propose instead that we change the grammar to actually look for double 
> quotes to determine when to generate a phrase query, consistent with the 
> documentation.
