[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870376#action_12870376 ]
Mark Miller commented on LUCENE-2458:
-------------------------------------

{quote}
It seems to me making this behavior available with Version is the right way to go. I don't care if people call it a bug or a good default for US text - what counts is to give people a good default no matter what they index or where they come from. (sounds like this is close to discriminating against people - just kidding)
{quote}

Using Version or not is orthogonal to what the default is IMO though. That's why it's important whether it's considered a bug or an option - Version is not a good option selector at all.

This is part of the goodness of stable/unstable - default options can change in unstable.

> queryparser shouldn't generate phrasequeries based on term count
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Blocker
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2458.patch, LUCENE-2458.patch
>
>
> The current method in the queryparser to generate phrasequeries is wrong:
> The Query Syntax documentation
> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
> {noformat}
> A Phrase is a group of words surrounded by double quotes such as "hello dolly".
> {noformat}
> But as we know, this isn't actually true.
> Instead the terms are first divided on whitespace, then the analyzer term
> count is used as some sort of "heuristic" to determine if it's a phrase query
> or not.
> This assumption is a disaster for languages that don't use whitespace
> separation: CJK, compounding European languages like German, Finnish, etc. It
> also makes it difficult for people to use n-gram analysis techniques. In these
> cases you get bad relevance (MAP improves nearly *10x* if you use a
> PositionFilter at query-time to "turn this off" for Chinese).
> Even for English, this undocumented behavior is bad. Perhaps in some cases
> it's being abused as some heuristic to "second guess" the tokenizer and piece
> back things it shouldn't have split, but for large collections, doing things
> like generating phrasequeries because StandardTokenizer split a compound on a
> dash can cause serious performance problems. Instead people should analyze
> their text with the appropriate methods, and QueryParser should only generate
> phrase queries when the syntax asks for one.
> The PositionFilter in contrib can be seen as a workaround, but it's pretty
> obscure and people are not familiar with it. The result is we have bad
> out-of-box behavior for many languages, and bad performance for others on
> some inputs.
> I propose instead that we change the grammar to actually look for double
> quotes to determine when to generate a phrase query, consistent with the
> documentation.
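
For illustration, the difference between the term-count heuristic described in the issue and the proposed quote-driven behavior can be sketched in plain Java. This is a toy model under stated assumptions, not Lucene code: `PhraseHeuristicSketch`, `analyze`, and `parse` are hypothetical names, the analyzer stand-in simply lowercases and splits on non-alphanumeric characters, and emitting separate terms for an unquoted multi-token chunk is only one guess at what the replacement behavior might look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: none of these names are Lucene APIs.
class PhraseHeuristicSketch {

    // A clause is either a double-quoted span or a whitespace-delimited chunk.
    private static final Pattern CLAUSE = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Stand-in for an analyzer: lowercase, then split on anything that is
    // not a letter or digit (so "wi-fi" yields two tokens).
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // phraseOnlyWhenQuoted == false: term-count heuristic from the issue
    //   (any chunk that analyzes to more than one token becomes a phrase).
    // phraseOnlyWhenQuoted == true: assumed proposed behavior
    //   (a phrase is produced only when the syntax used double quotes).
    static List<String> parse(String query, boolean phraseOnlyWhenQuoted) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            boolean quoted = m.group(1) != null;
            List<String> tokens = analyze(quoted ? m.group(1) : m.group(2));
            if (tokens.isEmpty()) {
                continue;
            }
            boolean phrase = tokens.size() > 1 && (quoted || !phraseOnlyWhenQuoted);
            if (phrase) {
                clauses.add("phrase(" + String.join(" ", tokens) + ")");
            } else {
                for (String t : tokens) {
                    clauses.add("term(" + t + ")");
                }
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Heuristic: the dash split silently creates a phrase query.
        System.out.println(parse("wi-fi router", false));   // [phrase(wi fi), term(router)]
        // Quote-driven: no quotes, so no phrase query.
        System.out.println(parse("wi-fi router", true));    // [term(wi), term(fi), term(router)]
        // Explicit quotes still produce a phrase in both modes.
        System.out.println(parse("\"hello dolly\"", true)); // [phrase(hello dolly)]
    }
}
```

The same mechanism is what hurts unsegmented text: a single whitespace-delimited chunk that the analyzer breaks into many tokens is turned into one long phrase under the heuristic, whether or not the user asked for one.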