[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870384#action_12870384 ]

Shai Erera commented on LUCENE-2458:

bq. There will be tons of different opinions to that around linguists around the world but this parser is not for linguists in the first place.

Sure. I've pointed that out just to show there are different opinions around that particular problem.

bq. It seems to me making this behavior available with Version is the right way to go

I disagree (and agree w/ Mark). Version can only control default behavior. This particular issue should be a setter IMO. Regardless of what the default behavior is, I may want to set it differently. It doesn't make sense to *guess* which Version I should use in order to get that behavior. That's why I don't mind leaving the current behavior as the default and introducing a setter for whoever wants to change it.

The current behavior is not applicable to just English - I bet there's a whole list of languages which would interpret that query the same way (i.e. require a phrase to be generated). And I don't know the distribution of Lucene users around the world, but I'm not sure that CJK users are more common than, say, English ones, or users of other European languages. So who knows what a good default is? :)

I suggest we leave the default as it is now, and introduce a setter. People have been working w/ the parser and that default for a long time. Why suddenly change it?
queryparser shouldn't generate phrasequeries based on term count

Key: LUCENE-2458
URL: https://issues.apache.org/jira/browse/LUCENE-2458
Project: Lucene - Java
Issue Type: Bug
Components: QueryParser
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Blocker
Fix For: 3.1, 4.0
Attachments: LUCENE-2458.patch, LUCENE-2458.patch

The current method in the queryparser to generate phrasequeries is wrong:

The Query Syntax documentation (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
{noformat}
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
{noformat}

But as we know, this isn't actually true.

Instead the terms are first divided on whitespace, then the analyzer term count is used as some sort of heuristic to determine if it's a phrase query or not. This assumption is a disaster for languages that don't use whitespace separation: CJK, compounding European languages like German, Finnish, etc. It also makes it difficult for people to use n-gram analysis techniques. In these cases you get bad relevance (MAP improves nearly *10x* if you use a PositionFilter at query-time to turn this off for Chinese).

Even for English, this undocumented behavior is bad. Perhaps in some cases it's being abused as some heuristic to second guess the tokenizer and piece back things it shouldn't have split, but for large collections, doing things like generating phrasequeries because StandardTokenizer split a compound on a dash can cause serious performance problems. Instead people should analyze their text with the appropriate methods, and QueryParser should only generate phrase queries when the syntax asks for one.

The PositionFilter in contrib can be seen as a workaround, but it's pretty obscure and people are not familiar with it. The result is we have bad out-of-box behavior for many languages, and bad performance for others on some inputs.
I propose instead that we change the grammar to actually look for double quotes to determine when to generate a phrase query, consistent with the documentation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
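The difference between the term-count heuristic and the proposed quotes-only behavior can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use Lucene, the class and method names are hypothetical, and the "analyzer" is a stand-in that merely lowercases and splits on dashes. Explicitly quoted input is not handled at all.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two behaviours discussed in the issue; NOT Lucene's QueryParser. */
final class PhraseHeuristicSketch {

    /** Hypothetical stand-in for an analyzer: lowercases a chunk and splits it on '-'. */
    static List<String> analyze(String chunk) {
        List<String> terms = new ArrayList<>();
        for (String t : chunk.toLowerCase().split("-")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Term-count heuristic: a whitespace chunk that yields more than one term becomes a phrase. */
    static List<String> termCountHeuristic(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            List<String> terms = analyze(chunk);
            if (terms.size() > 1) {
                // Rendered with quotes here to mark it as a phrase clause.
                clauses.add("\"" + String.join(" ", terms) + "\"");
            } else {
                clauses.addAll(terms);
            }
        }
        return clauses;
    }

    /** Proposed behaviour for unquoted input: every analyzed term is its own clause. */
    static List<String> quotesOnly(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            clauses.addAll(analyze(chunk));
        }
        return clauses;
    }
}
```

In this model the unquoted input `wi-fi router` turns into a phrase clause plus a term under the heuristic, and into three independent term clauses under the quotes-only rule. An analyzer that emits one term per CJK character or per n-gram would hit the phrase branch for essentially every chunk, which is the case the issue complains about.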
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
I can't tell if you are being obnoxious or seriously believe what you say.

You understand that cjkanalyzer is broken with this? You understand that ngrams themselves capture information about position, and that this even works nicely with scoring, and helps?

This hack doesn't help English. If you think otherwise, be a man and show real results.
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Obnoxiousness has certainly been in the air regarding this issue, I'll give you that.

On Sunday, May 23, 2010, Robert Muir rcm...@gmail.com wrote:

I can't tell if you are being obnoxious or seriously believe what you say. You understand that cjkanalyzer is broke with this? You understand that ngrams themselves capture information about position and it even works nicely with scoring, and helps. This hack doesn't help english. If you think otherwise, be a man and show real results

--
- Mark
http://www.lucidimagination.com
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410 ]

Uwe Schindler commented on LUCENE-2458:

Hi Robert,

I also agree with Mark (as you know). We can have both:
- Version for a good default (3.1 will get the new non-phrase-query behavior)
- A separate getter/setter for this option (set/getCreatePhraseQueryOnConcenattedTerms or whatever)

This would give you the best of both worlds.
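The "Version for the default, getter/setter for the override" idea can be sketched as follows. This is a minimal illustration with hypothetical names (`PhraseDefaultSketch`, `Release`, `setAutoPhrase`); it is not Lucene's API, and the rule that releases before 3.1 keep the old behavior is taken from the comment above as an assumption.

```java
/** Toy model of "the version picks the default, a setter overrides it"; NOT Lucene code. */
final class PhraseDefaultSketch {

    /** Hypothetical stand-in for a compatibility version constant. */
    enum Release { V3_0, V3_1 }

    private final Release release;

    // null means "no explicit choice was made": fall back to the release default.
    private Boolean autoPhraseOverride;

    PhraseDefaultSketch(Release release) {
        this.release = release;
    }

    /** Assumed default: releases before 3.1 keep the old phrase-generating behaviour. */
    private boolean defaultAutoPhrase() {
        return release.compareTo(Release.V3_1) < 0;
    }

    /** An explicit setting always wins over the version-derived default. */
    void setAutoPhrase(boolean enabled) {
        this.autoPhraseOverride = enabled;
    }

    boolean isAutoPhrase() {
        return autoPhraseOverride != null ? autoPhraseOverride : defaultAutoPhrase();
    }
}
```

With this shape, a caller who wants the old behavior on a newer release calls the setter once instead of passing an older version constant, which is the concern raised earlier in the thread about having to guess a Version.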
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Robert - does the effect on scoring also apply to English and other European languages? Or is it mostly for ngram-based languages, and especially CJK?

I want to stress that not all ngram-based languages are affected by this behavior, especially those for which we do ngram just because of the lack of a good tokenizer.

That's why I'm not sure the default should be changed and I'm all for a getter/setter. If however it turns out the default MUST be changed, then I support the Version + getter/setter approach.

Shai
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Same here, as already noted in the issue.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
These comments lead me to believe you don't understand the issue.

Do you understand that *ALL* CJK queries are made into phrase queries, regardless of tokenizer?!!?!?!

--
Robert Muir
rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Sun, May 23, 2010 at 12:34 PM, Shai Erera ser...@gmail.com wrote:

I want to stress that not all ngram-based languages are affected by this behavior, especially those for which we do ngram just because of a lack of good tokenizer.

They are also affected! Do you understand how the queryparser treats whitespace?

You cannot currently use normal word-spanning n-grams with lucene because of this:
1) you can only use word-internal n-grams, because each whitespace-separated word gets its own tokenstream
2) all queries here are also made into phrasequeries automatically, which is stupid as n-grams already contain the 'positional information'

--
Robert Muir
rcm...@gmail.com
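Point 1 above can be made concrete with a small sketch. It assumes character bigrams and plain whitespace splitting, uses hypothetical names, and does not involve Lucene's tokenizers; it only shows which n-grams are lost when each whitespace-separated word is analyzed on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison of word-spanning and word-internal character bigrams; NOT Lucene code. */
final class NgramSketch {

    /** All character bigrams of the text as given, including those that cross spaces. */
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    /** Bigrams computed separately for each whitespace-separated chunk. */
    static List<String> perChunkBigrams(String query) {
        List<String> grams = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            // Each chunk is analyzed in isolation, so no gram can span a space.
            grams.addAll(bigrams(chunk));
        }
        return grams;
    }
}
```

For the input `ab cd`, the whole-text variant also produces the two grams that straddle the space, while the per-chunk variant yields only `ab` and `cd`. Chunks shorter than two characters produce no grams in this sketch.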
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Robert - I hope hitting the keyboard hard makes you happy :)

I do get the issue. And I still think that CJK queries are just a small percentage of all queries that are used in the world today. Or at least by Lucene. And I'm not sure why we want to change the default for ALL OTHER LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR !!?!?!?!?!?!??!?!?!
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
It's not just CJK queries; it's in general any language not separated on whitespace. There are a lot of other languages that don't use whitespace the same way English does.

On Sun, May 23, 2010 at 12:47 PM, Shai Erera ser...@gmail.com wrote:

Robert - I hope hitting the keyboard hard makes you happy :) I do get the issue. And I still think that CJK queries are just a small percentage of all queries that are used in the world today. Or at least by Lucene. And I'm not sure why we want to change the default for ALL OTHER LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR !!?!?!?!?!?!??!?!?!
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Yes I understand that. But because of this it is still not a bug, it is a feature (and also implemented like that) to build phrase queries without Quotes, e.g. by simply appending works with ASCII-hyphens (for most European analyzers). And exactly to preserve this behavior, lets simply switch it on/of using a getsetter. That’s all I want, really. I know you are right and I still want to drink beer with you in Berlin and not being killed :-) I just want to make the feature accessible and documented without Version. The idea behind Version would be contradicted. Also the feature would go in 4.0. That’s all and I hope that you understand my argument. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Sunday, May 23, 2010 6:43 PM To: dev@lucene.apache.org Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count These comments lead me to believe you don't understand the issue. Do you understand that *ALL* CJK queries are made into phrase queries, regardless of tokenizer?!!?!?! On Sun, May 23, 2010 at 12:38 PM, Uwe Schindler u...@thetaphi.de wrote: Same here, as already noted in the issue. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Shai Erera [mailto:ser...@gmail.com] Sent: Sunday, May 23, 2010 6:34 PM To: dev@lucene.apache.org Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count Robert - is the effect on scoring also on English and other European languages? Or is it mostly for ngram-based languages, and especially CJK? I want to stress that not all ngram-based languages are affected by this behavior, especially those for which we do ngram just because of a lack of good tokenizer. That's why I'm not sure the default should be changed and I'm all for a getter/setter. 
If however it turns out the default MUST be changed, then I support the Version + getter/setter approach.

Shai

On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870410#action_12870410 ]

Uwe Schindler commented on LUCENE-2458:

Hi Robert, I also agree with Mark (as you know). We can have both:
- Version for a good default (3.1 will get the new non-phrase-query behavior)
- A separate getsetter for this option (set/getCreatePhraseQueryOnConcenattedTerms or whatever)

This would give you the best of both worlds.
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
The QP should work like this:
(1) It parses the query, creating fragments.
(2) It does some out-of-the-box handling of those fragments.
People should be able to override that handling of fragments. But people should not touch (1).

In fact QP should work like this:
(1) Tokenizer parses the query as if it was a string of text. Care must be taken to preserve query language operators, as this stage essentially replaces the current QP's lexer stage.
(2) QP's syntax parser kicks in, identifies operators (those that Tokenizer didn't treat as a part of word tokens) and does overridable out-of-the-box handling for them and the tokens around them.

The point is - it's hard to do correctly. That's why Lucene resorts to an upside-down approach.

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
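The first two-stage split described above (a fixed parsing stage producing fragments, and an overridable handling stage) can be sketched roughly as follows. This is only an illustration in Python, not Lucene code: `Fragment`, `parse_fragments`, `FragmentHandler` and `LowercasingHandler` are hypothetical names, the default field `body` is made up, and the toy grammar only knows double quotes and whitespace.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fragment:
    """One piece of the parsed query (hypothetical structure)."""
    field: str
    text: str
    quoted: bool  # True when the user wrote the text inside double quotes


def parse_fragments(query: str, default_field: str = "body") -> List[Fragment]:
    """Stage (1): split the query on syntax only. Not meant to be overridden."""
    fragments = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            end = query.find('"', i + 1)
            if end == -1:  # unterminated quote: take the rest of the string
                end = len(query)
            fragments.append(Fragment(default_field, query[i + 1:end], True))
            i = end + 1
        else:
            j = i
            while j < len(query) and not query[j].isspace() and query[j] != '"':
                j += 1
            fragments.append(Fragment(default_field, query[i:j], False))
            i = j
    return fragments


class FragmentHandler:
    """Stage (2): out-of-the-box handling that subclasses may override."""

    def handle(self, fragment: Fragment) -> Tuple[str, str, str]:
        kind = "phrase" if fragment.quoted else "term"
        return (kind, fragment.field, fragment.text)


class LowercasingHandler(FragmentHandler):
    """Example override: same decisions as the default, but lowercases text."""

    def handle(self, fragment: Fragment) -> Tuple[str, str, str]:
        kind, field, text = super().handle(fragment)
        return (kind, field, text.lower())
```

The point of the split is that an application changes behavior by subclassing the handler and never has to touch the parsing stage.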
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler u...@thetaphi.de wrote:

> I just want to make the feature accessible and documented without Version.

I think it is just a bug (a shoddy implementation that does not use the syntax, whether it was quoted or not, since this has been thrown away). In this implementation no one thought about languages that don't use whitespace and that it would make all queries into phrasequeries.

I really do not think this sort of code belongs inside core Lucene; if you want to keep uninternationalized code that is not correct in your own code base, that is fine.

Furthermore, preserving this kind of bug makes the queryparser more complicated, especially in the future. If at some point in the future you really want the QP not to split on whitespace (as you yourself said on the issue you want) to enable support for multi-word synonyms and real n-grams at query time, I hope you understand that this buggy code conflicts with and complicates that later goal.

-- Robert Muir rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
So ... after a long IRC chat on this, I think this has just been worded incorrectly (the issue). As I understand, there are two issues here:

1) QP loses the phrase information for fields -- the queries f:abcd and f:"abcd" are parsed the same, or handled the same. There is no way for the one extending QP to tell if quotes were used.
2) QP has a default impl for f:abcd which is not international-friendly.

I agree (1) should be fixed, and I apologize if I missed that previously. Version is the right way to go with this.

About (2), I think that if f:abcd is submitted, then a PQ should not be created. The user hasn't asked for it. But if f:"abcd" was submitted, then it is ok to create a PQ by default. And we're only talking about defaults here. Anyone should be able to extend QP and override the relevant getFieldQuery variant and do whatever he wants.

If the question is what the default behavior for (2) should be, then I think, pending Version, it should create a PQ for f:"abcd" only. And we leave it to the extender to determine what the right behavior is for him.

Shai
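The two points in Shai's summary can be sketched as a field-query hook that receives a `quoted` flag and only builds a phrase when that flag is set. This is a hedged Python illustration, not the patch: `get_field_query`, `unigram_analyze` and the tuple-shaped results are hypothetical stand-ins (the thread only names a `getFieldQuery` variant on the Java side).

```python
from typing import Callable, List, Optional, Tuple

Query = Tuple[str, str, object]  # (kind, field, payload); a stand-in for real query objects


def unigram_analyze(text: str) -> List[str]:
    """Toy analyzer that emits one token per character, as a CJK unigram analyzer might."""
    return [c for c in text if not c.isspace()]


def get_field_query(field: str, text: str, quoted: bool,
                    analyze: Callable[[str], List[str]]) -> Optional[Query]:
    """Default handling: a phrase query is built only when the user used quotes."""
    tokens = analyze(text)
    if not tokens:
        return None
    if len(tokens) == 1:
        return ("term", field, tokens[0])
    if quoted:
        return ("phrase", field, tokens)  # f:"abcd" -> phrase, the user asked for it
    return ("bool", field, tokens)        # f:abcd   -> plain boolean clauses over the tokens
```

Because `quoted` reaches the hook, someone extending the parser can still choose to build a phrase for the unquoted case; that choice is simply no longer made for them.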
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
+1 to Shai's summary (sent Sun, May 23, 2010 at 2:25 PM): this is what the patch does. I agree I did a crappy job explaining the issue.

-- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870317#action_12870317 ]

Mark Miller commented on LUCENE-2458:

I still don't think this falls under bug territory myself - which leads me to thinking that Version is not the correct way to handle it. The icecream example showing that this is not a 'perfect' solution even for English does not show it's a bug in my opinion either. I still vote to make this an option. Or make another QueryParser that works with more languages, and I guess with less 'biased' English language operators. The whole idea of the new QP was to make that type of thing easy if I remember right.

bq. So users who want to emulate the English-optimized forced PhraseQuery even when user didn't say so explicitly can create QP

This should be an option, not an emulation.

queryparser shouldn't generate phrasequeries based on term count

Key: LUCENE-2458
URL: https://issues.apache.org/jira/browse/LUCENE-2458
Project: Lucene - Java
Issue Type: Bug
Components: QueryParser
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Critical
Fix For: 3.1, 4.0
Attachments: LUCENE-2458.patch, LUCENE-2458.patch

The current method in the queryparser to generate phrasequeries is wrong. The Query Syntax documentation (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:

{noformat}
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
{noformat}

But as we know, this isn't actually true. Instead the terms are first divided on whitespace, then the analyzer term count is used as some sort of heuristic to determine if it's a phrase query or not.

This assumption is a disaster for languages that don't use whitespace separation: CJK, compounding European languages like German, Finnish, etc. It also makes it difficult for people to use n-gram analysis techniques. In these cases you get bad relevance (MAP improves nearly *10x* if you use a PositionFilter at query-time to turn this off for Chinese).

Even for English, this undocumented behavior is bad. Perhaps in some cases it's being abused as some heuristic to second guess the tokenizer and piece back things it shouldn't have split, but for large collections, doing things like generating phrasequeries because StandardTokenizer split a compound on a dash can cause serious performance problems. Instead people should analyze their text with the appropriate methods, and QueryParser should only generate phrase queries when the syntax asks for one.

The PositionFilter in contrib can be seen as a workaround, but it's pretty obscure and people are not familiar with it. The result is we have bad out-of-box behavior for many languages, and bad performance for others on some inputs.

I propose instead that we change the grammar to actually look for double quotes to determine when to generate a phrase query, consistent with the documentation.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
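The heuristic the issue description objects to (split on whitespace first, then treat any chunk that analyzes to more than one token as a phrase) can be modelled in a few lines. The following Python sketch is a simplified model for illustration, not Lucene's implementation; `legacy_parse`, `split_on_punctuation` and `cjk_bigrams` are hypothetical names, and the two analyzers are toy stand-ins for a punctuation-splitting tokenizer and a CJK bigram tokenizer.

```python
import re
from typing import Callable, List, Tuple

Clause = Tuple[str, object]  # ("term", token) or ("phrase", [tokens])


def legacy_parse(query: str, analyze: Callable[[str], List[str]]) -> List[Clause]:
    """Model of the term-count heuristic: more than one token per chunk means phrase."""
    clauses = []
    for chunk in query.split():          # pre-split on whitespace
        tokens = analyze(chunk)
        if len(tokens) > 1:
            clauses.append(("phrase", tokens))
        elif tokens:
            clauses.append(("term", tokens[0]))
    return clauses


def split_on_punctuation(chunk: str) -> List[str]:
    """Toy analyzer: lowercases and splits on any non-word character."""
    return [t for t in re.split(r"[^\w]+", chunk.lower()) if t]


def cjk_bigrams(chunk: str) -> List[str]:
    """Toy analyzer: overlapping character bigrams."""
    return [chunk[i:i + 2] for i in range(len(chunk) - 1)] or [chunk]
```

With these stand-ins, an unquoted CJK query with no whitespace is a single chunk that analyzes to several bigrams, so the model always turns it into a phrase; a hyphenated English compound gets the same treatment.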
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870353#action_12870353 ]

Shai Erera commented on LUCENE-2458:

FWIW, I agree w/ Mark. I don't think it's a bug, but more of a user option. Whether it should be specified by a setter, or an extension of QP - I have no strong feelings for either of them, so either would be fine by me.

And for what it's also worth, we once worked w/ a Japanese linguist, who suggested that we always convert queries like [abcd] to [abcd "abcd"] or just ["abcd"], because if someone had already bothered to write them like that, then phrase matching should contribute to the rank of the documents. IMO, if someone had gone even further by writing [field:abcd], then even if the query should be [field:a field:b field:c field:d], executing the query [field:"abcd"] is still important and better.

So .. I'm not trying to argue what should be the default behavior, because that is subject to personal flavor and app requirements -- only to emphasize that there are many use cases out there, and we should cater for such scenarios.

The extension way is already supported, right? So perhaps we just need to document the current behavior, and not change anything? Or, introduce a setter that will do the simple thing - either keep it as a phrase or break it down to terms. More sophisticated scenarios can be dealt with through extension.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869280#action_12869280 ]

Michael McCandless commented on LUCENE-2458:

OK mulling some more on this one...

Even for English, the QP hack (pre-splitting on whitespace, then turning any text that analyzes to multiple tokens into a PhraseQuery) doesn't work right. EG, say I want "ice-cream", "ice cream" and "icecream" to mean the same thing. Really I should do this (handling compounds) during indexing -- I'll get better relevance and performance. But say for some reason I'm doing it at search time... I would want an analyzer that detects all three forms and in turn expands to all three forms in the query. But there's no way to do this today: because QP pre-splits on whitespace, for "ice cream" the analyzer would separately receive "ice" and "cream", so it never has a chance to detect this form of the compound.

So... first, I think we should fix QP to not pre-split on whitespace. QP really should be as language neutral as possible. It should only split on syntax chars, and send the whole string in between syntax chars to the analyzer.

And, second, the QP should not create a PhraseQuery when it sees multiple tokens come back. This obliterates the OOTB experience for non-whitespace languages. And, it doesn't work right for English... so I think we should deprecate the option and default it to off.

Really the contrib queryparser is a better fit for doing rewrites like this: it's able to operate on the abstract query tree, and can easily do things like rewriting the query to add phrase queries...
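The "ice cream" problem can be reproduced with a toy analyzer. The sketch below is an assumption-laden Python illustration rather than Lucene code: `compound_aware_analyze`, `analyze_presplit` and `analyze_whole` are hypothetical names, and for brevity the analyzer normalises all three spellings to one canonical token instead of expanding to all three forms as the comment suggests.

```python
from typing import List

COMPOUND_FORMS = ("ice cream", "ice-cream", "icecream")


def compound_aware_analyze(text: str) -> List[str]:
    """Toy analyzer: maps every known spelling of the compound to one token."""
    lowered = text.lower().strip()
    if lowered in COMPOUND_FORMS:
        return ["icecream"]
    return lowered.split()


def analyze_presplit(query: str) -> List[str]:
    """Parser pre-splits on whitespace, so the analyzer sees one chunk at a time."""
    tokens = []
    for chunk in query.split():
        tokens.extend(compound_aware_analyze(chunk))
    return tokens


def analyze_whole(query: str) -> List[str]:
    """Parser hands the whole text between syntax characters to the analyzer."""
    return compound_aware_analyze(query)
```

Under the pre-split model the two-word spelling reaches the analyzer as separate `ice` and `cream` chunks, so it is the one form the analyzer cannot recognise; passing the whole string through lets all three spellings converge.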
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867112#action_12867112 ]

Robert Muir commented on LUCENE-2458:

{quote}
This is why I like the token attr based solution
{quote}

I am, and will always be, -1 to this solution. Why can't we try to think about Lucene from a proper internationalization architecture perspective? You shouldn't design APIs around e-mail phenomena in English; that's absurd.

{quote}
BTW, this appears to not be an English-only need; this page (http://www.seobythesea.com/?p=1206) lists these example languages as also using English-like compound words: Some example languages that use compound words include: Afrikaans, Danish, Dutch-Flemish, English, Faroese, Frisian, High German, Gutnish, Icelandic, Low German, Norwegian, Swedish, and Yiddish.
{quote}

Please don't try to insinuate that phrases are the way you should handle compound terms in these languages unless you have some actual evidence that they should be used instead of normal decompounding. These languages have different syntax and word formation, and it's simply not appropriate.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867117#action_12867117 ]

Uwe Schindler commented on LUCENE-2458:

Sorry for intervening, I am of the same opinion as Hoss: a lot of people are used to being able to create phrases in search engines by joining words with dashes (which StandardAnalyzer does perfectly with the current query parser impl). As quotes are slower to write, I e.g. always use this approach to search for phrases in Google: this-is-a-phrase, which always works and brings the same results as "this is a phrase" (only the ranking is sometimes slightly different in Google).

So we should have at least some possibility to switch on the behavior that creates phrase queries out of multiple tokens with posIncr>0 -- but I am +1 on fixing the problem for non-whitespace languages like CJK.

It's also broken that QueryParser parses whitespace in its javacc grammar; in my opinion, this should be done by the analyzer (and not partly by the analyzer and partly by the QP grammar).
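The switch Uwe asks for can be pictured as a single boolean on the parser that decides what happens when one unquoted chunk analyzes to several tokens. The following is a minimal Python sketch under stated assumptions, not the Lucene API: `SketchQueryParser`, `set_auto_phrase` and `parse_chunk` are hypothetical names, and the regular-expression split stands in for an analyzer that breaks words on dashes.

```python
import re
from typing import List, Tuple


class SketchQueryParser:
    """Toy parser with a setter controlling phrase generation for unquoted text."""

    def __init__(self, auto_phrase: bool):
        # No default is chosen here on purpose: the thread disagrees on what it should be.
        self.auto_phrase = auto_phrase

    def set_auto_phrase(self, enabled: bool) -> None:
        self.auto_phrase = enabled

    def parse_chunk(self, text: str) -> Tuple[str, List[str]]:
        """Analyze one unquoted chunk and pick the query type from the flag."""
        tokens = [t for t in re.split(r"[^\w]+", text.lower()) if t]
        if len(tokens) > 1 and self.auto_phrase:
            return ("phrase", tokens)   # dash-joined words behave like a quoted phrase
        if len(tokens) > 1:
            return ("bool", tokens)     # tokens become independent clauses
        return ("term", tokens)
```

With the flag on, `this-is-a-phrase` behaves like a quoted phrase; with it off, the same input yields independent terms, which is the behavior wanted for non-whitespace languages.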
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867147#action_12867147 ]

Yonik Seeley commented on LUCENE-2458:

bq. This is why I like the token attr based solution

+1

Although I think it's more general than de-compounding. An attribute that says "these tokens go together" or "these tokens should be considered one unit" seems like nice generic functionality, and is unrelated to any specific language or search feature.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867151#action_12867151 ]

Robert Muir commented on LUCENE-2458:

{quote}
An attribute that says these tokens go together or these tokens should be considered one unit seems like nice generic functionality, and is unrelated to any specific language or search feature.
{quote}

No, if they are one unit for search, they are one token. Instead the tokenizer should be fixed so that they are one token, instead of making all languages suffer because of a crappy English tokenizer.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866528#action_12866528 ]

Michael McCandless commented on LUCENE-2458:

This is sneaky behavior on QueryParser's part! I didn't realize it did this.

What are some real use-cases where this is good? WordDelimiterFilter seems like a good example (e.g., Wi-Fi -> Wi Fi).

It sounds like it's a very bad default for non-whitespace languages.

It seems like we should make it controllable, switch it under Version, and change the default going forward to not do this?

bq. Token Attributes could be used to instruct QueryParser as to the intent behind a stream of multiple tokens?

This seems like a good idea (since we seem to have real-world cases where it's very useful and others where it's very bad)? Could/should it be per-analyzer? (i.e., WDF would always do this but, say, ICUAnalyzer would never). Or, per-token created?
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
The QueryParser also fails to correctly parse Hebrew acronyms; although this is not an integral part of the current discussion, I thought this would be the best place to bring it up.

Hebrew acronyms are assembled of letters with a single double-quote char within, for example: MNK"L (Hebrew for CEO). That double-quote char usually comes at the before-last position of the word, but in some cases it can come earlier (MNK"LIT). Since the QP expects two sets of double-quotes enclosing a phrase, an exception will be thrown if such a word is passed to it, or an incorrect phrase query will be produced if two acronyms are used together in a query string. Not sure which is worse.

Perhaps while you're at it you could make sure to only create a phrase query if a quote is followed by a space (and hence is definitely at the end of a word), and not just assume it to be equivalent to white space?

Although there's no good open Hebrew analyzer for Lucene yet, and hence no motivation for this to be fixed, I'm working on one as we speak and hopefully will have something to show in the next few weeks/days. It would be nice to have at least this issue closed within the Lucene core code.

Thanks,
Itamar Syn-Hershko
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko ita...@code972.com wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms; although not being an integral part of the current discussion, I thought this would be the best place to bring that up.

Just as I don't think Analysis should do QueryParsing, I don't think QueryParsing should do Analysis either.

Similar problems to this exist in other languages (I have to escape : for some, because lucene wants to interpret it as a field name).

But this can be easily remedied on the application side: it's documented and understood that the double-quote is a special character, and there is an escape mechanism so you can escape the ones you think are acronyms.

This issue is about a buggy implementation: it's not documented and only internal to how the queryparser determines what is a phrase query or not (and, contrary to what you would believe from the documentation, the choice of whether or not to make a PhraseQuery is not based on syntax one bit!)

-- Robert Muir rcm...@gmail.com
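The application-side remedy described above can be sketched roughly as follows. This is only an illustration, not Lucene code: `escape_inner_quotes` is a hypothetical helper, and its rule (a double quote with a word character on both sides belongs to an acronym) is an assumption that an application would need to tune. The backslash is the escape character of the query syntax.

```python
import re

# Assumption: a double quote with a word character on each side sits inside
# a word (for example an acronym) and should not open or close a phrase.
_INNER_QUOTE = re.compile(r'(?<=\w)"(?=\w)')


def escape_inner_quotes(query: str) -> str:
    """Backslash-escape double quotes that appear inside a word."""
    return _INNER_QUOTE.sub(r'\\"', query)


print(escape_inner_quotes('MNK"L'))
print(escape_inner_quotes('"hello dolly"'))
```

Quotes at the start or end of a word are left alone, so ordinary phrase syntax keeps working; a string such as `"foo"bar` would still be misread, which is why the rule is labeled an assumption.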
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On 5/12/10 9:25 AM, Robert Muir wrote:
> (and, contrary to what you would believe from the documentation, the choice of whether or not to make a PhraseQuery is not based on syntax one bit!)

That's a major exaggeration - quoting text plays a large role in whether or not you will get a phrase query.

-- - Mark http://www.lucidimagination.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
On Wed, May 12, 2010 at 11:16 AM, Mark Miller markrmil...@gmail.com wrote:
> That's a major exaggeration - quoting text plays a large role in whether or not you will get a phrase query.

No, it has nothing to do with it in the implementation. The quoting only escapes the whitespace, and is then discarded. This is clear from looking at the grammar.

The logic that then determines whether you get a phrase query is the huge mess of code in getFieldQuery, but it's not based on the double quotes at all.

For example, a list of chinese or thai words gets a phrase query, only because they don't use whitespace between words. But a similar list of english words gets a boolean query.

-- Robert Muir rcm...@gmail.com
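A toy model may make the behavior under discussion concrete. The sketch below is not the real QueryParser or getFieldQuery; `toy_analyze` and `parse` are hypothetical names, and the analyzer is a stand-in that emits one token per Han character or per run of ASCII letters and digits. It only mirrors the description in this thread: quotes protect whitespace and are then dropped, and any chunk that analyzes to more than one token becomes a phrase.

```python
import re

CHUNK = re.compile(r'"[^"]*"|\S+')                    # quotes only protect whitespace
TOKEN = re.compile(r'[\u4e00-\u9fff]|[A-Za-z0-9]+')   # one Han char, or an ASCII run


def toy_analyze(text):
    """Stand-in analyzer: splits on punctuation, one token per Han character."""
    return [t.lower() for t in TOKEN.findall(text)]


def parse(query):
    """Return (kind, payload) clauses using the term-count heuristic."""
    clauses = []
    for chunk in CHUNK.findall(query):
        tokens = toy_analyze(chunk.strip('"'))   # the quotes are discarded here
        if len(tokens) > 1:
            clauses.append(("phrase", tokens))   # decided by token count alone
        elif tokens:
            clauses.append(("term", tokens[0]))
    return clauses


for q in ['hello dolly', '"hello dolly"', 'wi-fi', '中文搜索']:
    print(q, '->', parse(q))
```

In this model the quoted input yields a phrase only because the quotes kept the whitespace inside one chunk, while the unquoted `wi-fi` and the unquoted Han string yield phrases as well.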
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866595#action_12866595 ]

Marvin Humphrey commented on LUCENE-2458:

I have mixed feelings about this for English. It's a weakness of our engine that we do not take position of terms within a query string into account. At times I've tried to modify the scoring hierarchy to improve the situation, but I gave up because it was too difficult. This behavior of QueryParser is a sneaky way of getting around that limitation by turning stuff which should almost certainly be treated as phrase queries into phrase queries. It's the one place where we actually exploit position data within the query string.

Mike's wi-fi example, though, wouldn't suffer that badly. The terms wi and fi are unlikely to occur much outside the context of 'wi-fi/wi fi/wifi'. And treating wi-fi as a phrase still won't conflate results with wifi as it would ideally.

The example I would use doesn't typically apply to Lucene. Lucene's StandardAnalyzer tokenizes URLs as wholes, but KinoSearch's analogous analyzer breaks them up into individual components. As described in another recent thread, this allows a search for 'example.com' to match a document which contains the URL 'http://www.example.com/index.html'. It would suck if all of a sudden a search for 'example.com' started matching every document that contained 'com'.

You could, and theoretically should, address this problem with sophisticated analysis. But it does make it harder to write a good Analyzer. You make it more important to solve what Yonik calls the 'e space mail' problem by making it worse.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866603#action_12866603 ]

Marvin Humphrey commented on LUCENE-2458:

> Because they show it's 10x better to use this operator for Chinese

Another way to achieve this 10x improvement is to change how QP performs its first stage of tokenization, as you and I discussed at ApacheCon Oakland.

Right now QP splits on whitespace. If that behavior were customizable, e.g. via a splitter Analyzer, then individual Han characters would get submitted to getFieldQuery() -- and thus getFieldQuery() would no longer turn long strings of Han characters into a PhraseQuery.

It seems wrong to continue to push entire query strings from non-whitespace-delimited languages down into getFieldQuery().
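The customizable first-stage split proposed in this comment can be illustrated with a toy model. Nothing below is Lucene or KinoSearch API: `split_whitespace`, `split_han`, and `parse` are hypothetical names, and the analyzer is a stand-in that emits one token per Han character or per run of ASCII letters and digits. The term-count heuristic itself is left untouched; only the chunks handed to it change.

```python
import re

TOKEN = re.compile(r'[\u4e00-\u9fff]|[A-Za-z0-9]+')   # stand-in analyzer


def split_whitespace(query):
    """Default first stage: split the query string on whitespace only."""
    return query.split()


def split_han(query):
    """Alternative first stage: every Han character becomes its own chunk."""
    chunks = []
    for piece in query.split():
        chunks.extend(re.findall(r'[\u4e00-\u9fff]|[^\u4e00-\u9fff]+', piece))
    return chunks


def parse(query, splitter=split_whitespace):
    """Apply the unchanged term-count heuristic to each chunk."""
    clauses = []
    for chunk in splitter(query):
        tokens = [t.lower() for t in TOKEN.findall(chunk)]
        if len(tokens) > 1:
            clauses.append(("phrase", tokens))
        elif tokens:
            clauses.append(("term", tokens[0]))
    return clauses


print(parse('中文搜索'))
print(parse('中文搜索', splitter=split_han))
print(parse('wi-fi', splitter=split_han))
```

With the Han-aware splitter the Chinese string produces individual term clauses, while `wi-fi` still produces a phrase, which is the trade-off this proposal is aiming for.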
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1285#action_1285 ]

Ivan Provalov commented on LUCENE-2458:

Robert has asked me to post our test results on the Chinese Collection. We used the following data collection from TREC:

http://trec.nist.gov/data/qrels_noneng/index.html
qrels.trec6.29-54.chinese.gz
qrels.1-28.chinese.gz

http://trec.nist.gov/data/topics_noneng
TREC-6 Chinese topics (.gz)
TREC-5 Chinese topics (.gz)

Mandarin Data Collection
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52

||Analyzer Name||Plain analyzers||Added PositionFilter (only at query time)||
|ChineseAnalyzer|0.028|0.264|
|CJKAnalyzer|0.027|0.284|
|SmartChinese|0.027|0.265|
|IKAnalyzer|0.028|0.259|

(Note: IKAnalyzer has its own IKQueryParser which yields 0.084 for the average precision)

Thanks,
Ivan Provalov
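The "nearly 10x" figure quoted in the issue description can be checked against the numbers in this comment. The snippet below only divides the two columns of the posted table; the dictionary layout and the `improvement` helper are illustrative names, not part of any tool used in the evaluation.

```python
# Average precision from the table above: (plain analyzer, with PositionFilter)
results = {
    "ChineseAnalyzer": (0.028, 0.264),
    "CJKAnalyzer": (0.027, 0.284),
    "SmartChinese": (0.027, 0.265),
    "IKAnalyzer": (0.028, 0.259),
}


def improvement(plain, filtered):
    """Ratio of average precision with the filter to the plain run."""
    return filtered / plain


for name, (plain, filtered) in results.items():
    print(f"{name}: {improvement(plain, filtered):.2f}x")
```

The ratios fall between roughly 9x and 10.5x, with CJKAnalyzer showing the largest gain.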
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866693#action_12866693 ]

Marvin Humphrey commented on LUCENE-2458:

I'm honestly having a tough time seeing where to proceed on this issue.

Change the initial split on whitespace to be customizable. Override the splitting behavior for non-whitespace-delimited languages and feed getFieldQuery() smaller chunks. That solves your problem without removing behavior most people believe to be helpful. Insisting on that orthogonal change is what is holding things up.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866696#action_12866696 ]

Hoss Man commented on LUCENE-2458:

bq. Instead the queryparser should only form phrasequeries when you use double quotes, just like the documentation says.

i'll grant you that the documentation is wrong -- i view that as a bug in the documentation, not the code.

Saying that a PhraseQuery object should only ever be constructed if the user uses quotes is like saying that a BooleanQuery should only ever be constructed if the user specifies boolean operators -- there is no rule that the syntax must match the query structure, the same Query classes can serve multiple purposes. The parser syntax should be what makes sense for the user, and the query structure constructed should be what makes sense for the index, based on the syntax used by the user.

If i have built an index consisting entirely of ngrams, the end user shouldn't have to know that -- they shouldn't have to put every individual word in quotes to force a PhraseQuery to be constructed out of the ngram tokenstream produced by an individual word.

bq. Why isn't english treated this way too? I don't consider this bias towards english at all costs including preventing languages such as Chinese from working at all very fair, I think it's a really ugly stance for Lucene to take.

I personally don't view it as an english bias ... to me it is a backwards compatibility bias.

I'm totally happy to make things configurable, but if two diametrically opposed behaviors are both equally useful, and a choice needs to be made between leaving the default configuration the way the current hardcoded behavior is, or making the default the exact opposite of what the current hardcoded behavior is, then i would prefer to leave the default alone -- especially since this behavior has been around for so long, and many Analyzers and TokenFilters have been written with this behavior specifically in mind (several examples of this are in the Lucene code base -- and if we have them in our own code, you can be sure they exist in the wild of client code that would break if this behavior changes by default).

Once again: if this is a problem that can be solved per instance with token attributes, then by all means let's make *all* of the TokenFilters that come out of the box implement this appropriately (english and non-english alike) so that people who change the default settings on the queryparser get the correct behavior regardless of language. but all other things being equal let's keep the behavior working the way it has to avoid surprises.

bq. What are some real use-cases where this is good?

* WordDelimiterFilter (wi-fi is a pathologically bad example for this issue because as robert pointed out wi and fi don't tend to exist independently in english, but people tend to get annoyed when race-horse matches all docs containing race or horse)
* single word to multiword synonym expansion/transformation (particularly acronym expansion: GE = General Electric)
* Ngram indexing for fuzzy matching (if someone searches for the word billionaire they're going to be surprised to get documents containing lion)
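The ngram use-case in the list above is easy to reproduce in miniature. The sketch assumes character trigrams and a single-word "document"; `trigrams`, `matches_any`, and `matches_phrase` are illustrative helpers rather than Lucene classes. It contrasts boolean-OR matching over a word's ngrams with phrase-style matching that requires the ngrams in consecutive positions.

```python
def trigrams(word):
    """Character trigrams of a word, in order of position."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def matches_any(query_word, doc_word):
    """Boolean OR over the query's trigrams: one shared trigram is a hit."""
    return bool(set(trigrams(query_word)) & set(trigrams(doc_word)))


def matches_phrase(query_word, doc_word):
    """Phrase semantics: the query's trigrams must appear consecutively, in order."""
    q, d = trigrams(query_word), trigrams(doc_word)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


print(trigrams("billionaire"))
print(matches_any("billionaire", "lion"))
print(matches_phrase("billionaire", "lion"))
```

The word billionaire contains the trigrams lio and ion, so the OR form matches a document containing only lion, while the phrase form does not.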
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866695#action_12866695 ]

Robert Muir commented on LUCENE-2458:

{quote}
Change the initial split on whitespace to be customizable. Override the splitting behavior for non-whitespace-delimited languages and feed getFieldQuery() smaller chunks.
{quote}

Whitespace doesn't separate words in the majority of the world's languages, including english.

The responsibility should be instead on english to do its language-specific processing, not on everyone else to dodge it.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866698#action_12866698 ]

Robert Muir commented on LUCENE-2458:

bq. but all other things being equal let's keep the behavior working the way it has to avoid surprises.

This attitude makes me sick. The surprise is to the CJK users that get no results due to undocumented, english-specific hacks that people refuse to let go of.
RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
I think we understand each other perfectly well. I still think resolving this is very simple: apply the correct logic (ignore double quotes followed by a char), which isn't enforced today. Once it is, it won't cause any cases of unexpected behavior.

This isn't an analysis-related task, and I'm not sure what makes you insist so strongly. I will open a dedicated JIRA ticket for this discussion if it doesn't become part of the current one.

Itamar.

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Thursday, May 13, 2010 1:42 AM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko ita...@code972.com wrote:
> Never did I request the QP to do Analysis. I simply mentioned this bug - what this definitely is -

It's definitely not a bug for Hebrew: there is a Unicode character for gershayim (U+05F4), so technically this should be used according to Unicode. It's arguably your responsibility to convert your data to Unicode before passing it through Lucene, and that includes disambiguating when a double quote should be gershayim.

--
Robert Muir
rcm...@gmail.com
Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Internationalization doesn't work by just piling hacks for language X, language Y, and language Z on top of each other. Just like I want the English hack removed, I strongly recommend against adding any Hebrew hack.

--
Robert Muir
rcm...@gmail.com
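For readers following the gershayim point, here is a minimal sketch of the kind of pre-processing Robert describes, done by the application before the string ever reaches QueryParser. It assumes that a double quote with a Hebrew letter immediately on both sides is meant as gershayim (U+05F4); that rule, and the class and method names, are hypothetical illustrations and not anything in Lucene.

```java
/**
 * Hypothetical application-side helper (not part of Lucene): rewrites an
 * ASCII double quote to HEBREW PUNCTUATION GERSHAYIM (U+05F4) when it sits
 * directly between two Hebrew letters, and leaves every other quote alone.
 */
final class GershayimNormalizer {
    private GershayimNormalizer() {}

    /** True for the Hebrew letters alef..tav (U+05D0..U+05EA). */
    static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Assumption: a quote inside a Hebrew word is an acronym mark,
            // not the start or end of a phrase.
            boolean insideHebrewWord = c == '"'
                    && i > 0 && isHebrewLetter(input.charAt(i - 1))
                    && i + 1 < input.length() && isHebrewLetter(input.charAt(i + 1));
            out.append(insideHebrewWord ? '\u05F4' : c);
        }
        return out.toString();
    }
}
```

Quotes that delimit a phrase have whitespace or the end of input on one side, so they pass through unchanged and the query parser still sees them as phrase syntax.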
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866954#action_12866954 ]

DM Smith commented on LUCENE-2458:

As I see it there are two issues:
1) Backward compatibility.
2) Correctness according to the syntax definition of a query.

Let me preface the following by saying I have not studied the query parser in Lucene. Over 20 years ago I got an MS in compiler writing. I've been away from it for quite a while. So, IMHO as a former compiler writer:

Maybe I'm just not getting it, but it should be trivial to define the grammar (w/ precedence for any ambiguity, if necessary) and implement it. The tokenizer for the parser should have the responsibility to break the input into sequences of meta and non-meta. This tokenizer should not be anything more than what the parser requires.

The non-meta is reasonably subject to further tokenization/analysis. This further analysis should be entirely under the user's control. It should not be part of the parser.

Regarding the issue, I think it would be best if a quotation was the sole criterion for determining what is a phrase, not some heuristic analysis of the token stream.
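The split DM Smith describes can be sketched roughly as below. This is only an illustration under simplifying assumptions: the double quote is treated as the only meta character, there is no escaping, and the names QuerySplitter, Kind and Chunk are made up for this note. The real QueryParser grammar handles far more syntax.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: separates quoted chunks from bare chunks, nothing more. */
final class QuerySplitter {
    private QuerySplitter() {}

    enum Kind { BARE, QUOTED }

    static final class Chunk {
        final Kind kind;
        final String text;
        Chunk(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + ":" + text; }
    }

    static List<Chunk> split(String query) {
        List<Chunk> chunks = new ArrayList<>();
        int n = query.length();
        int i = 0;
        while (i < n) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace only separates chunks
            } else if (c == '"') {
                int close = query.indexOf('"', i + 1);
                int end = close < 0 ? n : close;       // unterminated quote runs to end of input
                chunks.add(new Chunk(Kind.QUOTED, query.substring(i + 1, end)));
                i = close < 0 ? n : close + 1;
            } else {
                int start = i;
                while (i < n && !Character.isWhitespace(query.charAt(i)) && query.charAt(i) != '"') {
                    i++;
                }
                chunks.add(new Chunk(Kind.BARE, query.substring(start, i)));
            }
        }
        return chunks;
    }
}
```

Each chunk's text would then go to whatever analyzer the user configured; in this picture only the QUOTED kind could ever become a phrase, no matter how many tokens the analyzer emits.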
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866341#action_12866341 ]

Hoss Man commented on LUCENE-2458:

Robert: do you have a specific suggestion for what QueryParser should do if a single chunk of input causes the Analyzer to produce multiple tokens that are not at the same position (i.e. the current case where QueryParser produces a PhraseQuery even if there are no quotes)?

I.e. if the query parser is asked to parse...

{code}fieldName:A-Field-Value{code}

...and the Analyzer produces three tokens...

* A (at position 0)
* Field (at position 1)
* Value (at position 2)

...what should the resulting Query object be?
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866353#action_12866353 ]

Robert Muir commented on LUCENE-2458:

bq. ...what should the resulting Query object be?

a Boolean Query formed with the default operator.
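To make the two positions concrete, the following sketch contrasts the term-count heuristic described in the issue with the quotes-only rule Robert proposes, using Hoss's A-Field-Value example. It is a toy model that renders queries as strings; PhraseDecision and its methods are hypothetical names for this note, not QueryParser code, and the already-analyzed tokens are simply passed in.

```java
import java.util.List;

/** Hypothetical model of the two behaviours; queries are rendered as plain strings. */
final class PhraseDecision {
    private PhraseDecision() {}

    enum Operator { OR, AND }

    /** Behaviour the issue objects to: more than one token means a phrase, quoted or not. */
    static String termCountHeuristic(String field, List<String> tokens, Operator defaultOp) {
        return tokens.size() > 1 ? phrase(field, tokens) : terms(field, tokens, defaultOp);
    }

    /** Proposed behaviour: a phrase only when the user actually typed double quotes. */
    static String quotesOnly(String field, List<String> tokens, boolean quoted, Operator defaultOp) {
        return quoted && tokens.size() > 1 ? phrase(field, tokens) : terms(field, tokens, defaultOp);
    }

    private static String phrase(String field, List<String> tokens) {
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }

    /** Boolean combination of the tokens using the default operator. */
    private static String terms(String field, List<String> tokens, Operator defaultOp) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) sb.append(' ');
            if (defaultOp == Operator.AND) sb.append('+');   // '+' marks a required clause
            sb.append(field).append(':').append(token);
        }
        return sb.toString();
    }
}
```

With tokens A, Field, Value and no quotes, the heuristic renders fieldName:"A Field Value", while the quotes-only rule renders fieldName:A fieldName:Field fieldName:Value under OR, or the same three clauses each prefixed with + under AND.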
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866363#action_12866363 ]

Hoss Man commented on LUCENE-2458:

bq. a Boolean Query formed with the default operator.

That seems like equally bad default behavior -- lots of existing TokenFilters produce chains of tokens for situations where the user creating the query string clearly intended to be searching for a single word and has no idea that as an implementation detail multiple tokens were produced under the covers (i.e. WordDelimiterFilter, Ngrams, etc...)

I haven't thought this through very well, but perhaps this is an area where (the new) Token Attributes could be used to instruct QueryParser as to the intent behind a stream of multiple tokens? A new Attribute could be used on each token to convey when that token should be combined with the previous token, and in what way: as a phrase, as a conjunction or as a disjunction. (This could still be orthogonal to the position, which would indicate slop/span type information like it does currently.)

Stock Analysis components that produce multiple tokens could be modified to add this attribute fairly easily (it should be a relatively static value for any component that currently splits tokens), and QueryParser could have an option controlling what to do if it encounters a token w/o this attribute (perhaps even two options: one for quoted input chunks and one for unquoted input chunks).

That way the default could still work in a back-compatible way, but people using languages that don't use whitespace separation *and* are using older (or custom) analyzers that don't know about this attribute could set a simple query parser property to force this behavior.

Would that make sense? (asks the man who only vaguely understands Token Attributes at this point)
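Hoss's idea was never specified beyond the comment above, so the following is only a guess at its shape: a per-token hint saying how the token joins its predecessor, plus two parser-side fallbacks (one for quoted chunks, one for unquoted) for tokens that carry no hint. Every name here is hypothetical, and the sketch deliberately does not use Lucene's real Attribute API.

```java
/** Hypothetical sketch of a per-token "combine" hint with parser-side fallbacks. */
final class CombineHintSketch {
    private CombineHintSketch() {}

    enum Combine { PHRASE, CONJUNCTION, DISJUNCTION }

    static final class Token {
        final String term;
        final Combine hint;   // null when the analysis component did not set a hint
        Token(String term, Combine hint) { this.term = term; this.hint = hint; }
    }

    /**
     * Decides how a token is combined with the previous one: an explicit hint
     * wins; otherwise the fallback depends on whether the input chunk was quoted.
     */
    static Combine resolve(Token token, boolean quoted,
                           Combine quotedFallback, Combine unquotedFallback) {
        if (token.hint != null) {
            return token.hint;
        }
        return quoted ? quotedFallback : unquotedFallback;
    }
}
```

Under this reading, setting the unquoted fallback to PHRASE would reproduce the existing behaviour for analyzers that know nothing about the hint, and setting it to DISJUNCTION or CONJUNCTION would give the behaviour Robert asks for.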
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866368#action_12866368 ]

Robert Muir commented on LUCENE-2458:

bq. That seems like equally bad default behavior

Do you have measurements to support this? Because they show it's 10x better to use this operator for Chinese :)

bq. I haven't thought this through very well, but perhaps this is an area where (the new) Token Attributes

I disagree. Instead the queryparser should only form phrasequeries when you use double quotes, just like the documentation says.
[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
[ https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866374#action_12866374 ]

Robert Muir commented on LUCENE-2458:

By the way, Hoss Man, you said it best yourself:

{quote}
lots of existing TokenFilters produce chains of tokens for situations where the user creating the query string clearly intended to be searching for a single word and has no idea that as an implementation detail multiple tokens were produced under the covers (ie: WordDelimiterFilter, Ngrams, etc...)
{quote}

"User clearly intended" is wrong. WordDelimiterFilter will break Tibetan text in a similar manner (it uses no spaces between words), yet no user clearly intended to form phrase queries.

Users clearly intend to form phrase queries only when they use the phrase query operator. That's how the query parser is documented to work, and it's a bug that it doesn't work that way.