[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870384#action_12870384
 ] 

Shai Erera commented on LUCENE-2458:


bq. There will be tons of different opinions to that around linguists around 
the world but this parser is not for linguists in the first place.

Sure. I've pointed that out just to show there are different opinions around 
that particular problem.

bq. It seems to me making this behavior available with Version is the right way 
to go

I disagree (and agree w/ Mark). Version can only control default behavior. This 
particular issue should be a setter IMO. Regardless of what the default 
behavior is, I may want to set it differently. It doesn't make sense to *guess* 
which Version I should use in order to get that behavior.

That's why I don't mind leaving the current behavior as the default and 
introducing a setter for whoever wants to change it. The current behavior is 
applicable not just to English - I bet there's a whole list of languages which 
would interpret that query the same way (i.e. require a phrase to be generated).

And I don't know the distribution of Lucene users around the world, but I'm not 
sure that CJK users are more common than, say, English ones, or users of other 
European languages. So who knows what a good default is? :)

I suggest we leave the default as it is now, and introduce a setter. People 
have been working w/ the parser and that default for a long time. Why suddenly 
change it?

 queryparser shouldn't generate phrasequeries based on term count
 

 Key: LUCENE-2458
 URL: https://issues.apache.org/jira/browse/LUCENE-2458
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Blocker
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2458.patch, LUCENE-2458.patch


 The current method in the queryparser to generate phrasequeries is wrong:
 The Query Syntax documentation 
 (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
 {noformat}
 A Phrase is a group of words surrounded by double quotes such as "hello 
 dolly".
 {noformat}
 But as we know, this isn't actually true.
 Instead the terms are first divided on whitespace, then the analyzer term 
 count is used as some sort of heuristic to determine if it's a phrase query 
 or not.
 This assumption is a disaster for languages that don't use whitespace 
 separation: CJK, compounding European languages like German, Finnish, etc. It 
 also makes it difficult for people to use n-gram analysis techniques. In these 
 cases you get bad relevance (MAP improves nearly *10x* if you use a 
 PositionFilter at query-time to turn this off for Chinese).
 Even for English, this undocumented behavior is bad. Perhaps in some cases 
 it's being abused as some heuristic to second-guess the tokenizer and piece 
 back things it shouldn't have split, but for large collections, doing things 
 like generating phrasequeries because StandardTokenizer split a compound on a 
 dash can cause serious performance problems. Instead people should analyze 
 their text with the appropriate methods, and QueryParser should only generate 
 phrase queries when the syntax asks for one.
 The PositionFilter in contrib can be seen as a workaround, but it's pretty 
 obscure and people are not familiar with it. The result is we have bad 
 out-of-box behavior for many languages, and bad performance for others on 
 some inputs.
 I propose instead that we change the grammar to actually look for double 
 quotes to determine when to generate a phrase query, consistent with the 
 documentation.
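
The heuristic described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions: `toy_analyze` and `parse` are hypothetical names, not Lucene classes or methods, and the stand-in analyzer simply splits on non-alphanumeric characters and emits each CJK ideograph as its own token. The `auto_phrase` flag switches between the term-count heuristic and the proposed quotes-only behavior.

```python
import re


def toy_analyze(chunk):
    """Stand-in analyzer (an assumption, not a Lucene class).

    Lowercases the input, splits on anything that is not a letter or digit,
    and emits every CJK ideograph as a separate token.
    """
    return re.findall(r"[\u4e00-\u9fff]|[^\W_\u4e00-\u9fff]+", chunk.lower())


def parse(query, auto_phrase=True):
    """Return a list of (kind, terms) clauses for a query string."""
    clauses = []
    # Quoted phrases are captured whole; everything else is split on whitespace.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        if quoted:
            clauses.append(("phrase", toy_analyze(quoted)))
            continue
        terms = toy_analyze(bare)
        if len(terms) > 1 and auto_phrase:
            # The heuristic under discussion: one whitespace-separated chunk
            # that analyzes to several terms becomes a phrase query.
            clauses.append(("phrase", terms))
        else:
            clauses.extend(("term", [t]) for t in terms)
    return clauses


if __name__ == "__main__":
    for q in ['wi-fi router', '中国', '"hello dolly"']:
        print(q, parse(q, auto_phrase=True), parse(q, auto_phrase=False))
```

With `auto_phrase=True` the unquoted inputs `wi-fi` and `中国` both turn into phrase clauses purely because the analyzer returned more than one term; with `auto_phrase=False` only the explicitly quoted input produces a phrase.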

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
I can't tell if you are being obnoxious or seriously believe what you say.
You understand that CJKAnalyzer is broken with this? You understand that
ngrams themselves capture information about position, and that this even works
nicely with scoring, and helps?

This hack doesn't help English.  If you think otherwise, be a man and show
real results.



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Mark Miller
Obnoxiousness has certainly been in the air regarding this issue, I'll
give you that.


-- 
- Mark

http://www.lucidimagination.com




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870410#action_12870410
 ] 

Uwe Schindler commented on LUCENE-2458:
---

Hi Robert,

I also agree with Mark (as you know). We can have both:
- Version for a good default (3.1 will get the new non-phrase-query behavior)
- A separate getter/setter for this option 
(set/getCreatePhraseQueryOnConcatenatedTerms or whatever)

This would give you the best of both worlds.
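
The "Version for the default, setter for the override" idea can be sketched as follows. This is a minimal illustration, not Lucene code: `ToyParserConfig` and its method names are hypothetical, and the version is modelled as a plain `(major, minor)` tuple. The only fact taken from the thread is that 3.1 would get the new non-phrase-query default.

```python
class ToyParserConfig:
    """Hypothetical configuration holder; all names are illustrative only."""

    def __init__(self, match_version):
        # Version as a (major, minor) tuple, e.g. (3, 0) or (3, 1).
        self.match_version = match_version
        # None means "not set explicitly, fall back to the version default".
        self._auto_phrase = None

    def set_auto_generate_phrase_queries(self, value):
        """Explicit override; wins over whatever the version implies."""
        self._auto_phrase = bool(value)

    def get_auto_generate_phrase_queries(self):
        if self._auto_phrase is not None:
            return self._auto_phrase
        # Version only selects the default: the old heuristic before 3.1,
        # quotes-only phrase queries from 3.1 on.
        return self.match_version < (3, 1)


if __name__ == "__main__":
    old = ToyParserConfig((3, 0))
    new = ToyParserConfig((3, 1))
    print(old.get_auto_generate_phrase_queries())
    print(new.get_auto_generate_phrase_queries())
    new.set_auto_generate_phrase_queries(True)
    print(new.get_auto_generate_phrase_queries())
```

The design point is that nobody has to guess a Version to get a particular behavior: the version decides what happens when nothing is set, and the setter decides otherwise.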




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
Robert - does the effect on scoring also apply to English and other European
languages? Or is it mostly for ngram-based languages, and especially CJK?

I want to stress that not all ngram-based languages are affected by this
behavior, especially those for which we do ngram just because of the lack of a
good tokenizer.

That's why I'm not sure the default should be changed and I'm all for a
getter/setter. If, however, it turns out the default MUST be changed, then I
support the Version + getter/setter approach.

Shai




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Uwe Schindler
Same here, as already noted in the issue.

 

Uwe

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: u...@thetaphi.de

 


 



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
These comments lead me to believe you don't understand the issue.

Do you understand that *ALL* CJK queries are made into phrase queries,
regardless of tokenizer?!!?!?!


-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
On Sun, May 23, 2010 at 12:34 PM, Shai Erera ser...@gmail.com wrote:

 I want to stress that not all ngram-based languages are affected by this
 behavior, especially those for which we do ngram just because of a lack of
 good tokenizer.


They are also affected! Do you understand how the queryparser treats
whitespace? You cannot currently use normal word-spanning n-grams
with Lucene because of this:

1) you can only use word-internal n-grams because each
whitespace-separated word gets its own tokenstream
2) all queries here are also made into phrasequeries automatically,
which is stupid as n-grams already contain the 'positional
information'
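
The two points above can be sketched with plain character bigrams. This is a toy illustration, assuming a simple bigram analyzer; the function names are hypothetical and nothing here is Lucene API. It contrasts what an n-gram analyzer sees when the parser has already split the input on whitespace with what it would produce on the whole input.

```python
def bigrams(text):
    """Character bigrams of a string (a single unigram if it is too short)."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def per_chunk_bigrams(query):
    """Each whitespace-separated word is analyzed as its own token stream."""
    return [bigrams(chunk) for chunk in query.split()]


def spanning_bigrams(query):
    """What a word-spanning bigram analyzer would emit for the whole input."""
    return bigrams(query)


if __name__ == "__main__":
    print(per_chunk_bigrams("ab cd"))   # grams never cross the word boundary
    print(spanning_bigrams("ab cd"))    # grams that span the space survive
    # Overlapping grams already encode order: "ab" followed by "bc" can only
    # come from the sequence "abc", so wrapping them in a phrase adds little.
    print(bigrams("abc"))
```

Because the parser splits first, the boundary-crossing grams are never produced, which is point 1; the overlap between neighbouring grams is the "positional information" mentioned in point 2.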

-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
Robert - I hope hitting the keyboard hard makes you happy :)

I do get the issue. And I still think that CJK queries are just a small
percentage of all queries that are used in the world today. Or at least by
Lucene. And I'm not sure why we want to change the default for ALL OTHER
LANGUAGES, just so that CJK QUERIES will hAVe A diFFEreNT BeHAvioR
!!?!?!?!?!?!??!?!?!




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
It's not just CJK queries; it's in general any language not separated on
whitespace.

There are a lot of other languages that don't use whitespace the same
way English does.


RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Uwe Schindler
Yes, I understand that. But even so, it is still not a bug; it is a 
feature (and also implemented like that) to build phrase queries without 
quotes, e.g. by simply joining words with ASCII hyphens (for most European 
analyzers). And exactly to preserve this behavior, let's simply switch it on/off 
using a getter/setter. That’s all I want, really. I know you are right and I still 
want to drink beer with you in Berlin and not be killed :-) I just want to 
make the feature accessible and documented without Version. Controlling it only 
through Version would contradict the idea behind Version. Also the feature 
would go in 4.0.

That’s all and I hope that you understand my argument.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Robert Muir [mailto:rcm...@gmail.com]
 Sent: Sunday, May 23, 2010 6:43 PM
 To: dev@lucene.apache.org
 Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't
 generate phrasequeries based on term count
 
 These comments lead me to believe you don't understand the issue.
 
 Do you understand that *ALL* CJK queries are made into phrase queries,
 regardless of tokenizer?!!?!?!
 
 On Sun, May 23, 2010 at 12:38 PM, Uwe Schindler u...@thetaphi.de
 wrote:
  Same here, as already noted in the issue.
 
 
 
  Uwe
 
 
 
  -
 
  Uwe Schindler
 
  H.-H.-Meier-Allee 63, D-28213 Bremen
 
  http://www.thetaphi.de
 
  eMail: u...@thetaphi.de
 
 
 
  From: Shai Erera [mailto:ser...@gmail.com]
  Sent: Sunday, May 23, 2010 6:34 PM
 
  To: dev@lucene.apache.org
  Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't
  generate phrasequeries based on term count
 
 
 
  Robert - is the effect on scoring also on English and other European
  languages? Or is it mostly for ngram-based languages, and especially CJK?
 
  I want to stress that not all ngram-based languages are affected by
  this behavior, especially those for which we do ngram just because of
  a lack of good tokenizer.
 
  That's why I'm not sure the default should be changed and I'm all for
  a getter/setter. If however it turns out the default MUST be changed,
  then I support the Version + getter/setter approach.
 
  Shai
 
  On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA)
  j...@apache.org
  wrote:
 
  [
  https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870410#action_12870410
  ]
 
  Uwe Schindler commented on LUCENE-2458:
  ---
 
  Hi Robert,
 
  I also agree with Mark (as you know). We can have both:
  - Version for a good default (3.1 will get the new non-phrase-query
  behavior)
  - A separate getter/setter for this option
  (set/getCreatePhraseQueryOnConcatenatedTerms or whatever)
 
  This would give you the best from both worlds.
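
The combination proposed here (Version picks the default, an explicit setter overrides it) could look roughly like the following Python sketch. `ToyQueryParser`, its method names and the (3, 1) cut-over are assumptions made for illustration only; none of them is the Lucene API.

```python
class ToyQueryParser:
    """Hypothetical parser configuration, not the Lucene QueryParser."""

    def __init__(self, match_version):
        # match_version is a (major, minor) tuple, e.g. (3, 0) or (3, 1).
        self.match_version = match_version
        # Assumed cut-over: anything before 3.1 keeps the old behaviour
        # of building phrase queries from multi-token terms by default.
        self._create_phrase_on_concatenated_terms = match_version < (3, 1)

    def get_create_phrase_query_on_concatenated_terms(self):
        return self._create_phrase_on_concatenated_terms

    def set_create_phrase_query_on_concatenated_terms(self, enabled):
        # An explicit call wins over whatever the version implied.
        self._create_phrase_on_concatenated_terms = bool(enabled)
```

The point of the design is that nobody has to guess a Version value to get a behaviour: the version only changes what happens when the setter is never called.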
 

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Earwin Burrfoot
 The QP should work like that:
 (1) It parses the query, creating fragments
 (2) It does some out-of-the-box handling of those fragments

 People should be able to override that handling of fragments. But people
 should not touch (1).

In fact QP should work like that:
(1) Tokenizer parses the query as if it was a string of text.
Care must be taken to preserve query language operators, as this stage
essentially replaces current QP's lexer stage.
(2) QP's syntax parser kicks in, identifies operators (those that
Tokenizer didn't treat as a part of word tokens) and does overridable
out-of-the-box handling for them and tokens around them.

The point is - it's hard to do correctly. That's why Lucene resorts to
an upside-down approach.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
On Sun, May 23, 2010 at 1:00 PM, Uwe Schindler u...@thetaphi.de wrote:
  I just want to make the feature accessible and documented without Version.

I think it is just a bug (a shoddy implementation that does not use
the syntax, whether it was quoted or not, since this has been thrown
away). In this implementation no one thought about languages that
don't use whitespace and that it would make all queries into
phrasequeries.

I really do not think this sort of code belongs inside core Lucene; if
you want to write uninternationalized, incorrect code in your own code
base, that is fine.

Furthermore, preserving this kind of bug makes the queryparser
more complicated, especially in the future. If at some point in
the future you really want to have the QP not split on whitespace (as
you yourself said on the issue you want) to enable support for
multi-word synonyms and real n-grams at querytime, I hope you
understand this buggy code conflicts with and complicates this later goal.

-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Shai Erera
So ... after a long IRC chat on this, I think the issue has just been worded
incorrectly. As I understand, there are two issues here:
1) QP loses the phrase info for fields -- the queries f:abcd and f:"abcd" are
parsed the same, or handled the same. There is no way for someone extending
QP to tell if quotes were used.
2) QP has a default impl for f:abcd which is not international-friendly.

I agree (1) should be fixed, and I apologize if I missed that previously.
Version is the right way to go with this.

About (2), I think that if f:abcd is submitted, then a PQ should not be
created. The user hasn't asked for it. But if f:"abcd" was submitted, then
it is ok to create a PQ by default. And we're only talking about defaults
here. Anyone should be able to extend QP and override the relevant
getFieldQuery variant and do whatever he wants.

If the question is what the default behavior for (2) should be, then I think,
pending Version, it should create a PQ for f:"abcd" only. And we leave it to
the extender to determine what the right behavior is for him.
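
A rough Python sketch of the two points, under stated assumptions: `ToyParser`, `LegacyPhraseParser` and the `quoted` parameter are hypothetical names, the one-character-per-token analyzer is only a stand-in, and the grammar handles a single `field:text` clause. It shows the parser passing the "was it quoted" fact into an overridable `get_field_query`, so that `f:abcd` and `f:"abcd"` can be told apart and an extender can still restore the old behaviour.

```python
import re

# One clause of the form field:text or field:"text".
CLAUSE_RE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


class ToyParser:
    """Hypothetical parser that keeps the quote information (point 1)."""

    def analyze(self, text):
        # Stand-in analyzer: every character becomes its own token.
        return tuple(text)

    def get_field_query(self, field, text, quoted):
        tokens = self.analyze(text)
        # Default for point 2: a phrase only when the user asked for one.
        if quoted and len(tokens) > 1:
            return ("phrase", field, tokens)
        return ("terms", field, tokens)

    def parse(self, query):
        match = CLAUSE_RE.fullmatch(query)
        if match is None:
            raise ValueError("unsupported query: %r" % query)
        field, quoted_text, bare_text = match.groups()
        if quoted_text is not None:
            return self.get_field_query(field, quoted_text, True)
        return self.get_field_query(field, bare_text, False)


class LegacyPhraseParser(ToyParser):
    """An extender who wants the old behaviour ignores the quoted flag."""

    def get_field_query(self, field, text, quoted):
        return super().get_field_query(field, text, True)
```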

Shai





Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-23 Thread Robert Muir
+1, this is what the patch does. I agree I did a crappy job explaining
the issue.


-- 
Robert Muir
rcm...@gmail.com




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870317#action_12870317
 ] 

Mark Miller commented on LUCENE-2458:
-

I still don't think this falls under bug territory myself - which leads me to 
think that Version is not the correct way to handle it.

The icecream example showing that this is not a 'perfect' solution even for 
English does not show it's a bug in my opinion either. 

I still vote to make this an option. Or make another QueryParser that works 
with more languages, and I guess with less 'biased' english language operators. 
The whole idea of the new QP was to make that type of thing easy if I remember 
right.

bq. So users who want to emulate the English-optimized forced PhraseQuery even 
when user didn't say so explicitly can create QP

This should be an option, not an emulation.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-22 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12870353#action_12870353
 ] 

Shai Erera commented on LUCENE-2458:


FWIW, I agree w/ Mark. I don't think it's a bug, but more of a user option. 
Whether it should be specified by a setter, or an extension of QP - I have no 
strong feelings for either of them, so either would be fine by me.

And for what it's also worth, we once worked w/ a Japanese linguist, who 
suggested that we always convert queries like [abcd] to [abcd "abcd"] or just 
["abcd"] because if someone had already bothered to write them like that, then 
phrase matching should contribute to the rank of the documents. IMO, if someone 
had gone even further by writing [field:abcd], then even if the query should be 
[field:a field:b field:c field:d], executing the query [field:"abcd"] is still 
important and better.

So .. I'm not trying to argue what should be the default behavior, because that 
is subject to personal flavor and apps requirements -- only to emphasize that 
there are many use cases out there, and we should cater for such scenarios.

The extension way is already supported, right? So perhaps we just need to 
document the current behavior, and not change anything? Or, introduce a setter 
that will do the simple thing - either keep it as a phrase or break it down to 
terms. More sophisticated scenarios can be dealt with through extension.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869280#action_12869280
 ] 

Michael McCandless commented on LUCENE-2458:


OK mulling some more on this one...

Even for English, the QP hack (pre-splitting on whitespace, then
turning any text that analyzes to multiple tokens into a
PhraseQuery) doesn't work right.

EG, say I want ice-cream, ice cream and icecream to mean the same
thing.  Really I should do this (handling compounds) during indexing
-- I'll get better relevance and performance.  But say for some reason
I'm doing it at search time...

I would want an analyzer that detects all three forms and in turn
expands to all three forms in the query.

But there's no way to do this today, because QP pre-splits on
whitespace: for "ice cream" the analyzer would separately receive "ice"
and "cream", so it never has a chance to detect this form of the
compound.
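
The effect can be reproduced with a small Python sketch. Everything here is hypothetical: `COMPOUNDS` is a one-entry stand-in dictionary and `compound_analyzer` is a toy that normalizes to a single form instead of expanding to all three, which is enough to show why pre-splitting hides the two-word form from the analyzer.

```python
import re

# Assumed compound dictionary: word pairs that should collapse to one token.
COMPOUNDS = {("ice", "cream"): "icecream"}


def compound_analyzer(text):
    """Toy analyzer: split into words, then merge known adjacent pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in COMPOUNDS:
            tokens.append(COMPOUNDS[pair])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def analyze_presplit(query):
    """Pre-split on whitespace first, so the analyzer sees one chunk at a time."""
    tokens = []
    for chunk in query.split():
        tokens.extend(compound_analyzer(chunk))
    return tokens


def analyze_whole(query):
    """Hand the whole string between syntax characters to the analyzer."""
    return compound_analyzer(query)
```

With pre-splitting, "ice-cream" and "icecream" still normalize, but "ice cream" reaches the analyzer as two separate calls and stays two tokens.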

So... first, I think we should fix QP to not pre-split on whitespace.
QP really should be as language neutral as possible.  It should only
split on syntax chars, and send the whole string in between syntax
chars to the analyzer.

And, second, the QP should not create PhraseQuery when it sees
multiple tokens come back.  This obliterates the OOTB experience for
non-whitespace languages.  And, it doesn't work right for
English... so I think we should deprecate the option and default it to
off.

Really the contrib queryparser is a better fit for doing rewrites like
this: it's able to operate on the abstract query tree, and can easily
do things like rewriting the query to add phrase queries...





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867112#action_12867112
 ] 

Robert Muir commented on LUCENE-2458:
-

{quote}
This is why I like the token attr based solution
{quote}

I am, and will always be, -1 to this solution. Why can't we try to think about 
Lucene from a proper internationalization architecture perspective? 

You shouldn't design APIs around e-mail phenomena in English; that's absurd.

{quote}
BTW, this appears to not be an English-only need; this page
(http://www.seobythesea.com/?p=1206) lists these example languages as
also using English-like compound words: Some example languages that
use compound words include: Afrikaans, Danish, Dutch-Flemish, English,
Faroese, Frisian, High German, Gutnish, Icelandic, Low German,
Norwegian, Swedish, and Yiddish.
{quote}

Please don't try to insinuate that phrases are the way you should handle 
compound terms in these languages unless you have some actual evidence that 
they should be used instead of normal decompounding.

These languages have different syntax and word formation, and it's simply not 
appropriate.





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867117#action_12867117
 ] 

Uwe Schindler commented on LUCENE-2458:
---

Sorry for intervening,

I am of the same opinion as Hoss:
A lot of people are used to creating phrases in search engines by 
joining words with dashes (which StandardAnalyzer handles perfectly with the 
current query parser impl). As quotes are slower to type, I e.g. always use 
this approach to search for phrases in Google: this-is-a-phrase always works 
and brings identical results to "this is a phrase" (only the ranking is 
sometimes slightly different in Google).

So we should have at least some possibility to switch on the behavior that 
creates phrase queries out of multiple tokens with posIncr>0 -- but I am +1 on 
fixing the problem for non-whitespace languages like CJK. It's also broken that 
QueryParser parses whitespace in its javacc grammar; in my opinion, this should 
be done by the analyzer (and not partly by the analyzer and partly by the QP 
grammar).




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867147#action_12867147
 ] 

Yonik Seeley commented on LUCENE-2458:
--

bq. This is why I like the token attr based solution

+1

Although I think it's more general than de-compounding.
An attribute that says "these tokens go together" or "these tokens should be 
considered one unit" seems like nice generic functionality, and is unrelated to 
any specific language or search feature.
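
One way to picture such an attribute is the Python sketch below. It is an illustration only: `Token`, its `unit` field and `build_clauses` are hypothetical and do not correspond to any existing Lucene attribute. The analyzer is assumed to tag tokens that belong together with a shared unit id, and the query builder forms a phrase only for those runs.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Optional


@dataclass(frozen=True)
class Token:
    text: str
    # Hypothetical attribute: adjacent tokens with the same id form one unit.
    unit: Optional[int] = None


def build_clauses(tokens):
    """Turn runs of tokens sharing a unit id into phrases, the rest into terms."""
    clauses = []
    for unit, run in groupby(tokens, key=lambda token: token.unit):
        texts = tuple(token.text for token in run)
        if unit is not None and len(texts) > 1:
            clauses.append(("phrase", texts))
        else:
            clauses.extend(("term", text) for text in texts)
    return clauses
```

In this sketch the decision moves from the parser's token count to the analyzer: a CJK unigram analyzer would simply never set `unit`, so its output stays a set of independent terms.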


 queryparser shouldn't generate phrasequeries based on term count
 

 Key: LUCENE-2458
 URL: https://issues.apache.org/jira/browse/LUCENE-2458
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Robert Muir
Priority: Critical

 The current method in the queryparser to generate phrasequeries is wrong:
 The Query Syntax documentation 
 (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
 {noformat}
 A Phrase is a group of words surrounded by double quotes such as "hello 
 dolly".
 {noformat}
 But as we know, this isn't actually true.
 Instead the terms are first divided on whitespace, then the analyzer term 
 count is used as some sort of heuristic to determine if it's a phrase query 
 or not.
 This assumption is a disaster for languages that don't use whitespace 
 separation: CJK, compounding European languages like German, Finnish, etc. It 
 also makes it difficult for people to use n-gram analysis techniques. In these 
 cases you get bad relevance (MAP improves nearly *10x* if you use a 
 PositionFilter at query-time to turn this off for Chinese).
 Even for English, this undocumented behavior is bad. Perhaps in some cases 
 it's being abused as some heuristic to second guess the tokenizer and piece 
 back things it shouldn't have split, but for large collections, doing things 
 like generating phrasequeries because StandardTokenizer split a compound on a 
 dash can cause serious performance problems. Instead people should analyze 
 their text with the appropriate methods, and QueryParser should only generate 
 phrase queries when the syntax asks for one.
 The PositionFilter in contrib can be seen as a workaround, but it's pretty 
 obscure and people are not familiar with it. The result is we have bad 
 out-of-box behavior for many languages, and bad performance for others on 
 some inputs.
 I propose instead that we change the grammar to actually look for double 
 quotes to determine when to generate a phrase query, consistent with the 
 documentation.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867151#action_12867151
 ] 

Robert Muir commented on LUCENE-2458:
-

{quote}
An attribute that says these tokens go together or these tokens should be 
considered one unit seems like nice generic functionality, and is unrelated to 
any specific language or search feature.
{quote}

No, if they are one unit for search, they are one token.

Instead the tokenizer should be fixed so that they are one token, instead of 
making all languages suffer to compensate for a crappy English tokenizer.





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866528#action_12866528
 ] 

Michael McCandless commented on LUCENE-2458:


This is sneaky behavior on QueryParser's part!  I didn't realize it did this.

What are some real use-cases where this is good?  WordDelimiterFilter seems 
like a good example (eg, Wi-Fi -> Wi Fi).

It sounds like it's a very bad default for non-whitespace languages.

It seems like we should make it controllable, switch it under Version, and 
change the default going forward to not do this?

bq. Token Attributes could be used to instruct QueryParser as to the intent 
behind a stream of multiple tokens?

This seems like a good idea (since we seem to have real-world cases where it's 
very useful and others where it's very bad)?  Could/should it be per-analyzer?  
(ie, WDF would always do this but, say, ICUAnalyzer would never).  Or, 
per-token created?
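
To make the per-token idea concrete, here is a minimal sketch in Python. It is a toy model, not Lucene code: the `Token` class, the `same_unit` flag and this `get_field_query` are hypothetical names invented for illustration. The assumption is that an analyzer such as WDF would mark the pieces of a split word as belonging together, while a CJK-style analyzer would leave the flag unset.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    same_unit: bool = False  # hypothetical per-token hint set by the analyzer

def get_field_query(tokens):
    """Toy decision rule: build a phrase only when the analyzer asked for it."""
    if len(tokens) == 1:
        return ("TERM", tokens[0].text)
    if len(tokens) > 1 and all(t.same_unit for t in tokens):
        return ("PHRASE", [t.text for t in tokens])
    return ("BOOLEAN", [t.text for t in tokens])

# A WDF-like analyzer would flag both halves of "Wi-Fi" as one unit.
wdf_tokens = [Token("wi", same_unit=True), Token("fi", same_unit=True)]
# A CJK-like analyzer would emit independent tokens.
cjk_tokens = [Token("中"), Token("国")]

print(get_field_query(wdf_tokens))
print(get_field_query(cjk_tokens))
```

With a per-token flag the parser would not need to know which analyzer produced the stream, which is one possible argument for per-token over per-analyzer.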




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Itamar Syn-Hershko
The QueryParser also fails to correctly parse Hebrew acronyms; although not
being an integral part of the current discussion, I thought this would be
the best place to bring that up.

Hebrew acronyms are made up of letters with a single double-quote char
within, for example: MNK"L (Hebrew for CEO). That double-quote char usually
comes at the second-to-last position of the word, but in some cases it can
come earlier (MNK"LIT). Since the QP expects a pair of double-quotes
enclosing a phrase, an exception will be thrown if such a word is
passed to it, or an incorrect phrase query will be produced if two acronyms
are used together in a query string. Not sure which is worse.

Perhaps while you're at it you could make sure to only create a phrase query
if a quote is followed by a space - hence is definitely at the end of a
word, and not just assume it to be equivalent to a white space?

Although there's no good open Hebrew analyzer for Lucene yet hence no
motivation for this to be fixed, I'm working on one as we speak and
hopefully will have something to show in the next few weeks/days. It would
be nice to have at least this issue closed within the Lucene core code.

Thanks,

Itamar Syn-Hershko





Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko ita...@code972.com wrote:
 The QueryParser also fails to correctly parse Hebrew acronyms; although not
 being an integral part of the current discussion, I thought this would be
 the best place to bring that up.


Just as I don't think Analysis should do QueryParsing, I don't think
QueryParsing should do Analysis either.
Similar problems to this exist in other languages (I have to escape :
for some, because lucene wants to interpret it as a field name).

But this can be easily remedied on the application side: it's
documented and understood that the double-quote is a special
character, and there is an escape mechanism so you can escape the ones
you think are acronyms.
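
As a rough illustration of that application-side fix, the sketch below backslash-escapes a double quote only when it sits between two word characters, before the string is handed to the query parser. The function name `escape_inner_quotes` and the transliterated acronyms are assumptions made for this example; the only Lucene fact relied on is that the query syntax uses a backslash to escape special characters.

```python
import re

def escape_inner_quotes(query):
    """Escape a double quote that has a word character on both sides.

    Quotes that open or close a phrase are next to whitespace, a colon or the
    string boundary, so they are left untouched.
    """
    return re.sub(r'(?<=\w)"(?=\w)', r'\\"', query)

print(escape_inner_quotes('MNK"L MNK"LIT'))
print(escape_inner_quotes('title:"hello dolly"'))
```

This is only a heuristic: an application that knows its acronym list could escape those words explicitly instead.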

This issue is about a buggy implementation: it's not documented
and only internal to how the queryparser determines what is a phrase
query or not (and, contrary to what you would believe from the
documentation, the choice of whether or not to make a PhraseQuery is
not based on syntax one bit!)

-- 
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Mark Miller

On 5/12/10 9:25 AM, Robert Muir wrote:

(and, contrary to what you would believe from the
documentation, the choice of whether or not to make a PhraseQuery is
not based on syntax one bit!)



That's a major exaggeration - quoting text plays a large role in whether 
or not you will get a phrase query.



--
- Mark

http://www.lucidimagination.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
On Wed, May 12, 2010 at 11:16 AM, Mark Miller markrmil...@gmail.com wrote:

 That's a major exaggeration - quoting text plays a large role in whether or
 not you will get a phrase query.


No, it has nothing to do with it in the implementation. Quoting only
escapes the whitespace, and the quotes are then discarded. This is clear
from looking at the grammar.

The logic that then determines whether you get a phrase query is the huge mess
of code in getFieldQuery, but it's not based on the double quotes at
all.

For example a list of Chinese or Thai words gets a phrase query, only
because they don't use whitespace between words.
But a similar list of English words gets a boolean query.
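
The behavior described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the real grammar or getFieldQuery: `toy_analyzer` (one term per Han character, otherwise one term per word) and `parse` are hypothetical stand-ins. The point it tries to show is that the quotes only keep the whitespace together, and the analyzer's term count alone picks the query type.

```python
import re

def is_han(ch):
    return '\u4e00' <= ch <= '\u9fff'

def toy_analyzer(chunk):
    """Hypothetical analyzer: one term per Han character, else one per word."""
    terms = []
    for word in chunk.split():
        if all(is_han(ch) for ch in word):
            terms.extend(word)
        else:
            terms.append(word.lower())
    return terms

def parse(query):
    """Split on whitespace (quotes only protect it), then count analyzer terms."""
    clauses = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        terms = toy_analyzer(quoted or bare)  # the quotes are gone by now
        if len(terms) > 1:
            clauses.append(("PHRASE", terms))
        elif len(terms) == 1:
            clauses.append(("TERM", terms[0]))
    return clauses

print(parse('hello dolly'))    # two independent terms
print(parse('"hello dolly"'))  # phrase, because one chunk gave two terms
print(parse('中国人民'))        # phrase too, although nothing was quoted
```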

-- 
Robert Muir
rcm...@gmail.com




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866595#action_12866595
 ] 

Marvin Humphrey commented on LUCENE-2458:
-

I have mixed feelings about this for English.  It's a weakness of our engine
that we do not take position of terms within a query string into account.  At
times I've tried to modify the scoring hierarchy to improve the situation, but
I gave up because it was too difficult.  This behavior of QueryParser is a
sneaky way of getting around that limitation: input which should almost
certainly be treated as a phrase query gets turned into one.  It's the one
place where we actually exploit position data within the query string.

Mike's wi-fi example, though, wouldn't suffer that badly.  The terms 'wi'
and 'fi' are unlikely to occur much outside the context of 'wi-fi/wi fi/wifi'.
And treating 'wi-fi' as a phrase still won't conflate results with 'wifi' as
it would ideally.  

The example I would use doesn't typically apply to Lucene.  Lucene's
StandardAnalyzer tokenizes URLs as wholes, but KinoSearch's analogous analyzer
breaks them up into individual components.  As described in another recent
thread, this allows a search for 'example.com' to match a document which
contains the URL 'http://www.example.com/index.html'.  It would suck if all of
a sudden a search for 'example.com' started matching every document that
contained 'com'. 
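
A small Python sketch of the 'example.com' case, assuming a tokenizer that breaks URLs into components the way described above. The `tokenize`, `matches_any` and `matches_phrase` helpers and the two sample documents are made up for illustration; they only contrast "any term matches" with "terms match as an adjacent sequence".

```python
import re

def tokenize(text):
    """Hypothetical analyzer that breaks URLs into lowercase components."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_any(doc_tokens, terms):
    return any(t in doc_tokens for t in terms)

def matches_phrase(doc_tokens, terms):
    n = len(terms)
    return any(doc_tokens[i:i + n] == terms
               for i in range(len(doc_tokens) - n + 1))

docs = {
    "a": tokenize("see http://www.example.com/index.html for details"),
    "b": tokenize("mail sales@widgets.com today"),
}
query = tokenize("example.com")  # two terms: example, com

for name, tokens in docs.items():
    print(name, matches_any(tokens, query), matches_phrase(tokens, query))
```

Only document "a" survives the phrase match, while the OR-style match also pulls in "b" because it contains 'com'.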

You could, and theoretically should, address this problem with sophisticated
analysis.  But it does make it harder to write a good Analyzer.  You make it
more important to solve what Yonik calls the 'e space mail' problem by making
it worse.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866603#action_12866603
 ] 

Marvin Humphrey commented on LUCENE-2458:
-

 Because they show it's 10x better to use this operator for Chinese

Another way to achieve this 10x improvement is to change how QP performs its
first stage of tokenization, as you and I discussed at ApacheCon Oakland.

Right now QP splits on whitespace.  If that behavior were customizable, e.g.
via a splitter Analyzer, then individual Han characters would get submitted
to getFieldQuery() -- and thus getFieldQuery() would no longer turn long
strings of Han characters into a PhraseQuery.  It seems wrong to continue to
push entire query strings from non-whitespace-delimited languages down into
getFieldQuery().
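
A minimal sketch of what a customizable first-stage split could look like, written as a Python toy rather than against the real QueryParser. All names here (`whitespace_splitter`, `han_aware_splitter`, `char_analyzer`, and this `get_field_query`) are hypothetical; the term-count rule inside `get_field_query` is left unchanged on purpose, so the only thing that varies is the size of the chunks it receives.

```python
def is_han(ch):
    return '\u4e00' <= ch <= '\u9fff'

def whitespace_splitter(query):
    return query.split()

def han_aware_splitter(query):
    """Hypothetical splitter: whitespace first, then one chunk per Han char."""
    chunks = []
    for piece in query.split():
        buf = ""
        for ch in piece:
            if is_han(ch):
                if buf:
                    chunks.append(buf)
                    buf = ""
                chunks.append(ch)
            else:
                buf += ch
        if buf:
            chunks.append(buf)
    return chunks

def char_analyzer(chunk):
    """Hypothetical analyzer: one term per Han character, else the chunk."""
    han = [ch for ch in chunk if is_han(ch)]
    return han if han else [chunk.lower()]

def get_field_query(chunk):
    terms = char_analyzer(chunk)
    return ("PHRASE", terms) if len(terms) > 1 else ("TERM", terms[0])

def parse(query, splitter):
    return [get_field_query(chunk) for chunk in splitter(query)]

print(parse("中国人民", whitespace_splitter))  # one long phrase
print(parse("中国人民", han_aware_splitter))   # four single terms
```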




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Ivan Provalov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1285#action_1285
 ] 

Ivan Provalov commented on LUCENE-2458:
---

Robert has asked me to post our test results on the Chinese Collection. We used 
the following data collection from TREC:

http://trec.nist.gov/data/qrels_noneng/index.html
qrels.trec6.29-54.chinese.gz
qrels.1-28.chinese.gz

http://trec.nist.gov/data/topics_noneng
TREC-6 Chinese topics (.gz)
TREC-5 Chinese topics (.gz)

Mandarin Data Collection
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2000T52

Analyzer Name     Plain analyzers   Added PositionFilter (only at query time)
ChineseAnalyzer   0.028             0.264
CJKAnalyzer       0.027             0.284
SmartChinese      0.027             0.265
IKAnalyzer        0.028             0.259

(Note: IKAnalyzer has its own IKQueryParser which yields 0.084 for the average 
precision)
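
For reference, the improvement factors implied by these figures can be worked out with a few lines of Python. The numbers are copied from the results above; treating them as mean average precision is an assumption based on the note about average precision.

```python
# (plain analyzer, with PositionFilter at query time), copied from the results
results = {
    "ChineseAnalyzer": (0.028, 0.264),
    "CJKAnalyzer": (0.027, 0.284),
    "SmartChinese": (0.027, 0.265),
    "IKAnalyzer": (0.028, 0.259),
}

# Ratio of the score with PositionFilter to the score without it.
ratios = {name: with_pf / plain for name, (plain, with_pf) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

All four ratios land between roughly 9x and 10.5x, which is in line with the "nearly 10x" figure quoted in the issue description.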

Thanks,

Ivan Provalov




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866693#action_12866693
 ] 

Marvin Humphrey commented on LUCENE-2458:
-

 I'm honestly having a tough time seeing where to proceed on this issue.

Change the initial split on whitespace to be customizable.  Override the
splitting behavior for non-whitespace-delimited languages and feed
getFieldQuery() smaller chunks.

That solves your problem without removing behavior most people believe to be
helpful.  Insisting on that orthogonal change is what is holding things up.





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866696#action_12866696
 ] 

Hoss Man commented on LUCENE-2458:
--

bq. Instead the queryparser should only form phrasequeries when you use double 
quotes, just like the documentation says.

I'll grant you that the documentation is wrong -- I view that as a bug in the 
documentation, not the code.  

Saying that a PhraseQuery object should only ever be constructed if the user 
uses quotes is like saying that a BooleanQuery should only ever be constructed 
if the user specifies boolean operators -- there is no rule that the syntax 
must match the query structure, the same Query classes can serve multiple 
purposes.  The parser syntax should be what makes sense for the user, and the 
query structure constructed should be what makes sense for the index, based on 
the syntax used by the user.

If I have built an index consisting entirely of ngrams, the end user shouldn't 
have to know that -- they shouldn't have to put every individual word in quotes 
to force a PhraseQuery to be constructed out of the ngram tokenstream produced 
by an individual word.

bq. Why isn't english treated this way too? I don't consider this bias towards 
english at all costs including preventing languages such as Chinese from 
working at all very fair, I think its a really ugly stance for Lucene to take.

I personally don't view it as an English bias ... to me it is a backwards 
compatibility bias.

I'm totally happy to make things configurable, but if two diametrically opposed 
behaviors are both equally useful, and a choice needs to be made 
between leaving the default configuration the way the current hardcoded 
behavior is, or making the default the exact opposite of the current 
hardcoded behavior, then I would prefer to leave the default alone -- 
especially since this behavior has been around for so long, and many Analyzers 
and TokenFilters have been written with this behavior specifically in mind 
(several examples of this are in the Lucene code base -- and if we have them in 
our own code, you can be sure they exist in the wild of client code that 
would break if this behavior changes by default).

Once again: if this is a problem that can be solved per instance with token 
attributes, then by all means let's make *all* of the TokenFilters that come 
out of the box implement this appropriately (English and non-English alike) 
so that people who change the default settings on the queryparser get the 
correct behavior regardless of language.  But all other things being equal, 
let's keep the behavior working the way it has to avoid surprises.

bq. What are some real use-cases where this is good?

 * WordDelimiterFilter ('wi-fi' is a pathologically bad example for this issue 
because as robert pointed out 'wi' and 'fi' don't tend to exist independently 
in english, but people tend to get annoyed when 'race-horse' matches all docs 
containing 'race' or 'horse')
 * single word to multiword synonym expansion/transformation (particularly 
acronym expansion: GE = General Electric)
 * Ngram indexing for fuzzy matching (if someone searches for the word 
'billionaire' they're going to be surprised to get documents containing 'lion')
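
The ngram case from the list above can be checked with a short Python toy. The trigram size and the helper names are assumptions for illustration, not what any particular Lucene ngram filter does by default; the sketch only shows why an OR of ngrams over-matches where a phrase of ngrams does not.

```python
def trigrams(word):
    """Character trigrams of a word, in order."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def matches_any(doc_grams, query_grams):
    return any(g in doc_grams for g in query_grams)

def matches_phrase(doc_grams, query_grams):
    n = len(query_grams)
    return any(doc_grams[i:i + n] == query_grams
               for i in range(len(doc_grams) - n + 1))

query = trigrams("billionaire")
lion_doc = trigrams("lion")            # shares 'lio' and 'ion' with the query
rich_doc = trigrams("billionaire")

print(matches_any(lion_doc, query), matches_phrase(lion_doc, query))
print(matches_any(rich_doc, query), matches_phrase(rich_doc, query))
```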





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866695#action_12866695
 ] 

Robert Muir commented on LUCENE-2458:
-

{quote}
Change the initial split on whitespace to be customizable. Override the
splitting behavior for non-whitespace-delimited languages and feed
getFieldQuery() smaller chunks.
{quote}

Whitespace doesn't separate words in the majority of the world's languages, 
including English.

The responsibility should be instead on English to do its language-specific 
processing, not on everyone else to dodge it.

 queryparser shouldn't generate phrasequeries based on term count
 

 Key: LUCENE-2458
 URL: https://issues.apache.org/jira/browse/LUCENE-2458
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Robert Muir
Priority: Critical

 The current method in the queryparser to generate phrasequeries is wrong:
 The Query Syntax documentation 
 (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
 {noformat}
 A Phrase is a group of words surrounded by double quotes such as "hello 
 dolly".
 {noformat}
 But as we know, this isn't actually true.
 Instead the terms are first divided on whitespace, then the analyzer term 
 count is used as a heuristic to determine whether it's a phrase query 
 or not.
 This assumption is a disaster for languages that don't use whitespace 
 separation: CJK, compounding European languages like German, Finnish, etc. It 
 also makes it difficult for people to use n-gram analysis techniques. In these 
 cases you get bad relevance (MAP, mean average precision, improves nearly 
 *10x* if you use a PositionFilter at query-time to turn this off for Chinese).
 Even for English, this undocumented behavior is bad. Perhaps in some cases 
 it's being abused as a heuristic to second-guess the tokenizer and piece 
 back together things it shouldn't have split, but for large collections, 
 generating phrasequeries because StandardTokenizer split a compound on a 
 dash can cause serious performance problems. Instead people should analyze 
 their text with the appropriate methods, and QueryParser should only generate 
 phrase queries when the syntax asks for one.
 The PositionFilter in contrib can be seen as a workaround, but it's pretty 
 obscure and people are not familiar with it. The result is we have bad 
 out-of-box behavior for many languages, and bad performance for others on 
 some inputs.
 I propose instead that we change the grammar to actually look for double 
 quotes to determine when to generate a phrase query, consistent with the 
 documentation.
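
The difference between the behavior described above and the proposal can be sketched in plain Java. This is only an illustration and not QueryParser's actual code: `analyze` is a hypothetical stand-in for an analyzer that splits on anything that is not a letter or digit, and `heuristicIsPhrase` and `syntaxIsPhrase` are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

final class PhraseDecisionSketch {

    // Hypothetical stand-in for an analyzer: splits a chunk on any
    // character that is not a letter or a digit.
    static List<String> analyze(String chunk) {
        List<String> terms = new ArrayList<>();
        for (String t : chunk.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Behavior described in the issue: a chunk becomes a phrase whenever
    // analysis yields more than one term, quotes or not.
    static boolean heuristicIsPhrase(String chunk) {
        return analyze(chunk).size() > 1;
    }

    // Proposed behavior: only the double-quote syntax asks for a phrase.
    static boolean syntaxIsPhrase(String chunk) {
        return chunk.length() >= 2
                && chunk.startsWith("\"")
                && chunk.endsWith("\"");
    }

    public static void main(String[] args) {
        for (String chunk : new String[] {"wi-fi", "\"hello dolly\"", "hello"}) {
            System.out.println(chunk
                    + " heuristic=" + heuristicIsPhrase(chunk)
                    + " syntax=" + syntaxIsPhrase(chunk));
        }
    }
}
```

The unquoted chunk wi-fi is where the two rules disagree: the term-count heuristic turns it into a phrase, while the syntax rule does not.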

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866698#action_12866698
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. but all other things being equal let's keep the behavior working the way it 
has to avoid surprises.

This attitude makes me sick. The surprise is to the CJK users that get no 
results due to undocumented, English-specific hacks that people refuse to let 
go of.




RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Itamar Syn-Hershko
I think we understand each other perfectly well. I still think resolving
this is very simple: apply the correct logic (ignore double quotes
followed by a character). It isn't enforced today, and once it is, it won't
cause any unexpected behavior. This isn't an analysis-related task,
and I'm not sure what makes you insist so strongly. I will open a
dedicated JIRA ticket for this discussion if it doesn't become part of the
current one.
 
Itamar.

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Thursday, May 13, 2010 1:42 AM
To: dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
phrasequeries based on term count

On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko ita...@code972.com
wrote:
 Never did I request the QP to do Analysis. I simply mentioned this bug 
 - what this definitely is -

It's definitely not a bug for Hebrew: there is a Unicode character for
gershayim (U+05F4), so technically this should be used according to Unicode.

It's arguably your responsibility to convert your data to Unicode before
passing it through Lucene, and that includes disambiguating when a double quote
should be gershayim.
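
One way an application could do that disambiguation before handing text to the parser is sketched below. It is a guess at a workable rule, not anything from Lucene: the class and method names are hypothetical, and the assumption that a double quote sitting directly between two Hebrew letters (U+05D0 to U+05EA) is a gershayim will not hold for every input.

```java
final class GershayimSketch {

    // Assumption: "Hebrew letter" means the range U+05D0..U+05EA only.
    static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    // Replaces an ASCII double quote that sits directly between two Hebrew
    // letters with U+05F4 (HEBREW PUNCTUATION GERSHAYIM). All other quotes
    // are left alone so they can still act as phrase delimiters.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean between = c == '"'
                    && i > 0
                    && i + 1 < s.length()
                    && isHebrewLetter(s.charAt(i - 1))
                    && isHebrewLetter(s.charAt(i + 1));
            sb.append(between ? '\u05F4' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An acronym written with an ASCII quote between Hebrew letters.
        String acronym = "\u05E6\u05D4\"\u05DC";
        System.out.println(normalize(acronym).indexOf('"') < 0);
        // A quoted phrase keeps its delimiters.
        System.out.println(normalize("\"hello dolly\""));
    }
}
```

This keeps the language-specific step in the application's own preprocessing rather than in the query parser grammar.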

--
Robert Muir
rcm...@gmail.com




Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread Robert Muir
Internationalization doesn't work by just piling hacks for language X,
language Y, and language Z on top of each other.

Just like I want the English hack removed, I strongly recommend
against adding any Hebrew hack.

-- 
Robert Muir
rcm...@gmail.com




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-12 Thread DM Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866954#action_12866954
 ] 

DM Smith commented on LUCENE-2458:
--

As I see it there are two issues:
1) Backward compatibility. 
2) Correctness according to the syntax definition of a query.

Let me preface the following by saying I have not studied the query parser in 
Lucene. Over 20 years ago I got an MS in compiler writing. I've been away from 
it for quite a while.

So, IMHO as a former compiler writer:

Maybe I'm just not getting it but it should be trivial to define the grammar 
(w/ precedence for any ambiguity, if necessary) and implement it. The tokenizer 
for the parser should have the responsibility to break the input into sequences 
of meta and non-meta. This tokenizer should not be anything more than what the 
parser requires.

The non-meta reasonably is subject to further tokenization/analysis. This 
further analysis should be entirely under the user's control. It should not be 
part of the parser.

Regarding the issue, I think it would be best if a quotation were the sole 
criterion for determining what is a phrase, not some heuristic 
analysis of the token stream.
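
A minimal sketch of that split, assuming the double quote is the only meta character (the real query syntax has more, such as field names and boolean operators). The names `MetaLexerSketch`, `Kind` and `Chunk` are invented for the illustration. Unquoted runs are handed on whole, so any further tokenization stays with whatever analyzer the user picks.

```java
import java.util.ArrayList;
import java.util.List;

final class MetaLexerSketch {

    enum Kind { TEXT, PHRASE }

    static final class Chunk {
        final Kind kind;
        final String text;

        Chunk(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")";
        }
    }

    // Splits the input on the meta character '"' only. Text between a pair
    // of quotes becomes one PHRASE chunk; every other run becomes one TEXT
    // chunk (trimmed, empty runs dropped). An unterminated quote is treated
    // as running to the end of the input.
    static List<Chunk> lex(String input) {
        List<Chunk> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '"') {
                int end = input.indexOf('"', i + 1);
                if (end < 0) {
                    end = input.length();
                }
                out.add(new Chunk(Kind.PHRASE, input.substring(i + 1, end)));
                i = end + 1;
            } else {
                int end = input.indexOf('"', i);
                if (end < 0) {
                    end = input.length();
                }
                String text = input.substring(i, end).trim();
                if (!text.isEmpty()) {
                    out.add(new Chunk(Kind.TEXT, text));
                }
                i = end;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("wi-fi router \"hello dolly\""));
    }
}
```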





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-11 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866341#action_12866341
 ] 

Hoss Man commented on LUCENE-2458:
--

Robert: do you have a specific suggestion for what QueryParser should do if a 
single chunk of input causes the Analyzer to produce multiple tokens that are 
not at the same position (i.e. the current case where QueryParser produces a 
PhraseQuery even if there are no quotes)?

I.e. if the query parser is asked to parse... 
{code}fieldName:A-Field-Value{code}
...and the Analyzer produces three tokens...
 * A (at position 0)
 * Field (at position 1)
 * Value (at position 2)

...what should the resulting Query object be?




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866353#action_12866353
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. ...what should the resulting Query object be?

a Boolean Query formed with the default operator.
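
For the A-Field-Value example, the two outcomes can be written out as strings. This is a rough sketch that builds readable text rather than Lucene Query objects; the class, the `Operator` enum and both method names are hypothetical, and the notation only loosely follows the query syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class DefaultOperatorSketch {

    enum Operator { AND, OR }

    // Unquoted case as proposed: one clause per analyzed term, combined with
    // the default operator. AND is rendered as required (+) clauses.
    static String booleanQuery(String field, List<String> terms, Operator op) {
        String prefix = (op == Operator.AND) ? "+" : "";
        return terms.stream()
                .map(t -> prefix + field + ":" + t)
                .collect(Collectors.joining(" "));
    }

    // Quoted case: a single phrase over the same terms.
    static String phraseQuery(String field, List<String> terms) {
        return field + ":\"" + String.join(" ", terms) + "\"";
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("A", "Field", "Value");
        System.out.println(booleanQuery("fieldName", terms, Operator.OR));
        System.out.println(booleanQuery("fieldName", terms, Operator.AND));
        System.out.println(phraseQuery("fieldName", terms));
    }
}
```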





[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-11 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866363#action_12866363
 ] 

Hoss Man commented on LUCENE-2458:
--

bq. a Boolean Query formed with the default operator.

That seems like equally bad default behavior -- lots of existing TokenFilters 
produce chains of tokens for situations where the user creating the query 
string clearly intended to be searching for a single word and has no idea 
that as an implementation detail multiple tokens were produced under the covers 
(i.e. WordDelimiterFilter, Ngrams, etc...)

I haven't thought this through very well, but perhaps this is an area where 
(the new) Token Attributes could be used to instruct QueryParser as to the 
intent behind a stream of multiple tokens? A new Attribute could be used on 
each token to convey when that token should be combined with the previous 
token, and in what way: as a phrase, as a conjunction or as a disjunction. 
(This could still be orthogonal to the position, which would indicate slop/span 
type information like it does currently.)

Stock analysis components that produce multiple tokens could be modified to add 
this attribute fairly easily (it should be a relatively static value for any 
component that currently splits tokens), and QueryParser could have an option 
controlling what to do if it encounters a token w/o this attribute (perhaps 
even two options: one for quoted input chunks and one for unquoted input 
chunks).

That way the default could still work in a back-compatible way, but people 
using languages that don't use whitespace separation *and* are using older (or 
custom) analyzers that don't know about this attribute could set a simple query 
parser property to force this behavior.

Would that make sense? (asks the man who only vaguely understands Token 
Attributes at this point)
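
To make the idea concrete, here is a small sketch in plain Java. Nothing in it is an existing Lucene API: `Join`, `Tok` and `joins` are hypothetical, with `Tok.join` standing in for the proposed attribute and `fallback` standing in for the proposed QueryParser option for tokens that carry no hint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class JoinHintSketch {

    // How a token should be combined with the token before it.
    enum Join { PHRASE, AND, OR }

    static final class Tok {
        final String term;
        final Join join; // null means the analysis component set no hint

        Tok(String term, Join join) {
            this.term = term;
            this.join = join;
        }
    }

    // Returns one decision per gap between adjacent tokens. A token's own
    // hint wins; tokens without a hint fall back to the parser-level option.
    static List<Join> joins(List<Tok> tokens, Join fallback) {
        List<Join> out = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            Join hint = tokens.get(i).join;
            out.add(hint != null ? hint : fallback);
        }
        return out;
    }

    public static void main(String[] args) {
        // A splitting filter that knows about the hint marks its output.
        List<Tok> hinted = Arrays.asList(
                new Tok("wi", null), new Tok("fi", Join.PHRASE));
        // An older analyzer produces tokens without any hint.
        List<Tok> unhinted = Arrays.asList(
                new Tok("a", null), new Tok("b", null), new Tok("c", null));
        System.out.println(joins(hinted, Join.OR));
        System.out.println(joins(unhinted, Join.OR));
    }
}
```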




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866368#action_12866368
 ] 

Robert Muir commented on LUCENE-2458:
-

bq. That seems like equally bad default behavior

Do you have measurements to support this? Because mine show it's nearly 10x 
better to use this operator for Chinese :)

bq. I haven't thought this through very well, but perhaps this is an area where 
(the new) Token Attributes

I disagree. Instead the queryparser should only form phrasequeries when you use 
double quotes, just like the documentation says.




[jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

2010-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12866374#action_12866374
 ] 

Robert Muir commented on LUCENE-2458:
-

By the way, Hoss Man, you said it best yourself:

{quote}
lots of existing TokenFilters produce chains of tokens for situations where the 
user creating the query string clearly intended to be searching for a single 
word and has no idea that as an implementation detail multiple tokens were 
produced under the covers (ie: WordDelimiterFilter, Ngrams, etc...)
{quote}

"User clearly intended" is wrong. WordDelimiterFilter will break Tibetan text in 
a similar manner (it uses no spaces between words), yet no user clearly 
intended to form phrase queries.

Users clearly intend to form phrase queries only when they use the phrase query 
operator. That's how the query parser is documented to work, and it's a bug that 
it doesn't work that way.
