[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-07-05 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605-dont-split-by-default.patch

Patch for master only that switches the default split-on-whitespace behavior 
from *do* to *don't*.

Committing shortly.

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Attachments: LUCENE-2605-dont-split-by-default.patch, 
> LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, 
> LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each 
> whitespace-separated term to its own independent token stream.
> This breaks the following at query time, because they can't see across 
> whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> query time, but in many cases they can't. Instead, preferably the 
> queryparser would parse around only real 'operators'.
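
To make the problem in the description concrete, here is a minimal toy sketch. It does not use Lucene: the class name and the analyze() method are hypothetical stand-ins for an analyzer chain that ends in a two-word shingle filter. It only illustrates why an analyzer that is handed one whitespace-separated chunk at a time can never emit a shingle.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Lucene's analysis API.
class SplitOnWhitespaceDemo {

    // Hypothetical stand-in for an analyzer chain ending in a shingle filter:
    // emits every single word plus every two-word shingle of its input.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    // Behavior described in the issue: the parser splits first,
    // so the analyzer only ever sees one word at a time.
    static List<String> parseSplitting(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    // Requested behavior: the whole operator-free text reaches the analyzer at once.
    static List<String> parseWhole(String query) {
        return analyze(query);
    }

    public static void main(String[] args) {
        System.out.println(parseSplitting("new york city")); // no shingles
        System.out.println(parseWhole("new york city"));     // shingles present
    }
}
```

With the splitting path the shingles "new york" and "york city" never appear, which is the index-time versus query-time mismatch the description complains about.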



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-06-30 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605.patch

Okay, really final patch.  On SOLR-9185 I was having trouble integrating the 
Solr standard QP's comment support with the whitespace tokenization I 
introduced here, so I tried switching the Solr parser back to ignoring both 
whitespace and comments, and it worked.  The patch brings this grammar 
simplification back here too - in addition to many fewer whitespace mentions in 
the rules, fewer (and less complicated) lookaheads are required.

I've included the generated files in the patch.

No tests changed from the last patch.

All Lucene tests pass, and precommit passes.

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, 
> LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-06-27 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605.patch

Patch adds lucene-test-framework files missing from last version of the patch.  
Also adds a CHANGES entry.

I plan on committing in a couple days if there are no objections.

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, 
> LUCENE-2605.patch, LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-06-24 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605.patch

Patch adding an option to preserve the old behavior: 
{{\{set/get\}SplitOnWhitespace()}}, defaulting to {{true}} (the current 
behavior).

Though nobody said so here, on the Solr issue (SOLR-9185), a couple of people 
mentioned that the old behavior should be preserved, and that the new behavior 
should not become the default until a major release.

That's what this patch does.
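
As a rough picture of the option described above, here is a toy class. It borrows the accessor names from this comment, but it is not the Lucene QueryParser, and analyzerInputs() is a hypothetical helper invented for illustration. It only shows a flag that defaults to the old splitting behavior and can be switched off.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy class: NOT the Lucene QueryParser, only a sketch of a default-true flag.
class ToySplitOption {

    // Defaults to true, matching the default described in this comment.
    private boolean splitOnWhitespace = true;

    void setSplitOnWhitespace(boolean splitOnWhitespace) {
        this.splitOnWhitespace = splitOnWhitespace;
    }

    boolean getSplitOnWhitespace() {
        return splitOnWhitespace;
    }

    // Hypothetical helper: returns the strings that would each be sent to the
    // analyzer separately, for a query that contains no operators.
    List<String> analyzerInputs(String query) {
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        if (splitOnWhitespace) {
            return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        }
        List<String> whole = new ArrayList<>();
        whole.add(trimmed);
        return whole;
    }

    public static void main(String[] args) {
        ToySplitOption parser = new ToySplitOption();
        System.out.println(parser.analyzerInputs("wi fi")); // old behavior: two inputs
        parser.setSplitOnWhitespace(false);
        System.out.println(parser.analyzerInputs("wi fi")); // new behavior: one input
    }
}
```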

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, 
> LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-06-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Fix Version/s: (was: 6.0)
   (was: 4.9)

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-06-03 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605.patch

Patch, I think it's ready.

I've pulled MockSynonymFilter/Analyzer out into their own files in 
lucene-test-framework, added tests for them, and added a fixed multi-term 
source synonym with a single-term target.

I added tests to {{TestQueryParser}} that use the modified MockSynonymAnalyzer 
to ensure that operators block multi-term analysis when they should and don't 
when they shouldn't.

I'll go make issues now for converting Solr's clone of this QueryParser, and 
the standard flexible query parser, to add the same capabilities.

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-05-12 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605.patch

Patch fixing up two problems: 

# The TermQuery-s for multiple whitespace-separated terms within a 
BooleanClause are now flattened directly into the output Query list, rather 
than inserting the enclosing BooleanClause.
# MultiFieldQueryParser's getFieldQuery() is modified to recombine multiple 
terms from each field's query, to produce a series of disjunctions of each 
term against each field.
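
One way to picture the second item is the toy sketch below. It assumes that "a series of disjunctions" means one clause per term, each clause matching that term in any of the fields. The class, the method name recombine(), and the string rendering are hypothetical; the patch itself works on Lucene Query objects, not strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: strings stand in for Lucene Query objects.
class MultiFieldRecombineDemo {

    // For each term, builds one disjunction of that term against every field.
    static List<String> recombine(List<String> fields, List<String> terms) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            clauses.add("(" + String.join(" ", perField) + ")");
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("title", "body");
        List<String> terms = Arrays.asList("quick", "fox");
        // One clause per term, each spanning both fields.
        System.out.println(recombine(fields, terms));
    }
}
```

The point of the transposition is that the output is grouped by term rather than by field, so each term can be required or optional independently of which field it matched in.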

All queryparser module tests now pass, with the exception of the flexible query 
parser's TestStandardQP run with QueryParserTestBase.testQPA().  Since this 
patch doesn't modify anything about the flexible query parser, this is not 
surprising.

> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2605.patch, LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2016-05-10 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2605:
---
Attachment: LUCENE-2605.patch

Initial patch against the classic Lucene QueryParser grammar only.  Runs of 
whitespace-separated text with no operators are now sent as a single input to 
the analyzer, rather than first splitting on whitespace. 

There are 10 failing tests in the queryparser module (didn't try anywhere else 
yet), mostly caused by the way getFieldQuery() produces a nested query when 
there are multiple terms.  I'll look into how to address that.
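
The grouping described in the first paragraph can be sketched roughly as follows. This is a toy model, not the JavaCC grammar from the patch: it assumes, purely for illustration, that the only operators are the keywords AND, OR and NOT, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: the real grammar handles far more syntax than three keywords.
class OperatorRunsDemo {

    // Simplifying assumption: only these three keywords count as operators.
    static boolean isOperator(String token) {
        return token.equals("AND") || token.equals("OR") || token.equals("NOT");
    }

    // Groups consecutive non-operator words into runs. Each run is what would
    // be handed to the analyzer as a single input string.
    static List<String> analyzerInputs(String query) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (isOperator(token)) {
                // An operator ends the current run, if there is one.
                if (current.length() > 0) {
                    runs.add(current.toString());
                    current.setLength(0);
                }
            } else if (!token.isEmpty()) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token);
            }
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(analyzerInputs("wi fi router AND cheap laptop"));
    }
}
```

Here "wi fi router" and "cheap laptop" each reach the analyzer whole, while the operator between them still acts as a boundary.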


> queryparser parses on whitespace
> 
>
> Key: LUCENE-2605
> URL: https://issues.apache.org/jira/browse/LUCENE-2605
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser
> Reporter: Robert Muir
> Assignee: Steve Rowe
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2605.patch



[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2014-03-15 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-2605:
-

Fix Version/s: (was: 4.7)
   4.8

 queryparser parses on whitespace
 

 Key: LUCENE-2605
 URL: https://issues.apache.org/jira/browse/LUCENE-2605
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Reporter: Robert Muir
 Fix For: 4.8





[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

2013-05-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2605:
--

Fix Version/s: (was: 4.3)
   4.4

 queryparser parses on whitespace
 

 Key: LUCENE-2605
 URL: https://issues.apache.org/jira/browse/LUCENE-2605
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/queryparser
Reporter: Robert Muir
 Fix For: 4.4





[jira] Updated: (LUCENE-2605) queryparser parses on whitespace

2011-01-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2605:


Fix Version/s: (was: 3.1)

 queryparser parses on whitespace
 

 Key: LUCENE-2605
 URL: https://issues.apache.org/jira/browse/LUCENE-2605
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Robert Muir
 Fix For: 4.0





[jira] Updated: (LUCENE-2605) queryparser parses on whitespace

2010-09-23 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2605:


Component/s: QueryParser

 queryparser parses on whitespace
 

 Key: LUCENE-2605
 URL: https://issues.apache.org/jira/browse/LUCENE-2605
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Reporter: Robert Muir
 Fix For: 3.1, 4.0

