[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-11-19 Thread Modassar Ather (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217708#comment-14217708
 ] 

Modassar Ather edited comment on LUCENE-5205 at 11/19/14 10:45 AM:
---

Thanks [~talli...@apache.org] for your response. I am using the SpanQueryParser 
and the fix for the query-hanging issue from your GitHub site, as provided in your 
comment.
I am using WhitespaceTokenizer.

With WhitespaceTokenizer:
q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) still gets transformed 
to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, 
field:company], 0, true)

I am trying to find the possible cause of the removal of '' in my config.


was (Author: modassar):
Thanks [~talli...@apache.org] for your response. I am using the SpanQuryParser 
and fix for query hanging issue from your github site as provided in your 
comment.
I tried using WhiteSpaceTokenizer. Will check with StandardAnalyzer too.

With WhiteSpaceTokenizer:
q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) still gets transformed 
to following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, 
field:company], 0, true)

 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: patch
 Fix For: 4.9

 Attachments: LUCENE-5205-cleanup-tests.patch, 
 LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
 LUCENE-5205_dateTestReInitPkgPrvt.patch, 
 LUCENE-5205_improve_stop_word_handling.patch, 
 LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
 SpanQueryParser_v1.patch.gz, patch.txt


 This parser extends QueryParserBase and includes functionality from:
 * Classic QueryParser: most of its syntax
 * SurroundQueryParser: recursive parsing for near and not clauses.
 * ComplexPhraseQueryParser: can handle near queries that include multiterms 
 (wildcard, fuzzy, regex, prefix),
 * AnalyzingQueryParser: has an option to analyze multiterms.
 At a high level, there's a first pass BooleanQuery/field parser and then a 
 span query parser handles all terminal nodes and phrases.
 Same as classic syntax:
 * term: test 
 * fuzzy: roam~0.8, roam~2
 * wildcard: te?t, test*, t*st
 * regex: /\[mb\]oat/
 * phrase: "jakarta apache"
 * phrase with slop: "jakarta apache"~3
 * default or clause: jakarta apache
 * grouping or clause: (jakarta apache)
 * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
 * multiple fields: title:lucene author:hatcher
  
 Main additions in SpanQueryParser syntax vs. classic syntax:
 * Can require in order for phrases with slop with the \~ operator: 
 jakarta apache\~3
 * Can specify not near: fever bieber!\~3,10 ::
 find fever but not if bieber appears within 3 words before or 10 
 words after it.
 * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta 
 apache\]~3 lucene\]\~4 :: 
 find jakarta within 3 words of apache, and that hit has to be within 
 four words before lucene
 * Can also use \[\] for single level phrasal queries instead of " as in: 
 \[jakarta apache\]
 * Can use or grouping clauses in phrasal queries: apache (lucene solr)\~3 
 :: find apache and then either lucene or solr within three words.
 * Can use multiterms in phrasal queries: jakarta\~1 ap*che\~2
 * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ 
 /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like jakarta within two 
 words of ap*che and that hit has to be within ten words of something like 
 solr or that lucene regex.
 * Can require at least x number of hits at boolean level: apache AND (lucene 
 solr tika)~2
 * Can use negative only query: -jakarta :: Find all docs that don't contain 
 jakarta
 * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of 
 potential performance issues!).
 Trivial additions:
 * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, 
 prefix =2)
 * Can specify Optimal String Alignment (OSA) vs Levenshtein for distance 
 <=2: (jakarta~1 (OSA) vs jakarta~1(Levenshtein)
 This parser can be very useful for concordance tasks (see also LUCENE-5317 
 and LUCENE-5318) and for analytical search.  
 Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
 Most of the documentation is in the javadoc for SpanQueryParser.
 Any and all feedback is welcome.  Thank you.
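
The "not near" clause in the description above (find fever, but not if bieber appears within 3 words before or 10 words after it) can be illustrated without Lucene. The following is a minimal sketch over a plain token list; the class and method names are hypothetical, and the exact boundary handling of the real parser (or of Lucene's SpanNotQuery) may differ from this simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of SpanQueryParser or Lucene.
final class NotNearSketch {

    // Returns the positions of `include` that have no occurrence of `exclude`
    // within `pre` tokens before them or `post` tokens after them.
    static List<Integer> notNear(List<String> tokens, String include, String exclude,
                                 int pre, int post) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(include)) {
                continue;
            }
            int from = Math.max(0, i - pre);
            int to = Math.min(tokens.size() - 1, i + post);
            boolean blocked = false;
            for (int j = from; j <= to; j++) {
                if (j != i && tokens.get(j).equals(exclude)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "bieber", "has", "a", "fever", "and", "another", "fever");
        // The first "fever" (position 3) is within 3 tokens after "bieber", so it is
        // rejected; the second one (position 6) is far enough away.
        System.out.println(notNear(doc, "fever", "bieber", 3, 10)); // prints [6]
    }
}
```

Shrinking the "before" window to 2 lets the first occurrence through as well, which is the asymmetry the two numbers in the syntax are meant to express.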



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217793#comment-14217793
 ] 

Tim Allison edited comment on LUCENE-5205 at 11/19/14 11:58 AM:


Good to hear the github workaround works.  If a committer has any interest in 
taking this on, it would be great to merge this into trunk...and then we could 
deprecate AnalyzingQueryParser, SurroundQueryParser and 
ComplexPhraseQueryParser just in time for 5.0. :)

In pure Lucene, with a WhitespaceAnalyzer, the '' is still making it through 
the parsing process.

{noformat}
spanNear([field:SEARCH, field:TOOLS, field:PROVIDER, field:, field:CONSULTING, 
field:COMPANY], 0, true)
{noformat}

When I use a StandardAnalyzer, the '' is correctly dropped:
{noformat}
spanNear([field:search, field:tools, field:provider, field:consulting, 
field:company], 1, true)
{noformat}

What filters are you applying?  From your output, at least the 
LowerCaseFilterFactory, but anything else?
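
To make the difference between the two outputs above concrete, here is a rough sketch in plain Java, with no Lucene classes. It assumes (this is not confirmed in the thread) that the parser splits the phrase on whitespace, analyzes each piece separately, and widens the slop by one for every piece that analyzes to nothing. The two "analyzers" are deliberately crude stand-ins for WhitespaceAnalyzer and StandardAnalyzer, and `--` is a placeholder for the character that was lost from the reported query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Lucene's tokenizers.
final class DroppedTermSketch {

    // Whitespace-style stand-in: keep the piece exactly as typed.
    static String keepAsIs(String piece) {
        return piece;
    }

    // Word-style stand-in: keep only letters and digits, lowercased.
    static String lettersAndDigitsOnly(String piece) {
        StringBuilder sb = new StringBuilder();
        for (char c : piece.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Builds a spanNear-like description of the phrase. Pieces that analyze to
    // an empty string are dropped and the slop is widened by one for each
    // (an assumption about the parser's behavior, see the lead-in).
    static String describe(String phrase, boolean wordStyle) {
        List<String> terms = new ArrayList<>();
        int slop = 0;
        for (String piece : phrase.trim().split("\\s+")) {
            String term = wordStyle ? lettersAndDigitsOnly(piece) : keepAsIs(piece);
            if (term.isEmpty()) {
                slop++;
                continue;
            }
            terms.add("field:" + term);
        }
        return "spanNear(" + terms + ", " + slop + ", true)";
    }

    public static void main(String[] args) {
        String phrase = "SEARCH TOOLS -- CONSULTING";
        System.out.println(describe(phrase, false));
        System.out.println(describe(phrase, true));
    }
}
```

Under these assumptions the whitespace-style path keeps the symbol-only piece as a term with slop 0, while the word-style path drops it and reports slop 1, which matches the shape of the two outputs quoted above.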


was (Author: talli...@mitre.org):
Good to hear the github workaround works.  If a committer has any interest in 
taking this on, it would be great to merge this into trunk...and then we could 
deprecate AnalyzingQueryParser, SurroundQueryParser and 
ComplexPhraseQueryParser just in time for 5.0. :)

In pure Lucene, with a WhitespaceAnalyzer, the '' is still making it through 
the parsing process.

{noformat}
spanNear([field:SEARCH, field:TOOLS, field:PROVIDER, field:, field:CONSULTING, 
field:COMPANY], 0, true)
{noformat}

What filters are you applying?  From your output, at least the 
LowerCaseFilterFactory, but anything else?


[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-11-18 Thread Modassar Ather (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216113#comment-14216113
 ] 

Modassar Ather edited comment on LUCENE-5205 at 11/18/14 11:59 AM:
---

I am trying the following queries and have run into an issue on which I need your 
suggestions. The environment is a 4-shard cluster with embedded ZooKeeper on one 
of them.

q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) gets transformed to the 
following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, 
field:company], 0, true)

field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY) gets transformed to the 
following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, 
field:and, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.) hangs and does not 
return. We have set the query timeAllowed to 5 minutes, but the query does not 
seem to reach the timeout check and keeps running.
While debugging, I found that it gets stuck at m.find(), line 154 of SpanQueryLexer, 
after it has created tokens for the double quotes and the term SEARCH.

Whereas the above query without (') gets transformed to the following:
field: (SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) = 
+spanNear([field:search, field:tools, field:solution, field:provider, 
field:technology, field:co, field:ltd], 0, true)

I need your help in understanding whether I am using the query incorrectly or 
whether this is a bug.
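
The report above says the lexer never returns from m.find(), so a timeout that is only checked later in query execution cannot fire. The cause of the hang is not established in this thread. As a general defensive measure, and not as a description of how SpanQueryLexer works, a regex match in Java can be given a time budget by wrapping the input in a CharSequence that checks a deadline on every character access. All names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of SpanQueryParser or Lucene.
final class BoundedRegexSketch {

    // CharSequence wrapper that throws once the deadline has passed. The regex
    // engine reads its input through charAt(), so a runaway match is interrupted
    // on its next character access.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex matching exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Runs find() with a time budget; throws IllegalStateException on overrun.
    static boolean findWithin(Pattern pattern, String input, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        Matcher m = pattern.matcher(new DeadlineCharSequence(input, deadline));
        return m.find();
    }

    public static void main(String[] args) {
        Pattern upperCaseRun = Pattern.compile("[A-Z]+");
        System.out.println(findWithin(upperCaseRun, "search TOOLS", 5000));
    }
}
```

This only bounds the time spent inside the matcher; it does not explain or fix whatever makes the match slow in the first place.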


was (Author: modassar):
I am trying following queries and facing an issue for which need your 
suggestions. The environment is 4 shard cluster with embedded zookeeper on one 
of them.

q=field:(SEARCH TOOLS PROVIDER  CONSULTING COMPANY) Gets transformed to 
following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, 
field:company], 0, true)

field:(SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY) Gets transformed to 
following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, 
field:and, field:consulting, field:company], 0, true)

field:(SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.) Gets stuck and 
does not return. We have set query timeAllowed to 5 minutes but it seems that 
it is not reaching here and continues.
During debug I found that it gets stuck at m.find(), Line 154 of SpanQueryLexer 
after it has created token for double quotes and term SEARCH.

Whereas the above query without (') gets transformed to following
field:(SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) = 
+spanNear([field:search, field:tools, field:solution, field:provider, 
field:technology, field:co, field:ltd], 0, true)

Need your help in understanding if I am not using the query properly or it can 
be an issue.


[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-11-18 Thread Modassar Ather (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216113#comment-14216113
 ] 

Modassar Ather edited comment on LUCENE-5205 at 11/18/14 12:01 PM:
---

I am trying the following queries and have run into an issue on which I need your 
suggestions. The environment is a 4-shard cluster with embedded ZooKeeper on one 
of them.

q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) gets transformed to the 
following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, 
field:company], 0, true)

field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY) gets transformed to the 
following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, 
field:and, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.) hangs and does not 
return. We have set the query timeAllowed to 5 minutes, but the query does not 
seem to reach the timeout check and keeps running.
While debugging, I found that it gets stuck at m.find(), line 154 of SpanQueryLexer, 
after it has created tokens for the double quotes and the term SEARCH.

Whereas the above query without (') gets transformed to the following:
field: (SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) = 
+spanNear([field:search, field:tools, field:solution, field:provider, 
field:technology, field:co, field:ltd], 0, true)

I need your help in understanding whether I am using the query incorrectly or 
whether this is a bug.
NOTE: A space is added between field: and the query to avoid transformation to 
smileys.


was (Author: modassar):
I am trying following queries and facing an issue for which need your 
suggestions. The environment is 4 shard cluster with embedded zookeeper on one 
of them.

q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) Gets transformed to 
following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, 
field:company], 0, true)

field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY) Gets transformed to 
following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, 
field:and, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.) Gets stuck and 
does not return. We have set query timeAllowed to 5 minutes but it seems that 
it is not reaching here and continues.
During debug I found that it gets stuck at m.find(), Line 154 of SpanQueryLexer 
after it has created token for double quotes and term SEARCH.

Whereas the above query without (') gets transformed to following
field: (SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) = 
+spanNear([field:search, field:tools, field:solution, field:provider, 
field:technology, field:co, field:ltd], 0, true)

Need your help in understanding if I am not using the query properly or it can 
be an issue.


[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-11-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216571#comment-14216571
 ] 

Tim Allison edited comment on LUCENE-5205 at 11/18/14 6:47 PM:
---

{quote}
field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY) Gets transformed to 
following:
 +spanNear([field:search, spanNear([field:s, field:provider], 0, true), 
field:s, field:and, field:consulting, field:company], 0, true)
{quote}
Unfortunately, I can't think of a way around this.  In the SpanQueryParser, 
single quotes should be used to mark a token that should not be further parsed, 
e.g. '/files/a/b/c/path.html' should be treated as a string, not a regex.  I 
toyed with requiring a space before the opening ' and a space after the closing ', 
but that seemed hacky.

If you escape your apostrophes, you should get the results you expect (this is 
with a whitespace analyzer, you may get different results with 
StandardAnalyzer):
{noformat} SEARCH TOOL\\'S SOLUTION PROVIDER\\'S TECHNOLOGY CO., LTD{noformat}
yields:
{noformat}
f1:search f1:tool's f1:solution f1:provider's f1:technology f1:co., f1:ltd
{noformat}
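
For callers that build query strings programmatically, the escaping suggested above can be applied before the string reaches the parser. A minimal sketch follows, assuming a single backslash before the apostrophe is what the parser expects; the helper name is hypothetical and not part of the patch.

```java
// Hypothetical illustration only; not part of SpanQueryParser or Lucene.
final class ApostropheEscapeSketch {

    // Prefixes every apostrophe that is not already escaped with a backslash.
    static String escapeApostrophes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 8);
        boolean escaped = false; // true when the previous char was an active backslash
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\'' && !escaped) {
                out.append('\\');
            }
            out.append(c);
            // A backslash escapes the next char unless it was itself escaped.
            escaped = (c == '\\') && !escaped;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeApostrophes("SEARCH TOOL'S SOLUTION PROVIDER'S TECHNOLOGY CO., LTD"));
    }
}
```

Apostrophes that already carry a backslash are left alone, so the helper can be applied more than once without doubling the escapes.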

{quote}q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) Gets transformed 
to following:
 +spanNear([field:search, field:tools, field:provider, field:, 
field:consulting, field:company], 0, true)
{quote}
I think this is fixed on github.  What Analyzer chain are you using?




was (Author: talli...@mitre.org):
{quote}
field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY) Gets transformed to 
following:
 +spanNear([field:search, spanNear([field:s, field:provider], 0, true), 
field:s, field:and, field:consulting, field:company], 0, true)
{quote}
Unfortunately, I can't think of a way around this.  In the SpanQueryParser, 
single quotes should be used to mark a token that should not be further parsed, 
i.e. '/files/a/b/c/path.html' should be treated as a string not a regex.  I 
toyed with requiring a space before the start ' and space after the ', but that 
seemed hacky.

If you escape your apostrophes, you should get the results you expect (this is 
with a whitespace analyzer, you may get different results with 
StandardAnalyzer):
{noformat} SEARCH TOOL\\'S SOLUTION PROVIDER\\'S TECHNOLOGY CO., LTD{noformat}
yields:f1:search f1:tool's f1:solution f1:provider's f1:technology f1:co., 
f1:ltd
{noformat}

{quote}q=field: (SEARCH TOOLS PROVIDER  CONSULTING COMPANY) Gets transformed 
to following:
 +spanNear([field:search, field:tools, field:provider, field:, 
field:consulting, field:company], 0, true)
{quote}
I think this is fixed on github.  What Analyzer chain are you using?




[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-07-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070529#comment-14070529
 ] 

Tim Allison edited comment on LUCENE-5205 at 7/23/14 11:23 AM:
---

Unrelated to work on LUCENE-5758, I added a standalone package (including a jar) 
that tracks the current latest stable release of Lucene here: 
https://github.com/tballison/lucene-addons/tree/master/lucene-5205

For trunk integration, see lucene-5205 branch of my fork on github.


was (Author: talli...@mitre.org):
Unrelated to work on LUCENE-5758, I added a standalone package including a jar 
to track with current latest stable distro of Lucene here: 
https://github.com/tballison/tallison-lucene-addons/tree/master/lucene-5205

For trunk integration, see lucene-5205 branch of my fork on github.




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-06-12 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029566#comment-14029566
 ] 

Paul Elschot edited comment on LUCENE-5205 at 6/12/14 6:35 PM:
---

Created LUCENE-5758 to extend SpanQueryParser with positional joins.


was (Author: paul.elsc...@xs4all.nl):
Created LUCENE-5728 to extend SpanQueryParser with positional joins.

 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: patch
 Fix For: 4.9

 Attachments: LUCENE-5205-cleanup-tests.patch, 
 LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
 LUCENE-5205_dateTestReInitPkgPrvt.patch, 
 LUCENE-5205_improve_stop_word_handling.patch, 
 LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
 SpanQueryParser_v1.patch.gz, patch.txt


 This parser extends QueryParserBase and includes functionality from:
 * Classic QueryParser: most of its syntax
 * SurroundQueryParser: recursive parsing for near and not clauses.
 * ComplexPhraseQueryParser: can handle near queries that include multiterms 
 (wildcard, fuzzy, regex, prefix),
 * AnalyzingQueryParser: has an option to analyze multiterms.
 At a high level, there's a first pass BooleanQuery/field parser and then a 
 span query parser handles all terminal nodes and phrases.
 Same as classic syntax:
 * term: test 
 * fuzzy: roam~0.8, roam~2
 * wildcard: te?t, test*, t*st
 * regex: /[mb]oat/
 * phrase: "jakarta apache"
 * phrase with slop: "jakarta apache"~3
 * default or clause: jakarta apache
 * grouping or clause: (jakarta apache)
 * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
 * multiple fields: title:lucene author:hatcher
  
 Main additions in SpanQueryParser syntax vs. classic syntax:
 * Can require in order for phrases with slop with the ~> operator: 
 "jakarta apache"~>3
 * Can specify not near: "fever bieber"!~3,10 ::
 find fever but not if bieber appears within 3 words before or 10 
 words after it.
 * Fully recursive phrasal queries with [ and ]; as in: [[jakarta 
 apache]~3 lucene]~>4 :: 
 find jakarta within 3 words of apache, and that hit has to be within 
 four words before lucene
 * Can also use [] for single level phrasal queries instead of "" as in: 
 [jakarta apache]
 * Can use or grouping clauses in phrasal queries: "apache (lucene solr)"~3 
 :: find apache and then either lucene or solr within three words.
 * Can use multiterms in phrasal queries: "jakarta~1 ap*che"~2
 * Did I mention full recursion: [[jakarta~1 ap*che]~2 (solr~ 
 /l[ou]+[cs][en]+/)]~10 :: Find something like jakarta within two 
 words of ap*che and that hit has to be within ten words of something like 
 solr or that lucene regex.
 * Can require at least x number of hits at boolean level: apache AND (lucene 
 solr tika)~2
 * Can use negative only query: -jakarta :: Find all docs that don't contain 
 jakarta
 * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of 
 potential performance issues!).
 Trivial additions:
 * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance = 1, 
 prefix = 2)
 * Can specify Optimal String Alignment (OSA) vs Levenshtein for distance 
 <= 2: jakarta~1 (OSA) vs jakarta~>1 (Levenshtein)
 This parser can be very useful for concordance tasks (see also LUCENE-5317 
 and LUCENE-5318) and for analytical search.  
 Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
 Most of the documentation is in the javadoc for SpanQueryParser.
 Any and all feedback is welcome.  Thank you.
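
 To make the "not near" semantics above concrete, here is a minimal sketch that works on plain token positions. It is not the parser's code and does not use Lucene; the class and method names are hypothetical, and it treats each term as a single position rather than a span with a start and an end.

```java
import java.util.ArrayList;
import java.util.List;

/** Position-level illustration of "not near"; a simplification, not the parser's code. */
final class NotNearSketch {

    /**
     * Returns the positions from includePositions that have no excluded term
     * within `pre` positions before them or `post` positions after them.
     */
    static List<Integer> notNear(int[] includePositions, int[] excludePositions,
                                 int pre, int post) {
        List<Integer> hits = new ArrayList<>();
        for (int inc : includePositions) {
            boolean blocked = false;
            for (int exc : excludePositions) {
                int offset = exc - inc; // negative: the excluded term comes first
                if (offset >= -pre && offset <= post) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                hits.add(inc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "fever" at positions 5 and 40, "bieber" at position 12.
        // Position 5 is rejected because bieber is 7 words after it.
        System.out.println(notNear(new int[] {5, 40}, new int[] {12}, 3, 10));
    }
}
```

 With pre = 3 and post = 10 this mirrors the fever/bieber example: only occurrences of fever with no bieber in that window survive.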



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-03-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935395#comment-13935395
 ] 

Tim Allison edited comment on LUCENE-5205 at 3/14/14 6:48 PM:
--

[~otis], thank you for raising this point for discussion.  Yes, I acknowledged 
LUCENE-2878 in the original description of this issue, and that process has 
been ongoing since 2011.  My hope was that key SpanQuery functionality was 
going to be moved over to regular queries.  If this happens, I can modify this 
parser to handle those mods. 

Whatever Spans functionality is not moved over will have to disappear 
from this parser, and that could be painful depending on what functionality is 
not transitioned.

When SpanQueries get nuked, what will happen to:
1) in order and not in order phrases
2) functionality of SpanNot
3) searching for a phrase within a proximity of something 

I think those are the three main things that can't be handled by regular 
queries at this point.
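
For readers less familiar with the first item, the difference between in-order and not-in-order proximity can be sketched on two token positions. This is only an illustration with hypothetical names; it assumes slop counts the positions between the two terms and ignores how Lucene spans actually track start and end offsets.

```java
/** Two-term proximity on raw positions; a simplified illustration only. */
final class NearSketch {

    /** True if at most `slop` positions separate a and b, in either order. */
    static boolean nearUnordered(int posA, int posB, int slop) {
        return posA != posB && Math.abs(posB - posA) - 1 <= slop;
    }

    /** As above, but a must come before b. */
    static boolean nearOrdered(int posA, int posB, int slop) {
        return posB > posA && posB - posA - 1 <= slop;
    }

    public static void main(String[] args) {
        // "apache" at position 3 and "jakarta" at position 5, query order jakarta then apache.
        System.out.println(nearUnordered(5, 3, 1));
        System.out.println(nearOrdered(5, 3, 1));
    }
}
```

A regular PhraseQuery with slop behaves like the unordered case; the ordered variant is the part that currently needs SpanNearQuery.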



was (Author: talli...@mitre.org):
[~otis], thank you for raising this point for discussion.  Yes, I acknowledged 
LUCENE-2878 in the original description of this issue, and that process has 
been ongoing since 2011.  My hope was that key SpanQuery functionality was 
going to be moved over to regular queries.  If this happens, I can modify this 
parser to handle those mods. 

Whatever functionality of Spans that is not moved over will have to disappear 
from this parser, and that could be painful depending on what functionality is 
not transitioned.

So, when SpanQueries get nuked, will it be possible to create a query to:
1) search for a set of words in order and not in order (SpanNear)
2) search for a phrase or an Or clause near another phrase or an Or clause 
(recursion capability of SpanQuery)?

I think those are the two main things that can't be handled by regular queries 
at this point.


 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: patch
 Fix For: 4.8

 Attachments: LUCENE-5205-cleanup-tests.patch, 
 LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
 LUCENE-5205_dateTestReInitPkgPrvt.patch, 
 LUCENE-5205_improve_stop_word_handling.patch, 
 LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
 SpanQueryParser_v1.patch.gz, patch.txt



[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-03-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923915#comment-13923915
 ] 

Tim Allison edited comment on LUCENE-5205 at 3/7/14 2:36 PM:
-

The root of this problem is that SpanNearQuery has no good way to handle 
stopwords in a way analogous to PhraseQuery.

In SpanQueryParser, this limitation should be well described in the javadocs to 
SpanQueryParser and in the test cases.  Let me know if it isn't.  You have the 
option of throwing an exception when a stopword is found to notify the user 
about stopwords, but that's exceedingly unsatisfactory.

Without digging into the internals of SpanNearQuery, we can still do better on 
this.  One proposal is to do what the basic highlighter does and risk false 
positives...behind the scenes, modify "calculator for evaluating" to 
"calculator evaluating"~1.  This would then falsely match "calculator zebra 
evaluating".  PhraseQuery can have false positives, too, but it guarantees that 
the false hit has to be a stop word.  This solution would not do that.  So, is 
this better than no matches at all?
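
A rough sketch of that proposal, outside of Lucene: drop the stop words from the phrase and add one unit of slop per removed word, then match the remaining terms in order. Everything here (class, method and field names, the naive matcher) is hypothetical and only meant to show where the false positive comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrates "drop stop words, widen slop"; not the parser's implementation. */
final class StopwordSlopSketch {

    /** Terms that survive stop word removal, plus the slop added to cover the gaps. */
    static final class Rewritten {
        final List<String> terms;
        final int slop;

        Rewritten(List<String> terms, int slop) {
            this.terms = terms;
            this.slop = slop;
        }
    }

    static Rewritten rewrite(List<String> phraseTerms, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        int removed = 0;
        for (String term : phraseTerms) {
            if (stopWords.contains(term)) {
                removed++; // each removed stop word adds one position of slop
            } else {
                kept.add(term);
            }
        }
        return new Rewritten(kept, removed);
    }

    /** Naive in-order match: the kept terms may have at most `slop` extra tokens between them. */
    static boolean matchesInOrder(List<String> doc, Rewritten query) {
        List<String> terms = query.terms;
        if (terms.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(terms.get(0))) {
                continue;
            }
            int pos = start;
            int next = 1;
            while (next < terms.size()) {
                int found = -1;
                // earliest later occurrence keeps the matched window as short as possible
                for (int i = pos + 1; i < doc.size(); i++) {
                    if (doc.get(i).equals(terms.get(next))) {
                        found = i;
                        break;
                    }
                }
                if (found < 0) {
                    break;
                }
                pos = found;
                next++;
            }
            int extraTokens = (pos - start + 1) - terms.size();
            if (next == terms.size() && extraTokens <= query.slop) {
                return true;
            }
        }
        return false;
    }
}
```

Rewriting [calculator, for, evaluating] with "for" as a stop word gives [calculator, evaluating] with slop 1, and that rewritten query accepts the token sequence calculator zebra evaluating, which is exactly the false positive described above.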



was (Author: talli...@mitre.org):
The root of this problem is that SpanNearIQuery has no good way to handle 
stopwords in a way analagous to PhraseQuery.

In SpanQueryParser, this limitation should be well described in the javadocs to 
SpanQueryParser and in the test cases.  Let me know if it isn't.  You have the 
option of throwing an exception when a stopword is found to notify the user 
about stopwords, but that's exceedingly unsatisfactory.

Without digging into the internals of SpanNearQuery, we can still do better on 
this.  One proposal is to do what the basic highlighter does and risk false 
positives...behind the scenes modify calculator for evaluating to calculator 
evaluating~1.  This would then falsely match calculator zebra evaluating.  
PhraseQuery can have false positives, too, but it guarantees that the false hit 
has to be a stop word.  This solution would not do that.  So, is this better 
than no matches at all?


 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: patch
 Fix For: 4.7

 Attachments: LUCENE-5205-cleanup-tests.patch, 
 LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
 LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_smallTestMods.patch, 
 LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt



[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-02-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907304#comment-13907304
 ] 

Tim Allison edited comment on LUCENE-5205 at 2/20/14 7:03 PM:
--

Code duplication.  The biggest offenders are in test (I think, let me know if 
you disagree):
 1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend 
QPTestBase, but that will require some reworking of QPTestBase as well, and I 
didn't want to touch that (hence the duplication).  It would also help to add a 
getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't 
want to touch anything outside of the parser at the cost of duplication. 
 2) TestMultiAnalyzer.  This relies on testing equality of string 
representations of queries.  Will have to modify TestMultiAnalyzer in a way 
similar to QPTestBase.
 3) TestComplexPhraseQuery.  Should be straightforward to extend the 
original, but will need to make checkMatches public so that I can override it.  
I'll also have to move the tests with slightly different syntax into a 
different test, but that's easy and would help declutter.

There's other code duplication with AnalyzingQueryParser...should we break that 
functionality out into a helper class?

Any other major duplication areas?

Y, I don't like the reinit at all.  The reason that's there was so that I could 
extend QueryParserBase, but I'm not sure that that decision buys much anymore.  
As I remember, it buys date parsing in range queries (which I'm now not sure I 
actually want) and addBoolean; there may be more, but I'm not sure there is.

It would clean up a fair bit of code if I implement 
CommonQueryParserConfiguration instead of extending QueryParserBase.  I'd still 
have to leave in some things that don't make sense for the SpanQueryParser, 
though: lowerCaseExpandedTerms, enablePositionIncrements.  Another option would 
be to abandon CQPC, but I wanted this parser to at least implement that 
interface.  Let me know what makes sense.  

As for the public base classes, y, those can go private for now.  I made them 
public in case anyone wanted to extend them, but, as you point out, then I 
really ought to add javadocs and treat them as if they were public (which they 
are!).

As for date/locale issues, I'll take a look.


was (Author: talli...@mitre.org):
Code duplication.  The biggest offenders are in test (I think, let me know if 
you disagree):
 1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend 
QPTestBase, but that will require some reworking of QPTestBase as, and I didn't 
want to touch that (hence the duplication).  It would also help to add a 
getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't 
want to touch anything outside of the parser at the cost of duplication. 
 2) TestMultiAnalyzer...not sure how not to duplicate.  This relies on 
testing equality of string representations of queries.
 3) TestComplexPhraseQuery.  Should be straightforward to extend the 
original, but will need to make checkMatches public so that I can override it.  
I'll also have to move the tests with slightly different syntax into a 
different test, but that's easy and would help declutter.

There's other code duplication with AnalyzingQueryParser...should we break that 
functionality out into a helper class?

Any other major duplication areas?

Y, I don't like the reinit at all.  The reason that's there was so that I could 
extend QueryParserBase, but I'm not sure that that decision buys much anymore.  
As I remember, it buys date parsing in range queries (which I'm now not sure I 
actually want) and addBoolean; there may be more, but I'm not sure there is.

It would clean up a fair bit of code if I implement 
CommonQueryParserConfiguration instead of extending QueryParserBase.  I'd still 
have to leave in some things that don't make sense for the SpanQueryParser, 
though: lowerCaseExpandedTerms, enablePositionIncrements.  Another option would 
be to abandon CQPC, but I wanted this parser to at least implement that 
interface.  Let me know what makes sense.  

As for the public base classes, y, those can go private for now.  I made them 
public in case anyone wanted to extend them, but, as you point out, then I 
really ought to add javadocs and treat them as if they were public (which they 
are!).

As for date/locale issues, I'll take a look.

 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: 

[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-02-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907354#comment-13907354
 ] 

Tim Allison edited comment on LUCENE-5205 at 2/20/14 7:20 PM:
--

{quote}
I know this probably exists elsewhere, i swear there is something in QPBase 
doing this for range queries. 
{quote}
Busted...yeah...I meant to include that in my list of duplication.  Sorry.  

As for Reinit, you're absolutely right...no one is calling that...let's go with 
documentation and assert false.

Will start some patches on the smaller stuff.  Thank you, again!



was (Author: talli...@mitre.org):
{quote}
I know this probably exists elsewhere, i swear there is something in QPBase 
doing this for range queries. 
{quote}
Busted...yeah...I meant to include that in my list of duplication.  Sorry.  

As for Reinit, you're absolutely right...no one is calling that...let's go with 
documentation and assert false.

Will start some patches on the smaller stuff.  Thank you, again!

As 

 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: patch
 Fix For: 4.7

 Attachments: LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
 LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
 SpanQueryParser_v1.patch.gz, patch.txt








[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

2014-02-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907588#comment-13907588
 ] 

Tim Allison edited comment on LUCENE-5205 at 2/20/14 10:01 PM:
---

Back on the issue of reinventing analyzeMultiterm...part of that reinvention 
was because I was getting a "setReader() called in wrong state" exception in 
one of my tests.  With analyzeMultiterm in QueryParserBase as it stands, the 
token stream is not consumed or closed if an exception is thrown.  Therefore, 
the next time you run the parser (with the same analyzer) you can get:
{noformat}
   [junit4] Throwable #1: java.lang.AssertionError: setReader() called in 
wrong state: INCREMENT_FALSE
   [junit4]at 
__randomizedtesting.SeedInfo.seed([6E1DC3D6C716BC75:EDEE4C0E5E329586]:0)
   [junit4]at 
org.apache.lucene.analysis.MockTokenizer.setReaderTestPoint(MockTokenizer.java:266)
   [junit4]at 
org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:92)
   [junit4]at 
org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:304)
   [junit4]at 
org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:181)
{noformat}

Should I fix this in QueryParserBase or did I not build the test analyzer 
correctly? 
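
To illustrate the failure mode without pulling in Lucene, here is a small stand-in. StrictStream is a hypothetical class that mimics a reused tokenizer which refuses setReader() until the previous use has been closed; in real Lucene code the consumer workflow is reset(), incrementToken() until it returns false, end(), then close(), and this sketch collapses that into a single consume() call.

```java
/** Shows why a stream must be closed even when analysis throws; a stand-in, not Lucene code. */
final class StreamReuseSketch {

    /** Hypothetical stand-in for a reused tokenizer that checks call order. */
    static final class StrictStream implements AutoCloseable {
        private boolean inUse = false;

        void setReader(String text) {
            if (inUse) {
                throw new IllegalStateException("setReader() called in wrong state");
            }
            inUse = true;
        }

        void consume(boolean fail) {
            if (fail) {
                throw new IllegalArgumentException("analysis failed");
            }
        }

        @Override
        public void close() {
            inUse = false;
        }
    }

    /** Leaves the stream open when consume() throws. */
    static void analyzeLeaky(StrictStream stream, String text, boolean fail) {
        stream.setReader(text);
        stream.consume(fail);
        stream.close();
    }

    /** Always closes the stream, even when consume() throws. */
    static void analyzeSafely(StrictStream stream, String text, boolean fail) {
        stream.setReader(text);
        try {
            stream.consume(fail);
        } finally {
            stream.close();
        }
    }

    /** True if the same stream can be reused after a failed analysis. */
    static boolean reusableAfterFailure(boolean safe) {
        StrictStream stream = new StrictStream();
        try {
            if (safe) {
                analyzeSafely(stream, "a*", true);
            } else {
                analyzeLeaky(stream, "a*", true);
            }
        } catch (IllegalArgumentException expected) {
            // the simulated analysis failure
        }
        try {
            analyzeSafely(stream, "b*", false);
            return true;
        } catch (IllegalStateException wrongState) {
            return false;
        }
    }
}
```

If the real problem has the same shape, wrapping the token stream handling in try/finally (or try-with-resources) would be one way to keep a failed parse from poisoning the next one; whether that belongs in QueryParserBase is the open question above.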


was (Author: talli...@mitre.org):
Back on the issue of reinventing analyzeMultiterm...part of that reinvention 
was  because I was getting a Analyzer.setReader() in wrong state exception in 
one of my tests.  With analyzeMultiterm in QueryParserBase as it stands, the 
analyzer is not consumed if an exception is thrown.  Therefore, next time you 
run the parser (with the same analyzer) you can get:
{noformat}
   [junit4] Throwable #1: java.lang.AssertionError: setReader() called in 
wrong state: INCREMENT_FALSE
   [junit4]at 
__randomizedtesting.SeedInfo.seed([6E1DC3D6C716BC75:EDEE4C0E5E329586]:0)
   [junit4]at 
org.apache.lucene.analysis.MockTokenizer.setReaderTestPoint(MockTokenizer.java:266)
   [junit4]at 
org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:92)
   [junit4]at 
org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:304)
   [junit4]at 
org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:181)
{noformat}

Should I fix this in QueryParserBase or did I not build the test analyzer 
correctly? 

 [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
 classic QueryParser
 ---

 Key: LUCENE-5205
 URL: https://issues.apache.org/jira/browse/LUCENE-5205
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Tim Allison
  Labels: patch
 Fix For: 4.7

 Attachments: LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
 LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
 SpanQueryParser_v1.patch.gz, patch.txt


 This parser extends QueryParserBase and includes functionality from:
 * Classic QueryParser: most of its syntax
 * SurroundQueryParser: recursive parsing for near and not clauses.
 * ComplexPhraseQueryParser: can handle near queries that include multiterms 
 (wildcard, fuzzy, regex, prefix),
 * AnalyzingQueryParser: has an option to analyze multiterms.
 At a high level, there's a first pass BooleanQuery/field parser and then a 
 span query parser handles all terminal nodes and phrases.
 Same as classic syntax:
 * term: test 
 * fuzzy: roam~0.8, roam~2
 * wildcard: te?t, test*, t*st
 * regex: /[mb]oat/
 * phrase: "jakarta apache"
 * phrase with slop: "jakarta apache"~3
 * default or clause: jakarta apache
 * grouping or clause: (jakarta apache)
 * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
 * multiple fields: title:lucene author:hatcher
  
 Main additions in SpanQueryParser syntax vs. classic syntax:
 * Can require in order for phrases with slop with the ~> operator: 
 "jakarta apache"~>3
 * Can specify not near: "fever bieber"!~3,10 ::
 find fever but not if bieber appears within 3 words before or 10 
 words after it.
 * Fully recursive phrasal queries with [ and ]; as in: [[jakarta 
 apache]~3 lucene]~>4 :: 
 find jakarta within 3 words of apache, and that hit has to be within 
 four words before lucene
 * Can also use [] for single level phrasal queries instead of "" as in: 
 [jakarta apache]
 * Can use or grouping clauses in phrasal queries: "apache (lucene solr)"~3 
 :: find apache and then either lucene or solr within three words.
 * Can use multiterms in phrasal queries: "jakarta~1 ap*che"~2
 * Did I mention full recursion: [[jakarta~1 ap*che]~2 (solr~ 
 /l[ou]+[cs][en]+/)]~10 :: Find something like jakarta within