[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217708#comment-14217708 ]

Modassar Ather edited comment on LUCENE-5205 at 11/19/14 10:45 AM:
---

Thanks [~talli...@apache.org] for your response. I am using the SpanQueryParser and the fix for the query hanging issue from your github site, as provided in your comment. I am using WhitespaceTokenizer.

With WhitespaceTokenizer:
q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
still gets transformed to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)

I am trying to find the possible cause of the removal of '' in my config.

was (Author: modassar):
Thanks [~talli...@apache.org] for your response. I am using the SpanQueryParser and the fix for the query hanging issue from your github site, as provided in your comment. I tried using WhitespaceTokenizer. Will check with StandardAnalyzer too.

With WhitespaceTokenizer:
q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
still gets transformed to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)

[PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
---
Key: LUCENE-5205
URL: https://issues.apache.org/jira/browse/LUCENE-5205
Project: Lucene - Core
Issue Type: Improvement
Components: core/queryparser
Reporter: Tim Allison
Labels: patch
Fix For: 4.9
Attachments: LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_improve_stop_word_handling.patch, LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt

This parser extends QueryParserBase and includes functionality from:
* Classic QueryParser: most of its syntax
* SurroundQueryParser: recursive parsing for "near" and "not" clauses.
* ComplexPhraseQueryParser: can handle "near" queries that include multiterms (wildcard, fuzzy, regex, prefix).
* AnalyzingQueryParser: has an option to analyze multiterms.

At a high level, there's a first pass BooleanQuery/field parser and then a span query parser handles all terminal nodes and phrases.

Same as classic syntax:
* term: test
* fuzzy: roam~0.8, roam~2
* wildcard: te?t, test*, t*st
* regex: /\[mb\]oat/
* phrase: "jakarta apache"
* phrase with slop: "jakarta apache"~3
* default "or" clause: jakarta apache
* grouping "or" clause: (jakarta apache)
* boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
* multiple fields: title:lucene author:hatcher

Main additions in SpanQueryParser syntax vs. classic syntax:
* Can require "in order" for phrases with slop with the \~> operator: "jakarta apache"\~>3
* Can specify "not near": "fever bieber"!\~3,10 :: find "fever" but not if "bieber" appears within 3 words before or 10 words after it.
* Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta apache\]~3 lucene\]\~>4 :: find "jakarta" within 3 words of "apache", and that hit has to be within four words before "lucene"
* Can also use \[\] for single level phrasal queries instead of " as in: \[jakarta apache\]
* Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"\~3 :: find "apache" and then either "lucene" or "solr" within three words.
* Can use multiterms in phrasal queries: "jakarta\~1 ap*che"\~2
* Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like "jakarta" within two words of "ap*che" and that hit has to be within ten words of something like "solr" or that "lucene" regex.
* Can require at least x number of hits at boolean level: apache AND (lucene solr tika)~2
* Can use negative only query: -jakarta :: Find all docs that don't contain "jakarta"
* Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!).
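The "not near" bullet in the list above can be illustrated with a toy model over token positions. This is only a sketch under stated assumptions: `NotNearSketch` and `notNear` are hypothetical names, the code is not the parser's or Lucene's implementation, and it assumes "within N words" means a position distance of at most N.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "not near" idea over token positions; not Lucene code. */
final class NotNearSketch {

    /**
     * Returns the positions of {@code include} that have no {@code exclude}
     * token within {@code pre} positions before or {@code post} positions after.
     */
    static List<Integer> notNear(String[] tokens, String include, String exclude,
                                 int pre, int post) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(include)) {
                continue;
            }
            // Scan the window [i - pre, i + post], clipped to the document.
            int from = Math.max(0, i - pre);
            int to = Math.min(tokens.length - 1, i + post);
            boolean blocked = false;
            for (int j = from; j <= to && !blocked; j++) {
                blocked = tokens[j].equals(exclude);
            }
            if (!blocked) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] doc = "bieber has a fever and later another fever".split(" ");
        // Window of 3 positions before and 10 after, as in the bullet's example.
        System.out.println(notNear(doc, "fever", "bieber", 3, 10));
    }
}
```

In this toy document the first "fever" is rejected because "bieber" sits three positions before it, while the second one is kept. The real parser's boundary handling may differ.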
Trivial additions:
* Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, prefix =2)
* Can specify Optimal String Alignment (OSA) vs Levenshtein for distance <=2: jakarta~1 (OSA) vs jakarta~>1 (Levenshtein)

This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318) and for analytical search.

Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.

Most of the documentation is in the javadoc for SpanQueryParser.

Any and all feedback is welcome. Thank you.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
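The difference between Optimal String Alignment (OSA) and Levenshtein distance mentioned in the trivial additions can be shown with a small worked example. This is a textbook sketch, not the code the parser or Lucene uses; `EditDistanceSketch` and its methods are hypothetical names. The only difference between the two is that OSA counts swapping two adjacent characters as one edit.

```java
/** Textbook dynamic-programming edit distances; illustration only. */
final class EditDistanceSketch {

    /** Classic Levenshtein: insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        return distance(a, b, false);
    }

    /** Optimal String Alignment: Levenshtein plus adjacent transpositions. */
    static int osa(String a, String b) {
        return distance(a, b, true);
    }

    private static int distance(String a, String b, boolean allowTransposition) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                    d[i - 1][j - 1] + cost);
                // OSA only: treat a swap of two adjacent characters as one edit.
                if (allowTransposition && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "jakrata" is "jakarta" with two adjacent letters swapped.
        System.out.println(levenshtein("jakarta", "jakrata"));
        System.out.println(osa("jakarta", "jakrata"));
    }
}
```

So a fuzzy term limited to distance 1 would accept the transposed spelling under OSA but not under Levenshtein, where the swap costs two substitutions.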
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217793#comment-14217793 ]

Tim Allison edited comment on LUCENE-5205 at 11/19/14 11:58 AM:
---

Good to hear the github workaround works. If a committer has any interest in taking this on, it would be great to merge this into trunk...and then we could deprecate AnalyzingQueryParser, SurroundQueryParser and ComplexPhraseQueryParser just in time for 5.0. :)

In pure Lucene, with a WhitespaceAnalyzer, the '' is still making it through the parsing process.
{noformat}
spanNear([field:SEARCH, field:TOOLS, field:PROVIDER, field:, field:CONSULTING, field:COMPANY], 0, true)
{noformat}

When I use a StandardAnalyzer, the '' is correctly dropped:
{noformat}
spanNear([field:search, field:tools, field:provider, field:consulting, field:company], 1, true)
{noformat}

What filters are you applying? From your output, at least the LowerCaseFilterFactory, but anything else?

was (Author: talli...@mitre.org):
Good to hear the github workaround works. If a committer has any interest in taking this on, it would be great to merge this into trunk...and then we could deprecate AnalyzingQueryParser, SurroundQueryParser and ComplexPhraseQueryParser just in time for 5.0. :)

In pure Lucene, with a WhitespaceAnalyzer, the '' is still making it through the parsing process.
{noformat}
spanNear([field:SEARCH, field:TOOLS, field:PROVIDER, field:, field:CONSULTING, field:COMPANY], 0, true)
{noformat}

What filters are you applying? From your output, at least the LowerCaseFilterFactory, but anything else?
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216113#comment-14216113 ]

Modassar Ather edited comment on LUCENE-5205 at 11/18/14 11:59 AM:
---

I am trying the following queries and am facing an issue for which I need your suggestions. The environment is a 4 shard cluster with embedded zookeeper on one of them.

q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, field:and, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.)
Gets stuck and does not return. We have set the query timeAllowed to 5 minutes, but it seems that the query never gets that far and keeps running. During debugging I found that it gets stuck at m.find(), line 154 of SpanQueryLexer, after it has created tokens for the double quotes and the term SEARCH.

Whereas the above query without the apostrophe (') gets transformed to the following:
field: (SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) =
+spanNear([field:search, field:tools, field:solution, field:provider, field:technology, field:co, field:ltd], 0, true)

I need your help in understanding whether I am using the query incorrectly or whether this is a bug.

was (Author: modassar):
I am trying the following queries and am facing an issue for which I need your suggestions. The environment is a 4 shard cluster with embedded zookeeper on one of them.

q=field:(SEARCH TOOLS PROVIDER CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)

field:(SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, field:and, field:consulting, field:company], 0, true)

field:(SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.)
Gets stuck and does not return. We have set the query timeAllowed to 5 minutes, but it seems that the query never gets that far and keeps running. During debugging I found that it gets stuck at m.find(), line 154 of SpanQueryLexer, after it has created tokens for the double quotes and the term SEARCH.

Whereas the above query without the apostrophe (') gets transformed to the following:
field:(SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) =
+spanNear([field:search, field:tools, field:solution, field:provider, field:technology, field:co, field:ltd], 0, true)

I need your help in understanding whether I am using the query incorrectly or whether this is a bug.
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216113#comment-14216113 ]

Modassar Ather edited comment on LUCENE-5205 at 11/18/14 12:01 PM:
---

I am trying the following queries and am facing an issue for which I need your suggestions. The environment is a 4 shard cluster with embedded zookeeper on one of them.

q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, field:and, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.)
Gets stuck and does not return. We have set the query timeAllowed to 5 minutes, but it seems that the query never gets that far and keeps running. During debugging I found that it gets stuck at m.find(), line 154 of SpanQueryLexer, after it has created tokens for the double quotes and the term SEARCH.

Whereas the above query without the apostrophe (') gets transformed to the following:
field: (SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) =
+spanNear([field:search, field:tools, field:solution, field:provider, field:technology, field:co, field:ltd], 0, true)

I need your help in understanding whether I am using the query incorrectly or whether this is a bug.

NOTE: A space between the field: and the query is added to avoid transformation to smileys.

was (Author: modassar):
I am trying the following queries and am facing an issue for which I need your suggestions. The environment is a 4 shard cluster with embedded zookeeper on one of them.

q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY)
Gets transformed to the following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, field:and, field:consulting, field:company], 0, true)

field: (SEARCH TOOL'S SOLUTION PROVIDER TECHNOLOGY CO., LTD.)
Gets stuck and does not return. We have set the query timeAllowed to 5 minutes, but it seems that the query never gets that far and keeps running. During debugging I found that it gets stuck at m.find(), line 154 of SpanQueryLexer, after it has created tokens for the double quotes and the term SEARCH.

Whereas the above query without the apostrophe (') gets transformed to the following:
field: (SEARCH TOOLS SOLUTION PROVIDER TECHNOLOGY CO., LTD.) =
+spanNear([field:search, field:tools, field:solution, field:provider, field:technology, field:co, field:ltd], 0, true)

I need your help in understanding whether I am using the query incorrectly or whether this is a bug.
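The comment above reports a call to m.find() that never returns. One general way to keep a java.util.regex match from running unbounded is to hand the Matcher a CharSequence that checks a deadline on every character read. The sketch below is not what SpanQueryLexer does, and it does not diagnose the cause of the hang reported here; `BoundedRegexSketch`, `DeadlineCharSequence` and `findWithin` are hypothetical names, and the technique only helps if the time is being spent inside the regex engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: abort a regex match once a wall-clock deadline has passed. */
final class BoundedRegexSketch {

    /** CharSequence wrapper that throws from charAt() after the deadline. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt(), so this
            // check runs repeatedly while a match is in progress.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex match timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() but gives up with an exception after timeoutMillis. */
    static boolean findWithin(Pattern pattern, String input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Matcher m = pattern.matcher(new DeadlineCharSequence(input, deadline));
        return m.find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("a+b");
        System.out.println(findWithin(p, "xxaab", 1_000));
    }
}
```

The clock check on every character costs some throughput, so a production version might only check every few thousand reads. A caller would catch the IllegalStateException and report a query syntax or timeout error instead of blocking the request thread.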
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216571#comment-14216571 ]

Tim Allison edited comment on LUCENE-5205 at 11/18/14 6:47 PM:
---

{quote}
field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY)
Gets transformed to following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, field:and, field:consulting, field:company], 0, true)
{quote}

Unfortunately, I can't think of a way around this. In the SpanQueryParser, single quotes should be used to mark a token that should not be further parsed, e.g. '/files/a/b/c/path.html' should be treated as a string, not a regex. I toyed with requiring a space before the start ' and a space after the ', but that seemed hacky. If you escape your apostrophes, you should get the results you expect (this is with a whitespace analyzer; you may get different results with StandardAnalyzer):
{noformat}
SEARCH TOOL\\'S SOLUTION PROVIDER\\'S TECHNOLOGY CO., LTD
{noformat}
yields:
{noformat}
f1:search f1:tool's f1:solution f1:provider's f1:technology f1:co., f1:ltd
{noformat}

{quote}
q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
Gets transformed to following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)
{quote}

I think this is fixed on github. What Analyzer chain are you using?

was (Author: talli...@mitre.org):
{quote}
field: (SEARCH TOOL'S PROVIDER'S AND CONSULTING COMPANY)
Gets transformed to following:
+spanNear([field:search, spanNear([field:s, field:provider], 0, true), field:s, field:and, field:consulting, field:company], 0, true)
{quote}

Unfortunately, I can't think of a way around this. In the SpanQueryParser, single quotes should be used to mark a token that should not be further parsed, e.g. '/files/a/b/c/path.html' should be treated as a string, not a regex.
I toyed with requiring a space before the start ' and a space after the ', but that seemed hacky. If you escape your apostrophes, you should get the results you expect (this is with a whitespace analyzer; you may get different results with StandardAnalyzer):
{noformat}
SEARCH TOOL\\'S SOLUTION PROVIDER\\'S TECHNOLOGY CO., LTD
{noformat}
yields:f1:search f1:tool's f1:solution f1:provider's f1:technology f1:co., f1:ltd
{noformat}

{quote}
q=field: (SEARCH TOOLS PROVIDER CONSULTING COMPANY)
Gets transformed to following:
+spanNear([field:search, field:tools, field:provider, field:, field:consulting, field:company], 0, true)
{quote}

I think this is fixed on github. What Analyzer chain are you using?
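The advice in the comment above, escaping apostrophes before the query reaches the parser, could be applied on the client side with a small helper. This is a sketch under assumptions: `ApostropheEscapeSketch` and `escapeApostrophes` are hypothetical names, not part of the patch, and it assumes the doubled backslash shown in the comment is Java string-literal escaping, so that the parser itself sees a single backslash before each apostrophe.

```java
/** Sketch: backslash-escape apostrophes in raw user input; not part of the patch. */
final class ApostropheEscapeSketch {

    /** Adds a backslash before every apostrophe that is not already escaped. */
    static String escapeApostrophes(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 4);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                // Copy an existing escape pair untouched so the helper is idempotent.
                sb.append(c).append(raw.charAt(++i));
            } else {
                if (c == '\'') {
                    sb.append('\\');
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            escapeApostrophes("SEARCH TOOL'S SOLUTION PROVIDER'S TECHNOLOGY CO., LTD"));
    }
}
```

Note that this escapes every apostrophe, so it is only suitable for input where single quotes are never meant as the parser's literal-token markers (such as a quoted path).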
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070529#comment-14070529 ]

Tim Allison edited comment on LUCENE-5205 at 7/23/14 11:23 AM:
---

Unrelated to work on LUCENE-5758, I added a standalone package including a jar to track with the current latest stable distro of Lucene here: https://github.com/tballison/lucene-addons/tree/master/lucene-5205

For trunk integration, see the lucene-5205 branch of my fork on github.

was (Author: talli...@mitre.org):
Unrelated to work on LUCENE-5758, I added a standalone package including a jar to track with the current latest stable distro of Lucene here: https://github.com/tballison/tallison-lucene-addons/tree/master/lucene-5205

For trunk integration, see the lucene-5205 branch of my fork on github.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029566#comment-14029566 ]

Paul Elschot edited comment on LUCENE-5205 at 6/12/14 6:35 PM:
---

Created LUCENE-5758 to extend SpanQueryParser with positional joins.

was (Author: paul.elsc...@xs4all.nl):
Created LUCENE-5728 to extend SpanQueryParser with positional joins.
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13935395#comment-13935395 ]

Tim Allison edited comment on LUCENE-5205 at 3/14/14 6:48 PM:
--

[~otis], thank you for raising this point for discussion. Yes, I acknowledged LUCENE-2878 in the original description of this issue, and that process has been ongoing since 2011. My hope was that key SpanQuery functionality was going to be moved over to regular queries. If this happens, I can modify this parser to handle those mods. Whatever functionality of Spans that is not moved over will have to disappear from this parser, and that could be painful depending on what functionality is not transitioned.

When SpanQueries get nuked, what will happen to:
1) in order and not in order phrases
2) functionality of SpanNot
3) searching for a phrase within a proximity of something

I think those are the three main things that can't be handled by regular queries at this point.

was (Author: talli...@mitre.org):
[~otis], thank you for raising this point for discussion. Yes, I acknowledged LUCENE-2878 in the original description of this issue, and that process has been ongoing since 2011. My hope was that key SpanQuery functionality was going to be moved over to regular queries. If this happens, I can modify this parser to handle those mods. Whatever functionality of Spans that is not moved over will have to disappear from this parser, and that could be painful depending on what functionality is not transitioned.

So, when SpanQueries get nuked, will it be possible to create a query to:
1) search for a set of words in order and not in order (SpanNear)
2) search for a phrase or an Or clause near another phrase or an Or clause (recursion capability of SpanQuery)?

I think those are the two main things that can't be handled by regular queries at this point.
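To make the capabilities listed in the comment above concrete, here is a minimal, self-contained sketch of what "in order vs. not in order" proximity and "not near" mean over token positions. It is a toy model only: the class `ProximitySketch` and its methods are hypothetical names invented for illustration, and it does not use or reproduce Lucene's SpanNearQuery or SpanNotQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of positional matching; NOT Lucene's SpanQuery implementation. */
class ProximitySketch {

    /** Returns the positions at which {@code term} occurs in {@code tokens}. */
    static List<Integer> positions(List<String> tokens, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    /**
     * True if a and b occur with at most {@code slop} tokens between them.
     * When {@code inOrder} is set, a must come before b.
     */
    static boolean near(List<String> tokens, String a, String b, int slop, boolean inOrder) {
        for (int pa : positions(tokens, a)) {
            for (int pb : positions(tokens, b)) {
                if (pa == pb) {
                    continue; // same position (only possible when a equals b)
                }
                int gap = Math.abs(pb - pa) - 1; // tokens strictly between the two hits
                boolean orderOk = !inOrder || pa < pb;
                if (orderOk && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * True if some occurrence of {@code include} has no occurrence of {@code exclude}
     * within {@code pre} tokens before it or {@code post} tokens after it.
     */
    static boolean notNear(List<String> tokens, String include, String exclude, int pre, int post) {
        List<Integer> excluded = positions(tokens, exclude);
        for (int pi : positions(tokens, include)) {
            boolean blocked = false;
            for (int pe : excluded) {
                if (pe >= pi - pre && pe <= pi + post) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("apache", "software", "jakarta", "project");
        System.out.println(near(doc, "jakarta", "apache", 3, false)); // unordered: true
        System.out.println(near(doc, "jakarta", "apache", 3, true));  // ordered: false

        List<String> doc2 = Arrays.asList("fever", "pitch", "bieber");
        System.out.println(notNear(doc2, "fever", "bieber", 3, 10)); // false
        System.out.println(notNear(doc2, "fever", "bieber", 3, 1));  // true
    }
}
```

The nested loops are quadratic in the number of hits, which is fine for illustration; a real implementation would walk sorted position lists instead.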
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923915#comment-13923915 ]

Tim Allison edited comment on LUCENE-5205 at 3/7/14 2:36 PM:
-

The root of this problem is that SpanNearQuery has no good way to handle stopwords in a way analogous to PhraseQuery. In SpanQueryParser, this limitation should be well described in the javadocs to SpanQueryParser and in the test cases. Let me know if it isn't.

You have the option of throwing an exception when a stopword is found to notify the user about stopwords, but that's exceedingly unsatisfactory.

Without digging into the internals of SpanNearQuery, we can still do better on this. One proposal is to do what the basic highlighter does and risk false positives: behind the scenes, modify "calculator for evaluating" to "calculator evaluating"~1. This would then falsely match "calculator zebra evaluating". PhraseQuery can have false positives, too, but it guarantees that the false hit has to be a stop word. This solution would not do that. So, is this better than no matches at all?

was (Author: talli...@mitre.org):
The root of this problem is that SpanNearIQuery has no good way to handle stopwords in a way analogous to PhraseQuery. In SpanQueryParser, this limitation should be well described in the javadocs to SpanQueryParser and in the test cases. Let me know if it isn't.

You have the option of throwing an exception when a stopword is found to notify the user about stopwords, but that's exceedingly unsatisfactory.

Without digging into the internals of SpanNearQuery, we can still do better on this. One proposal is to do what the basic highlighter does and risk false positives: behind the scenes, modify "calculator for evaluating" to "calculator evaluating"~1. This would then falsely match "calculator zebra evaluating". PhraseQuery can have false positives, too, but it guarantees that the false hit has to be a stop word. This solution would not do that. So, is this better than no matches at all?
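The trade-off in the stopword proposal above can be shown with a small, self-contained sketch. It assumes a hypothetical stopword set and a toy in-order matcher over plain token lists (`StopwordSlopSketch` and its methods are invented for illustration and are not part of Lucene or of the patch). Dropping the stopword and widening the slop by one recovers the intended match, but also admits a non-stopword in the gap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration of "drop the stopword, add slop"; not Lucene code. */
class StopwordSlopSketch {

    /** Hypothetical stopword list, for illustration only. */
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("for", "the", "of"));

    /** Returns the query terms with stopwords removed. */
    static List<String> removeStopwords(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String t : terms) {
            if (!STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** True if all terms occur in order with at most {@code slop} extra tokens in total between them. */
    static boolean matchesInOrder(List<String> doc, List<String> terms, int slop) {
        if (terms.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (doc.get(start).equals(terms.get(0)) && extend(doc, terms, 1, start, slop)) {
                return true;
            }
        }
        return false;
    }

    /** Tries to place terms[next..] after prevPos, spending at most slopLeft gap tokens. */
    private static boolean extend(List<String> doc, List<String> terms, int next, int prevPos, int slopLeft) {
        if (next == terms.size()) {
            return true;
        }
        for (int gap = 0; gap <= slopLeft; gap++) {
            int pos = prevPos + 1 + gap;
            if (pos >= doc.size()) {
                break;
            }
            if (doc.get(pos).equals(terms.get(next))
                    && extend(doc, terms, next + 1, pos, slopLeft - gap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("calculator", "for", "evaluating");
        List<String> terms = removeStopwords(query);
        int slop = query.size() - terms.size(); // one removed stopword, so slop 1

        List<String> intended = Arrays.asList("calculator", "for", "evaluating");
        List<String> unintended = Arrays.asList("calculator", "zebra", "evaluating");

        System.out.println(matchesInOrder(intended, terms, slop));   // true: the hit we want
        System.out.println(matchesInOrder(unintended, terms, slop)); // true: the false positive
        System.out.println(matchesInOrder(intended, terms, 0));      // false: no match at all without slop
    }
}
```

The matcher compares only surviving query terms against document tokens, so it cannot tell whether the token in the gap is a stopword. That is the property PhraseQuery preserves and this workaround gives up.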
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907304#comment-13907304 ]

Tim Allison edited comment on LUCENE-5205 at 2/20/14 7:03 PM:
--

Code duplication. The biggest offenders are in test (I think, let me know if you disagree):
1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend QPTestBase, but that will require some reworking of QPTestBase as well, and I didn't want to touch that (hence the duplication). It would also help to add a getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't want to touch anything outside of the parser at the cost of duplication.
2) TestMultiAnalyzer. This relies on testing equality of string representations of queries. Will have to modify TestMultiAnalyzer in a way similar to QPTestBase.
3) TestComplexPhraseQuery. Should be straightforward to extend the original, but will need to make checkMatches public so that I can override it. I'll also have to move the tests with slightly different syntax into a different test, but that's easy and would help declutter.

There's other code duplication with AnalyzingQueryParser...should we break that functionality out into a helper class? Any other major duplication areas?

Yes, I don't like the reinit at all. The reason that's there was so that I could extend QueryParserBase, but I'm not sure that that decision buys much anymore. As I remember, it buys date parsing in range queries (which I'm now not sure I actually want) and addBoolean; there may be more, but I'm not sure there is. It would clean up a fair bit of code if I implement CommonQueryParserConfiguration instead of extending QueryParserBase. I'd still have to leave in some things that don't make sense for the SpanQueryParser, though: lowerCaseExpandedTerms, enablePositionIncrements. Another option would be to abandon CQPC, but I wanted this parser to at least implement that interface. Let me know what makes sense.

As for the public base classes, yes, those can go private for now. I made them public in case anyone wanted to extend them, but, as you point out, then I really ought to add javadocs and treat them as if they were public (which they are!).

As for date/locale issues, I'll take a look.

was (Author: talli...@mitre.org):
Code duplication. The biggest offenders are in test (I think, let me know if you disagree):
1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend QPTestBase, but that will require some reworking of QPTestBase as well, and I didn't want to touch that (hence the duplication). It would also help to add a getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't want to touch anything outside of the parser at the cost of duplication.
2) TestMultiAnalyzer...not sure how not to duplicate. This relies on testing equality of string representations of queries.
3) TestComplexPhraseQuery. Should be straightforward to extend the original, but will need to make checkMatches public so that I can override it. I'll also have to move the tests with slightly different syntax into a different test, but that's easy and would help declutter.

There's other code duplication with AnalyzingQueryParser...should we break that functionality out into a helper class? Any other major duplication areas?

Yes, I don't like the reinit at all. The reason that's there was so that I could extend QueryParserBase, but I'm not sure that that decision buys much anymore. As I remember, it buys date parsing in range queries (which I'm now not sure I actually want) and addBoolean; there may be more, but I'm not sure there is. It would clean up a fair bit of code if I implement CommonQueryParserConfiguration instead of extending QueryParserBase. I'd still have to leave in some things that don't make sense for the SpanQueryParser, though: lowerCaseExpandedTerms, enablePositionIncrements. Another option would be to abandon CQPC, but I wanted this parser to at least implement that interface. Let me know what makes sense.

As for the public base classes, yes, those can go private for now. I made them public in case anyone wanted to extend them, but, as you point out, then I really ought to add javadocs and treat them as if they were public (which they are!).

As for date/locale issues, I'll take a look.
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907354#comment-13907354 ]

Tim Allison edited comment on LUCENE-5205 at 2/20/14 7:20 PM:
--

{quote}
I know this probably exists elsewhere, i swear there is something in QPBase doing this for range queries.
{quote}
Busted...yeah...I meant to include that in my list of duplication. Sorry.

As for Reinit, you're absolutely right...no one is calling that...let's go with documentation and assert false.

Will start some patches on the smaller stuff.

Thank you, again!

was (Author: talli...@mitre.org):
{quote}
I know this probably exists elsewhere, i swear there is something in QPBase doing this for range queries.
{quote}
Busted...yeah...I meant to include that in my list of duplication. Sorry.

As for Reinit, you're absolutely right...no one is calling that...let's go with documentation and assert false.

Will start some patches on the smaller stuff.

Thank you, again!

As
[jira] [Comment Edited] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907588#comment-13907588 ]

Tim Allison edited comment on LUCENE-5205 at 2/20/14 10:01 PM:
---

Back on the issue of reinventing analyzeMultiterm...part of that reinvention was because I was getting a setReader() in wrong state exception in one of my tests. With analyzeMultiterm in QueryParserBase as it stands, the token stream is not consumed if an exception is thrown. Therefore, next time you run the parser (with the same analyzer) you can get:

{noformat}
[junit4] Throwable #1: java.lang.AssertionError: setReader() called in wrong state: INCREMENT_FALSE
[junit4]    at __randomizedtesting.SeedInfo.seed([6E1DC3D6C716BC75:EDEE4C0E5E329586]:0)
[junit4]    at org.apache.lucene.analysis.MockTokenizer.setReaderTestPoint(MockTokenizer.java:266)
[junit4]    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:92)
[junit4]    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:304)
[junit4]    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:181)
{noformat}

Should I fix this in QueryParserBase or did I not build the test analyzer correctly?

was (Author: talli...@mitre.org):
Back on the issue of reinventing analyzeMultiterm...part of that reinvention was because I was getting a Analyzer.setReader() in wrong state exception in one of my tests. With analyzeMultiterm in QueryParserBase as it stands, the analyzer is not consumed if an exception is thrown. Therefore, next time you run the parser (with the same analyzer) you can get:

{noformat}
[junit4] Throwable #1: java.lang.AssertionError: setReader() called in wrong state: INCREMENT_FALSE
[junit4]    at __randomizedtesting.SeedInfo.seed([6E1DC3D6C716BC75:EDEE4C0E5E329586]:0)
[junit4]    at org.apache.lucene.analysis.MockTokenizer.setReaderTestPoint(MockTokenizer.java:266)
[junit4]    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:92)
[junit4]    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:304)
[junit4]    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:181)
{noformat}

Should I fix this in QueryParserBase or did I not build the test analyzer correctly?
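The failure mode described above, a reusable stream left open when an exception escapes and therefore rejecting its next use, can be sketched without Lucene. Everything here is a hypothetical stand-in: `ReusableStream` only loosely imitates the lifecycle check that the MockTokenizer stack trace reports, and `analyzeLeaky`/`analyzeSafe` are illustrative names, not the actual analyzeMultiterm code. With a real Lucene TokenStream the consumer workflow is reset(), incrementToken(), end(), close(); the point of the sketch is only that close() belongs in a finally block (or try-with-resources).

```java
/** Toy reproduction of a "wrong state" error after an unclosed stream; not Lucene code. */
class MultitermAnalysisSketch {

    /** Throws on multi-token input and, in that case, never closes the stream. */
    static String analyzeLeaky(ReusableStream stream, String text) {
        stream.setReader(text);
        String first = stream.nextToken();
        if (stream.nextToken() != null) {
            // The stream is still open when this propagates.
            throw new IllegalArgumentException("expected a single token: " + text);
        }
        stream.close();
        return first;
    }

    /** Same logic, but the stream is always closed, even when the check throws. */
    static String analyzeSafe(ReusableStream stream, String text) {
        stream.setReader(text);
        try {
            String first = stream.nextToken();
            if (stream.nextToken() != null) {
                throw new IllegalArgumentException("expected a single token: " + text);
            }
            return first;
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        ReusableStream shared = new ReusableStream();
        try {
            analyzeLeaky(shared, "two tokens");
        } catch (IllegalArgumentException expected) {
            System.out.println("first call failed: " + expected.getMessage());
        }
        try {
            analyzeLeaky(shared, "ok");
        } catch (IllegalStateException wrongState) {
            System.out.println("second call failed: " + wrongState.getMessage());
        }
    }
}

/** Hypothetical reusable token source that checks its own lifecycle. */
class ReusableStream implements AutoCloseable {
    private enum State { CLOSED, READY, CONSUMING }

    private State state = State.CLOSED;
    private String[] tokens = new String[0];
    private int next;

    /** Rebinds the stream to new text; only legal once the previous use was closed. */
    void setReader(String text) {
        if (state != State.CLOSED) {
            throw new IllegalStateException("setReader() called in wrong state: " + state);
        }
        tokens = text.trim().split("\\s+");
        next = 0;
        state = State.READY;
    }

    /** Returns the next whitespace-separated token, or null when exhausted. */
    String nextToken() {
        state = State.CONSUMING;
        return next < tokens.length ? tokens[next++] : null;
    }

    @Override
    public void close() {
        state = State.CLOSED;
    }
}
```

Whether the real fix belongs in QueryParserBase or in the test analyzer is the open question in the comment; the sketch only shows why an unclosed stream poisons the next call on the same analyzer.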