[
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906267#comment-13906267
]
Robert Muir commented on LUCENE-5205:
-------------------------------------
OK, I took a look and put the current state in a branch
(https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205). This way, we
don't have to fling enormous patches around on this issue.
A few things I noticed and fixed (or tried to):
* formatting: e.g. no space between parens or exception names and braces
* whitespace: e.g. adding a line between each method, between the Apache license
header and imports, and so on
* one test class didn't extend LuceneTestCase
* javadoc errors (these show up as compile errors in my IDE), e.g. @param foo
with no actual text
While going through the code, I came up with some initial concerns that maybe we
can figure out how to address:
* public classes with no javadocs. If it's important enough to be a public
class, we should at least have a javadoc on it saying what it does.
* should we just nuke the spans.tokens package? This doesn't seem useful to any
end user and folding this into .spans as package-private classes could greatly
reduce the API surface area.
* lots of code copy-pasted from elsewhere (maybe with tweaks), including
test code. We should try to refactor this. I can try to look at
the tests tonight and see if I can improve that side.
Any patches welcome against the branch.
> [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to
> classic QueryParser
> -----------------------------------------------------------------------------------------------
>
> Key: LUCENE-5205
> URL: https://issues.apache.org/jira/browse/LUCENE-5205
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser
> Reporter: Tim Allison
> Labels: patch
> Fix For: 4.7
>
> Attachments: LUCENE-5205.patch.gz, LUCENE-5205.patch.gz,
> LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax.
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms
> (wildcard, fuzzy, regex, prefix).
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first-pass BooleanQuery/field parser, and then a
> span query parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /[mb]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the ~> operator:
> "jakarta apache"~>3
> * Can specify "not near": "fever bieber"!~3,10 ::
> find "fever" but not if "bieber" appears within 3 words before or 10
> words after it.
> * Fully recursive phrasal queries with [ and ], as in:
> [[jakarta apache]~3 lucene]~>4 ::
> find "jakarta" within 3 words of "apache", and that hit has to be within
> four words before "lucene"
> * Can also use [] for single level phrasal queries instead of " as in:
> [jakarta apache]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"~3
> :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta~1 ap*che"~2
> * Did I mention full recursion:
> [[jakarta~1 ap*che]~2 (solr~ /l[ou]+[cs][en]+/)]~10 ::
> Find something like "jakarta" within two words of "ap*che" and that hit
> has to be within ten words of something like "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level:
> apache AND (lucene solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain
> "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of
> potential performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance = 1,
> prefix length = 2)
> * Can specify Optimal String Alignment (OSA) vs. Levenshtein for distance
> <= 2: jakarta~1 (OSA) vs. jakarta~>1 (Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317
> and LUCENE-5318) and for analytical search.
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome. Thank you.
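
For anyone unsure what the OSA vs. Levenshtein distinction in the description
means in practice, here is a minimal, self-contained sketch. It is not code
from the patch, and the class and method names (EditDistanceSketch,
levenshtein, osa) are made up for illustration. The only difference between
the two measures is that OSA counts a swap of two adjacent characters as one
edit, whereas plain Levenshtein needs two.

```java
// Illustrative sketch only; NOT taken from the LUCENE-5205 patch.
// All names here are hypothetical.
final class EditDistanceSketch {

    /**
     * Dynamic-programming edit distance. When allowTranspositions is true,
     * swapping two adjacent characters costs 1 (Optimal String Alignment);
     * when false, this is plain Levenshtein distance.
     */
    static int distance(String a, String b, boolean allowTranspositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i chars
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j chars
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // OSA only: treat an adjacent swap as a single edit.
                if (allowTranspositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    static int levenshtein(String a, String b) {
        return distance(a, b, false);
    }

    static int osa(String a, String b) {
        return distance(a, b, true);
    }

    public static void main(String[] args) {
        // "jakrata" is "jakarta" with the adjacent "ar" swapped to "ra".
        System.out.println(levenshtein("jakarta", "jakrata")); // prints 2
        System.out.println(osa("jakarta", "jakrata"));         // prints 1
    }
}
```

So, going by the syntax in the description, a term like "jakrata" would be
within reach of jakarta~1 (OSA) but not of jakarta~>1 (Levenshtein).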