[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

Robert Muir (JIRA) Thu, 20 Feb 2014 11:12:12 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907339#comment-13907339
 ]


Robert Muir commented on LUCENE-5205:
-------------------------------------

{quote}
Code duplication. The biggest offenders are in test (I think, let me know if 
you disagree):
{quote}

I guess i wasn't even thinking about test code, i just see code like this 
method in AnalyzingQueryBase (as an example):

{code}
  protected BytesRef analyzeMultitermTermParseEx(String field, String part) 
throws ParseException {
{code}

I know this probably exists elsewhere, i swear there is something in QPBase 
doing this for range queries. :)

As far as test code:
{quote}
1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend 
QPTestBase, but that will require some reworking of QPTestBase as, and I didn't 
want to touch that (hence the duplication). It would also help to add a 
getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't 
want to touch anything outside of the parser at the cost of duplication. 
{quote}

I worked on this last night. Basically i split up QPTestBase into two parts:
1. test harness, test analyzers, helper assertions, etc (QPTestCase)
2. actual concrete test methods for historical lucene behavior.

I fixed TestSpanQPBasedonQPTestBase to extend #1, I'm pretty happy with it? It 
removed all the duplicate test analyzers and assert methods and so on. Let me 
know if you have any concerns with what i did there.

{quote}
It would also help to add a getQuery() to SpanMultitermQueryWrapper to test for 
equality...again, I didn't want to touch anything outside of the parser at the 
cost of duplication. 
{quote}

Hmm, i guess i'm of the opposite mentality. I can see your point about not 
touching existing code, as it makes this contribution a 'standalone' change, 
but personally i'd rather us fix stuff like this! That being said, i did in 
fact recently add such a method while working on an unrelated issue 
(LUCENE-5415):
{code}
  /** Returns the wrapped query */
  public Query getWrappedQuery() {
    return query;
  }
{code}

Does this work? Want to upload a patch that uses this?

{quote}
3) TestComplexPhraseQuery. Should be straightforward to extend the original, 
but will need to make checkMatches public so that I can override it. I'll also 
have to move the tests with slightly different syntax into a different test, 
but that's easy and would help declutter.
{quote}

Ok, this sounds good.

{quote}
There's other code duplication with AnalyzingQueryParser...should we break that 
functionality out into a helper class?
{quote}

Sounds like a good idea, i havent looked at the two, but seems worthwhile.

{quote}
Y, I don't like the reinit at all. The reason that's there was so that I could 
extend QueryParserBase, but I'm not sure that that decision buys much anymore. 
As I remember, it buys date parsing in range queries (which I'm now not sure I 
actually want) and addBoolean; there may be more, but I'm not sure there is.

It would clean up a fair bit of code if I implement 
CommonQueryParserConfiguration instead of extending QueryParserBase. I'd still 
have to leave in some things that don't make sense for the SpanQueryParser, 
though: lowerCaseExpandedTerms, enablePositionIncrements. Another option would 
be to abandon CQPC, but I wanted this parser to at least implement that 
interface. Let me know what makes sense. 
{quote}

I think the current base class is fine. I guess my question is, who is calling 
reinit :)

{quote}
As for the public base classes, y, those can go private for now. I made them 
public in case anyone wanted to extend them, but, as you point out, then I 
really ought to add javadocs and treat them as if they were public (which they 
are!).
{quote}

sounds good, maybe you want to upload a patch making the appropriate things 
pkg-private?  We can always open things back up later, if we need.




> [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
> classic QueryParser
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>             Fix For: 4.7
>
>         Attachments: LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
> LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, 
> SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms 
> (wildcard, fuzzy, regex, prefix),
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first pass BooleanQuery/field parser and then a 
> span query parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test 
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /\[mb\]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>  
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the \~> operator: 
> "jakarta apache"\~>3
> * Can specify "not near": "fever bieber"!\~3,10 ::
>     find "fever" but not if "bieber" appears within 3 words before or 10 
> words after it.
> * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta 
> apache\]~3 lucene\]\~>4 :: 
>     find "jakarta" within 3 words of "apache", and that hit has to be within 
> four words before "lucene"
> * Can also use \[\] for single level phrasal queries instead of " as in: 
> \[jakarta apache\]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"\~3 
> :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta\~1 ap*che"\~2
> * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ 
> /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like "jakarta" within two 
> words of "ap*che" and that hit has to be within ten words of something like 
> "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level: "apache AND (lucene 
> solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain 
> "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of 
> potential performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, 
> prefix =2)
> * Can specifiy Optimal String Alignment (OSA) vs Levenshtein for distance 
> <=2: (jakarta~1 (OSA) vs jakarta~>1(Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317 
> and LUCENE-5318) and for analytical search.  
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome.  Thank you.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser

Reply via email to