[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

Mark Harwood (JIRA) Wed, 22 Jul 2009 08:17:40 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734148#action_12734148 ]


Mark Harwood commented on LUCENE-1486:
--------------------------------------

I'll try and catch up with some of the issues raised here:

bq. What do you mean on the last check by phrase inside phrase, I don't see any 
phrase inside a phrase

Correct, the "inner phrase" example was a term not a phrase. This is perhaps a 
better example:

                checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad

bq. I'm trying now to figure out what is supported 

The JUnit test is currently the main form of documentation. Unlike the
XMLQueryParser (which has a DTD), there is no formal definition that captures
the syntax.
Here is a basic summary of the supported syntax and how it differs from normal
non-phrase use of the same operators:

* Wildcard/fuzzy/range clauses can be used to define a phrase element (rather
than only single terms)
* Brackets group the acceptable variations for a given phrase element,
e.g. "(john OR jonathon) smith"
* "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding
all phrase elements
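
To make the implied "AND_NEXT_TO" idea concrete, here is a minimal standalone sketch in plain Java. It is an illustration only and not how ComplexPhraseQueryParser is implemented: the class PhraseElementSketch and its methods anyOf, prefix and matches are hypothetical names, it assumes whitespace tokenisation and zero slop, and it does not use any Lucene classes. Each phrase element is modelled as a predicate over a single term, and a phrase matches when adjacent tokens satisfy the elements in order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration only; not the ComplexPhraseQueryParser implementation. */
public class PhraseElementSketch {

    /** A phrase element accepting any of the listed terms, e.g. (john OR jonathon). */
    public static Predicate<String> anyOf(String... terms) {
        List<String> accepted = Arrays.asList(terms);
        return accepted::contains;
    }

    /** A phrase element accepting any term with the given prefix, e.g. jo*. */
    public static Predicate<String> prefix(String p) {
        return term -> term.startsWith(p);
    }

    /**
     * True if some run of adjacent tokens satisfies the elements in order.
     * This is the implied "AND_NEXT_TO" binding, with zero slop assumed.
     */
    public static boolean matches(List<Predicate<String>> elements, String text) {
        // Assumption: lower-casing and whitespace splitting stand in for an analyzer.
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int start = 0; start + elements.size() <= tokens.length; start++) {
            boolean ok = true;
            for (int i = 0; i < elements.size() && ok; i++) {
                ok = elements.get(i).test(tokens[start + i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Models the phrase "(john OR jonathon) smith".
        List<Predicate<String>> phrase =
                Arrays.asList(anyOf("john", "jonathon"), anyOf("smith"));
        System.out.println(matches(phrase, "Jonathon Smith"));      // true
        System.out.println(matches(phrase, "john percival smith")); // false: not adjacent
    }
}
```

A wildcard element slots in the same way, e.g. Arrays.asList(prefix("jo"), anyOf("smith")) as a rough stand-in for "jo* smith".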

To move this forward I would suggest we consider following one of these options:

1) Keep in core and improve error reporting and documentation
2) Move into "contrib" as experimental 
3) Retain in core but simplify it to support only the simplest syntax (as in my 
Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for 
acceptable "within phrase" operators e.g. *, ~, ( ) 

I think 1) is achievable if we carefully define where the existing parser
breaks (e.g. ANDs and nested brackets).
2) is unnecessary if we can achieve 1).
3) would be a shame if we lost useful features because of some very convoluted
edge cases.
4) is beyond my JavaCC skills.

> Wildcards, ORs etc inside Phrase queries
> ----------------------------------------
>
>                 Key: LUCENE-1486
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1486
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Assignee: Mark Harwood
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>               checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
>               checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
>               checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
>
>               checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
>               checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
>               checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


