[jira] Issue Comment Edited: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

Luis Alves (JIRA) Tue, 10 Nov 2009 14:36:00 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776123#action_12776123
 ]


Luis Alves edited comment on LUCENE-2039 at 11/10/09 10:33 PM:
---------------------------------------------------------------

Hi Simon, 

I think one problem lucene has today is that the queryparser code is very 
tightly integrated with the javacc code. If we continue to do that it will 
always be very difficult to create a standard way of making small changes to 
the current queryparser.

I like the implementation proposed by Simon; it is very similar to the opaque 
term idea, but I would prefer not to overload the field names.
{quote}
The alternative idea is to utilize the fact that queries enclosed in double 
quotes are passed to getFieldQuery() and are not interpreted by the grammar. 
Extension queries could be embedded in quotes while the content needs to be 
escaped (that is already the case, though). To identify which extension should 
be used we could utilize the field name and a pattern so that users could plug 
in an extension mapped to some field name pattern. Something like: 
re_field:"^.*$" -> (re_field, RegexExtension)
{quote}

We should decouple the user extensions from the JavaCC-generated code. Just 
as the new queryparser framework does, the queryparser should allow the user 
to register these extensions at run time, and should provide an interface that 
extensions implement.

For example, something like this:
{code}
// Proposed API only; none of these classes or methods exist in Lucene today.
QueryParser qp = QueryParserFactory.getInstance("3.0");
qp.registerOpaqueTerm("regexp", new QueryParserRegExpParser());
qp.registerOpaqueTerm("complex_phrase", new QueryParserComplexPhraseParser());
...
qp.parse(" regexp:\"/blah*/\" complex_phrase:\"(sun OR sunny) sky\" ",...);
{code}
Of course this is not possible with the lucene queryparser code today :(,
but this is the idea I think we should try to implement.
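
To make the registration idea above a bit more concrete, here is a minimal, 
self-contained sketch. It is only an illustration under stated assumptions: 
OpaqueTermParser and OpaqueTermRegistry are hypothetical names, not Lucene 
classes, and a plain String stands in for the Query an extension would build.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo entry point; nothing here is part of Lucene.
class OpaqueTermDemo {
    public static void main(String[] args) {
        OpaqueTermRegistry registry = new OpaqueTermRegistry();
        // Register extensions at run time, keyed by extension name.
        registry.register("regexp", text -> "RegexQuery(" + text + ")");
        registry.register("complex_phrase", text -> "ComplexPhraseQuery(" + text + ")");
        System.out.println(registry.parse("regexp", "/blah*/"));
    }
}

// Hypothetical contract an extension would implement: it receives the raw,
// uninterpreted text and returns a query (a String stands in for Query here).
interface OpaqueTermParser {
    String parse(String rawText);
}

// Hypothetical registry that a query parser could consult, so that the
// generated grammar code never needs to know about individual extensions.
class OpaqueTermRegistry {
    private final Map<String, OpaqueTermParser> parsers = new HashMap<>();

    void register(String name, OpaqueTermParser parser) {
        parsers.put(name, parser);
    }

    String parse(String name, String rawText) {
        OpaqueTermParser parser = parsers.get(name);
        if (parser == null) {
            throw new IllegalArgumentException("No extension registered for: " + name);
        }
        return parser.parse(rawText);
    }
}
```

The point of the sketch is only the decoupling: the parser looks extensions up 
by name, so adding one does not require touching the grammar.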

For the problem of field overload:
In your proposal we lose the field name information for the extensions, so we 
need another solution that would allow the fieldname to be available to the 
extensions.

Here is another idea, that would allow for fieldnames not to be overloaded,
and allow regular term or phrase syntax for extensions.
{code}
syntax:
extension:fieldname:"syntax"

examples:
regexp:title:"/blah[a-z]+[0-9]+/"  <- regexp extension, title index field
complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

regexp::"/blah[a-z]+[0-9]+/"  <- regexp extension, default field
complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

title:"blah" <- regular field query

{code}
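
As a rough illustration of how the extension:fieldname:"syntax" form could be 
recognized, here is a small sketch using java.util.regex. ExtensionClause is a 
hypothetical helper, not an existing Lucene class, and the sketch assumes 
extension and field names consist of word characters only; escaped quotes 
inside the body are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo entry point; nothing here is part of Lucene.
class ExtensionClauseDemo {
    public static void main(String[] args) {
        System.out.println(ExtensionClause.parse("regexp:title:\"/blah[a-z]+[0-9]+/\""));
        System.out.println(ExtensionClause.parse("complex_phrase::\"(sun OR sunny) sky\""));
        System.out.println(ExtensionClause.parse("title:\"blah\""));
    }
}

// Hypothetical value class for one parsed extension clause.
final class ExtensionClause {
    // extension ':' optional-fieldname ':' '"' body '"'
    private static final Pattern EXTENSION =
        Pattern.compile("(\\w+):(\\w*):\"(.*)\"");

    final String extension; // extension name, e.g. "regexp"
    final String field;     // index field, or null for the default field
    final String text;      // raw body handed to the extension untouched

    private ExtensionClause(String extension, String field, String text) {
        this.extension = extension;
        this.field = field;
        this.text = text;
    }

    // Returns the parsed clause, or null when the input is not an extension
    // clause (for example a regular field query such as title:"blah").
    static ExtensionClause parse(String clause) {
        Matcher m = EXTENSION.matcher(clause);
        if (!m.matches()) {
            return null;
        }
        String field = m.group(2).isEmpty() ? null : m.group(2);
        return new ExtensionClause(m.group(1), field, m.group(3));
    }

    @Override
    public String toString() {
        return "extension=" + extension + ", field=" + field + ", text=" + text;
    }
}
```

Because the field name travels separately from the extension name, the 
extension still knows which index field it is building a query for.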



> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser was limited to the queries 
> living in core; adding other queries or extending the parser in any way 
> always forced people to change the grammar file and regenerate. Even if you 
> change the grammar you have to be extremely careful how you modify the parser 
> so that other parts of the standard parser are not affected by customisation 
> changes. Eventually you had to live with all the limitations the current 
> parser has, like tokenizing on whitespace before a tokenizer / analyzer has 
> the chance to look at the tokens. 
> I was thinking about how to overcome this limitation and add regex support to 
> the query parser without introducing any dependency to core. I added a new 
> special character that basically prevents the parser from interpreting any of 
> the characters enclosed in the new special characters. I chose the forward 
> slash '/' as the delimiter so that everything in between two forward slashes 
> is basically escaped and ignored by the parser. All chars embedded within 
> forward slashes are treated as one token even if it contains other special 
> chars like * []?{} or whitespace. This token is subsequently passed to a 
> pluggable "parser extension" which builds a query from the embedded string. I 
> do not interpret the embedded string in any way but leave all the subsequent 
> work to the parser extension. Such an extension could be another 
> full-featured query parser itself or simply a ctor call for a regex query. 
> The interface remains quite simple but makes the parser extensible in an easy 
> way compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char 
> into the syntax, but I guess that would not be that much of a deal as it is 
> reflected in the escape method. It would truly be nice to have more than one 
> extension and make this even more flexible, so treat this patch as a 
> kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK 
> version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better as it would be more consistent with the idea of the 
> query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based 
> approach. I guess I will add a second patch with regex in core soon too.
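
The slash-delimited tokenization described in the issue can be sketched 
roughly as follows. This is not the code from the attached patch: 
SlashTokenizer is a hypothetical stand-in written only to illustrate the 
behaviour, and it ignores escaped slashes and the rest of the query grammar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo entry point; nothing here is part of Lucene or the patch.
class SlashTokenizerDemo {
    public static void main(String[] args) {
        System.out.println(SlashTokenizer.tokenize("title:foo /bl[a-z]+ h*/ bar"));
    }
}

// Splits on whitespace, except that everything between two forward slashes
// (slashes included) is kept as a single, uninterpreted token.
final class SlashTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inSlashes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '/') {
                current.append(c);
                if (inSlashes) {
                    // Closing slash: emit the whole delimited span as one token.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                inSlashes = !inSlashes;
            } else if (!inSlashes && Character.isWhitespace(c)) {
                // Whitespace outside slashes ends the current token.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush any trailing token (including an unterminated slash span).
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

A token that starts and ends with '/' would then be handed to the registered 
parser extension rather than being interpreted by the grammar.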
