[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776884#action_12776884 ]
Adriano Crestani commented on LUCENE-2039:
------------------------------------------

This is a feature already suggested by Luis and Shai (and maybe others) before: the ability to delegate the syntax processing of a certain piece of the query string to another parser. It is a new feature for both the core QP and the contrib QP. So I think we should focus on how/when a query substring will be delegated to another parser, and not discuss how/when any logic will be applied to it; in both QPs, that part is already defined.

First, to identify this substring we need an open and a close token. It could be a double quote, a slash, or anything else. The ideal solution would allow the user to specify these two tokens. Unfortunately, I think JavaCC is not flexible enough to allow defining these tokens programmatically (after the parser is generated by JavaCC), so we need to stick with some specific open/close token; that is one decision we need to make. Maybe we could provide a property file where the user could specify the open/close tokens and regenerate the Lucene QP using 'ant javacc' (which is pretty easy today). Anyway, by default, we could use any new token. I don't agree with double quotes (as I think someone suggested), since they are already used by phrases; slash is fine for me, as already defined in Simon's patch.

Now, about any semantic (logic) processing performed on a query substring: it will be up to the QP implementation. In the core QP, its own extension would be responsible for this processing. In the contrib QP, the extension parser would only parse the substring and return a QueryNode, which will be processed later, after the syntax parsing is complete, by the query node processors. As I said before, this part is defined and I don't think we should discuss it in this topic. I like Simon's patch, and I think the same approach can be applied to the contrib QP.
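As a rough illustration of the delegation idea discussed above (all class and method names here are hypothetical sketches, not Lucene's actual API), everything between two delimiter chars could be handed verbatim to a pluggable extension:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension hook: receives the raw, uninterpreted substring
// found between the open/close delimiters.
interface ParserExtension {
    String parse(String raw);
}

// Minimal sketch of a parser that splits a query string into plain segments
// and segments delegated to the extension. The delimiter char stands in for
// the configurable open/close token discussed in the comment.
class ExtensionAwareParser {
    private final char delimiter;
    private final ParserExtension extension;

    ExtensionAwareParser(char delimiter, ParserExtension extension) {
        this.delimiter = delimiter;
        this.extension = extension;
    }

    List<String> parse(String query) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inDelegated = false;
        for (char c : query.toCharArray()) {
            if (c == delimiter) {
                if (inDelegated) {
                    // Closing delimiter: the extension sees the raw substring,
                    // special chars and whitespace included.
                    result.add(extension.parse(current.toString()));
                } else if (current.length() > 0) {
                    result.add(current.toString());
                }
                current.setLength(0);
                inDelegated = !inDelegated;
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }
}
```

For example, `new ExtensionAwareParser('/', raw -> "regex(" + raw + ")").parse("foo /a[bc]*/ bar")` would keep `foo ` and ` bar` untouched while the extension alone decides what `a[bc]*` means.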
The only part I disagree with is passing the field name to the extension parser. I wouldn't implement that in the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the field name, and it's completely related to the way the core QP implements the semantic (logic) processing. So, in the future, if the core QP needs to pass new info to its extension parser, the extension parser interface would have to be changed :S ...here I go again, starting a new discussion about how semantic (logic) processing should be handled :P

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser was limited to the queries living in core; adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar, you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. Eventually you had to live with all the limitations the current parser has, like tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome these limitations and add regex support to the query parser without introducing any dependency on core. I added a new special character that basically prevents the parser from interpreting any of the characters enclosed in the new special characters. I chose the forward slash '/' as the delimiter, so that everything in between two forward slashes is basically escaped and ignored by the parser.
> All chars embedded within forward slashes are treated as one token, even if it contains other special chars like * []?{} or whitespace. This token is subsequently passed to a pluggable "parser extension" which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full-featured query parser itself, or simply a ctor call for a regex query. The interface remains quite simple but makes the parser extensible in an easy way compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that would not be that much of a deal, as it is reflected in the escape method. It would truly be nice to have more than one extension and make this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of regex into core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ...
> }
> {code}
> which I would like better, as it would be more consistent with the idea of the query parser being a very strict and well-defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
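For illustration, the regex-in-core alternative quoted in the issue (a protected factory method that subclasses override, mirroring the parser's other newXxxQuery hooks) could look roughly like this minimal sketch. This is not Lucene's actual QueryParser; the class name and the use of a JDK `Pattern` as the stand-in "query" are assumptions made for the sketch:

```java
import java.util.regex.Pattern;

// Hypothetical miniature of the proposed hook: subclasses override
// newRegexQuery to plug in a different regex engine or query type,
// just as they override the other factory methods today.
class SketchQueryParser {
    protected Pattern newRegexQuery(String field, String regex) {
        // The sketch simply compiles a JDK pattern; a real parser would
        // build a Query object here.
        return Pattern.compile(regex);
    }

    boolean matches(String field, String regex, String value) {
        return newRegexQuery(field, regex).matcher(value).matches();
    }
}
```

The appeal of this shape, as the issue notes, is that the parser stays strict: the grammar is unchanged, and only the query construction step is customisable.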