[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

Luis Alves (JIRA) Wed, 11 Nov 2009 13:29:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776668#action_12776668
 ]


Luis Alves commented on LUCENE-2039:
------------------------------------

Hi Yonik,

{quote}
This almost seems more of an issue for core lucene developers - it's an 
annoyance that one needs to recompile the javacc grammar when just tweaking 
what one of the methods does. Seems like this could easily be solved by just 
separating into two files... the javacc grammar would have a base class that 
left things like getFieldQuery() unimplemented, and then the standard 
QueryParser (in a different java file) would override and implement those 
methods.
{quote}

This solution does not fix the problem of multiple syntaxes sharing the same 
Lucene processing code. For example, if you have one grammar in JavaCC and one 
in ANTLR, you can't use the Lucene QueryParser to process the output of both. 
You would need to re-implement the QueryParser's recursive logic in a different 
class to be able to use ANTLR.
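
To make the point concrete, here is a minimal, self-contained sketch. It is 
not Lucene code: the names FieldNode, SyntaxParser, ColonSyntax, AtSyntax and 
buildQuery are made up for illustration, and the two "syntaxes" are toy 
stand-ins for a JavaCC and an ANTLR grammar. The idea is only that both 
parsers emit the same syntax-independent node, so the processing logic is 
written once.

```java
import java.util.Locale;

public class SharedProcessingSketch {

    /** Syntax-independent representation of a single "field contains term" clause. */
    static final class FieldNode {
        final String field;
        final String term;

        FieldNode(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    /** A syntax only has to produce FieldNode; it knows nothing about query building. */
    interface SyntaxParser {
        FieldNode parse(String input);
    }

    /** Hypothetical syntax 1: "field:term". */
    static final class ColonSyntax implements SyntaxParser {
        public FieldNode parse(String input) {
            int i = input.indexOf(':');
            if (i < 0) {
                throw new IllegalArgumentException("expected field:term, got " + input);
            }
            return new FieldNode(input.substring(0, i), input.substring(i + 1));
        }
    }

    /** Hypothetical syntax 2: "term@field". */
    static final class AtSyntax implements SyntaxParser {
        public FieldNode parse(String input) {
            int i = input.lastIndexOf('@');
            if (i < 0) {
                throw new IllegalArgumentException("expected term@field, got " + input);
            }
            return new FieldNode(input.substring(i + 1), input.substring(0, i));
        }
    }

    /** Shared processing logic, written once; here it just renders a string. */
    static String buildQuery(FieldNode node) {
        return "TermQuery(" + node.field + "=" + node.term.toLowerCase(Locale.ROOT) + ")";
    }

    public static void main(String[] args) {
        // Both syntaxes go through the same buildQuery step.
        System.out.println(buildQuery(new ColonSyntax().parse("title:Lucene")));
        System.out.println(buildQuery(new AtSyntax().parse("Lucene@title")));
    }
}
```

With a single class that mixes grammar and processing, the second syntax 
would need its own copy of buildQuery.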

{quote}
It already is today via subclassing QueryParser and overriding methods like 
getFieldQuery... that's very simple for users to understand and to leverage.
{quote}

True. This is simple, but it is not customizable:
- You can't change the syntax.
- You can't reuse the QueryParser logic with other parsers.
- If you do have to change the syntax, you can't reuse the QueryParser class 
anymore; you need to maintain your own copy of the class.

You can read LUCENE-1567 to understand the reasons for the new queryparser.
Its focus is extensibility and customization without changing Lucene code,
while reusing Lucene logic as much as possible.

Take a look at TestSpanQueryParserSimpleSample in the queryparser contrib, or 
at the precedence query parser in LUCENE-1938.
They illustrate two cases that would be very difficult to implement in the 
current Lucene QueryParser by overriding methods.

Actually, an implementation of PrecedenceQueryParser exists today in 
contrib/misc. It contains a separate JavaCC grammar and does not share any 
code with the main Lucene QueryParser, and it illustrates the problems I 
described above (code duplication, impossible to reuse if the grammar is 
different, easily gets outdated when the core queryparser changes).

I'm not trying to say that the QueryParser in main is worse than the one in 
contrib.

What I'm trying to say is that the one in contrib is more modular, and if we 
build the modules for Lucene users, they will be able to build smarter and 
more sophisticated solutions using Lucene in less time.
Users can decide which modules to use in the queryparser and build their query 
pipelines with less work.
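
As a rough illustration of what "choosing modules" could look like, here is 
a toy pipeline. It is a sketch only: the Processor interface and the run 
method are hypothetical and are not the contrib queryparser API. The user 
picks which processing steps to plug in, and a fixed final step builds the 
query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PipelineSketch {

    /** One pluggable step that rewrites the parsed term (hypothetical interface). */
    interface Processor {
        String process(String term);
    }

    /** Runs the chosen processors in order, then applies a fixed builder step. */
    static String run(String parsedTerm, List<Processor> processors) {
        String current = parsedTerm;
        for (Processor p : processors) {
            current = p.process(current);
        }
        return "TermQuery(" + current + ")";
    }

    public static void main(String[] args) {
        List<Processor> pipeline = new ArrayList<>();
        pipeline.add(t -> t.toLowerCase(Locale.ROOT)); // lowercasing module
        pipeline.add(t -> t.replace("*", ""));         // module that drops wildcards
        System.out.println(run("Lucen*E", pipeline));
    }
}
```

Dropping or reordering a module changes the behaviour without touching the 
parser or the builder step.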

Users can also use the pre-built ones, like StandardQueryParser or 
PrecedenceQueryParser; these should be as easy to use as the old queryparser 
in main.



> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
>                 Key: LUCENE-2039
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2039
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>            Reporter: Simon Willnauer
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser has been limited to the 
> queries living in core; adding other queries or extending the parser in any 
> way has always forced people to change the grammar file and regenerate. Even 
> if you change the grammar you have to be extremely careful how you modify the 
> parser so that other parts of the standard parser are not affected by 
> customisation changes. Eventually you had to live with all the limitations 
> the current parser has, like tokenizing on whitespace before a tokenizer / 
> analyzer has the chance to look at the tokens. 
> I was thinking about how to overcome this limitation and add regex support to 
> the query parser without introducing any dependency to core. I added a new 
> special character that basically prevents the parser from interpreting any of 
> the characters enclosed in the new special characters. I chose the forward 
> slash '/' as the delimiter so that everything in between two forward slashes 
> is basically escaped and ignored by the parser. All chars embedded within 
> forward slashes are treated as one token even if it contains other special 
> chars like * []?{} or whitespace. This token is subsequently passed to a 
> pluggable "parser extension" which builds a query from the embedded string. I 
> do not interpret the embedded string in any way but leave all the subsequent 
> work to the parser extension. Such an extension could be another full-featured 
> query parser itself or simply a ctor call for a regex query. The 
> interface remains quite simple but makes the parser extensible in an easy way 
> compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char 
> into the syntax, but I guess that would not be that much of a deal as it is 
> reflected in the escape method. It would truly be nice to have more 
> than one extension and make this even more flexible, so treat this patch as a 
> kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK 
> version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
>   ... 
> }
> {code}
> which I would like better as it would be more consistent with the idea of the 
> query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based 
> approach; I guess I will add a second patch with regex in core soon too.
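
The slash-delimited token handling described in the issue text above can be 
sketched roughly as follows. This is not the attached patch: the 
ParserExtension interface, tokenize and toClause are hypothetical names, the 
"queries" are plain strings, and escaping of '/' is not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class SlashExtensionSketch {

    /** Hypothetical extension point: receives the raw text found between two slashes. */
    interface ParserExtension {
        String build(String field, String rawText);
    }

    /**
     * Splits the input on whitespace, except inside /.../ where every character
     * (whitespace and special chars included) stays part of one token.
     */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inSlashes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '/') {
                inSlashes = !inSlashes;
                current.append(c);
            } else if (Character.isWhitespace(c) && !inSlashes) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (inSlashes) {
            throw new IllegalArgumentException("unterminated '/' in: " + input);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Ordinary tokens become term clauses; /.../ tokens are delegated uninterpreted. */
    static String toClause(String field, String token, ParserExtension ext) {
        if (token.length() >= 2 && token.startsWith("/") && token.endsWith("/")) {
            return ext.build(field, token.substring(1, token.length() - 1));
        }
        return "TermQuery(" + field + ":" + token + ")";
    }

    public static void main(String[] args) {
        // The extension here just wraps the raw string; it could equally be a full parser.
        ParserExtension regex = (f, raw) -> "RegexQuery(" + f + ":" + raw + ")";
        for (String token : tokenize("foo /a b*[c]?/ bar")) {
            System.out.println(toClause("body", token, regex));
        }
    }
}
```

The tokenizer never looks inside the slashes; all interpretation of the 
embedded string is left to the extension.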

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
