[jira] [Commented] (JENA-1134) Support alternative QueryParsers in jena-text

ASF GitHub Bot (JIRA) Wed, 30 Mar 2016 01:37:16 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217659#comment-15217659
 ]


ASF GitHub Bot commented on JENA-1134:
--------------------------------------

GitHub user osma opened a pull request:

    https://github.com/apache/jena/pull/131

    JENA-1134: support AnalyzingQueryParser in jena-text

    This PR makes it possible to select either the standard Lucene QueryParser 
or the AnalyzingQueryParser using jena-text configuration like this:
    
    ```
    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:queryParser text:AnalyzingQueryParser ;
        text:queryAnalyzer [
            a text:ConfigurableAnalyzer ;
            text:tokenizer text:KeywordTokenizer ;
            text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
        ] ;
        text:entityMap <#entMap> .
    ```
    
    The main difference between these query parsers is that 
AnalyzingQueryParser also runs the analyzer on wildcard queries. For example, 
with ASCIIFoldingFilter configured as above, a search for `édu*` matches 
`éducation` only if AnalyzingQueryParser is used.
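
    To illustrate why the wildcard term needs analysis, here is a minimal Java 
sketch. It is not code from this PR and it does not use Lucene. The `fold` 
helper is a hypothetical, rough approximation of ASCIIFoldingFilter followed by 
LowerCaseFilter (it only strips combining marks), and `prefixMatches` is a 
hypothetical stand-in for expanding a trailing-`*` wildcard against terms that 
were already folded at index time.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration only: these helpers are NOT Lucene or jena-text APIs.
final class WildcardFoldingSketch {

    // Rough analogue of ASCIIFoldingFilter + LowerCaseFilter:
    // decompose to NFD, drop combining marks, then lowercase.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Expand a trailing-'*' wildcard against terms that were folded at index time.
    // analyzeQuery=false models a parser that leaves the wildcard term untouched;
    // analyzeQuery=true models a parser that folds the wildcard term first.
    static List<String> prefixMatches(String wildcard, List<String> indexedTerms,
                                      boolean analyzeQuery) {
        String prefix = wildcard.endsWith("*")
                ? wildcard.substring(0, wildcard.length() - 1)
                : wildcard;
        String effective = analyzeQuery ? fold(prefix) : prefix;
        return indexedTerms.stream()
                .filter(term -> term.startsWith(effective))
                .collect(Collectors.toList());
    }
}
```

    With an index holding the folded terms `education` and `edition`, calling 
`prefixMatches("édu*", index, false)` finds nothing, whereas passing `true` 
returns `education`.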
    
    One problem I had with the implementation is that the query parser needs to 
be constructed dynamically for every query, so I have to store which query 
parser to use rather than a QueryParser/AnalyzingQueryParser instance. I solved 
this by simply storing the type of query parser as a string, i.e. either 
`"QueryParser"` or `"AnalyzingQueryParser"`, and constructing the correct type 
of parser from that string at query time. I'm sure there are more elegant ways 
of doing this, e.g. creating a Factory for each parser type and storing the 
appropriate Factory, but I don't want to overengineer. Opinions?
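
    For comparison, the two options can be sketched side by side. This is a 
hedged illustration, not the code in this PR: `SketchParser` and 
`SketchAnalyzingParser` are hypothetical stand-ins for the Lucene parser 
classes so that the example compiles without Lucene, and the `field` and 
`analyzerName` parameters are assumed placeholders for whatever per-query 
arguments the real constructors take.

```java
import java.util.function.BiFunction;

// Hypothetical stand-ins for the two Lucene parser classes (NOT the real API).
class SketchParser {
    final String field;
    final String analyzerName;

    SketchParser(String field, String analyzerName) {
        this.field = field;
        this.analyzerName = analyzerName;
    }

    String describe() {
        return getClass().getSimpleName() + "(" + field + ", " + analyzerName + ")";
    }
}

class SketchAnalyzingParser extends SketchParser {
    SketchAnalyzingParser(String field, String analyzerName) {
        super(field, analyzerName);
    }
}

final class ParserSelectionSketch {

    // String approach: the config keeps only the type name, and every query
    // branches on it to build a fresh parser.
    static SketchParser fromTypeName(String typeName, String field, String analyzerName) {
        switch (typeName) {
            case "QueryParser":
                return new SketchParser(field, analyzerName);
            case "AnalyzingQueryParser":
                return new SketchAnalyzingParser(field, analyzerName);
            default:
                throw new IllegalArgumentException("Unknown query parser type: " + typeName);
        }
    }

    // Factory approach: the name is resolved once, when the config is read;
    // the stored factory then builds a fresh parser for each query.
    static BiFunction<String, String, SketchParser> factoryFor(String typeName) {
        switch (typeName) {
            case "QueryParser":
                return SketchParser::new;
            case "AnalyzingQueryParser":
                return SketchAnalyzingParser::new;
            default:
                throw new IllegalArgumentException("Unknown query parser type: " + typeName);
        }
    }
}
```

    Both variants create a new parser per query. The factory variant only moves 
the name lookup, and the failure for an unknown name, to configuration time.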
    
    This could rather easily be extended to other query parser types supported 
by Lucene, though I'm unsure how useful that would be in practice. 
ComplexPhraseQueryParser and/or PrecedenceQueryParser could perhaps be useful 
to somebody.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/osma/jena jena-text-queryparser

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #131
    
----
commit 547d4ac64e4331e45cba96e045345a5f3ab214a7
Author: Osma Suominen <[email protected]>
Date:   2016-03-29T14:23:27Z

    simplify parseQuery and preParseQuery: get rid of primaryField argument as 
it is always the same

commit 22a81f8cbc9498cbe4f1970115aa32f9c21fb239
Author: Osma Suominen <[email protected]>
Date:   2016-03-30T08:09:02Z

    JENA-1134: basic support for AnalyzingQueryParser

----


> Support alternative QueryParsers in jena-text
> ---------------------------------------------
>
>                 Key: JENA-1134
>                 URL: https://issues.apache.org/jira/browse/JENA-1134
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Text
>            Reporter: Osma Suominen
>            Assignee: Osma Suominen
>
> Jena-text is currently hardwired to use Lucene QueryParser. This parser is 
> (intentionally) limited so that it doesn't analyze wildcard queries. Instead 
> they will be expanded directly.
> This is a problem if you want to do accent-insensitive wildcard queries 
> (using ASCIIFoldingFilter) or other wildcard queries which rely on a special 
> analyzer. However, Lucene offers an alternate parser, AnalyzingQueryParser, 
> that could be used in such cases.
> I'd like to extend jena-text with a configuration parameter that allows using 
> AnalyzingQueryParser instead of the standard QueryParser. For example, the 
> configuration could look like this:
> {noformat}
> <#indexLucene> a text:TextIndexLucene ;
>     text:directory <file:Lucene> ;
>     text:queryParser text:AnalyzingQueryParser ;
>     text:queryAnalyzer [
>         a text:ConfigurableAnalyzer ;
>         text:tokenizer text:KeywordTokenizer ;
>         text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>     ] ;
>     text:entityMap <#entMap> .
> {noformat}
> I've written some very preliminary code to implement this, but I'm not yet 
> satisfied with it. It's a bit problematic because the parser cannot be 
> constructed in advance but must be dynamically created separately for each 
> query (because it needs parameters that can differ between queries). 
> Thus the TextIndexConfig must store information about which parser variant to 
> use, but not the actual QueryParser/AnalyzingQueryParser instance. This isn't 
> rocket science though, maybe some kind of Factory pattern would work.
> For background on why this is needed, see this Skosmos issue:
> https://github.com/NatLibFi/Skosmos/issues/424



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
