[ https://issues.apache.org/jira/browse/JENA-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401021#comment-16401021 ]

Code Ferret edited comment on JENA-1506 at 3/16/18 1:56 PM:
------------------------------------------------------------

The [PR|https://github.com/apache/jena/pull/385] extends the {{text:defineAnalyzers}} feature to allow tokenizers and filters to be defined and then referenced in the definition of a {{text:ConfigurableAnalyzer}}. The parameter types supported are those currently documented for {{GenericAnalyzer}}.

While the current definition of {{SelectiveFoldingFilter}} uses a {{List<Character>}} parameter, I am hoping that it can be replaced by {{org.apache.lucene.analysis.CharArraySet}}, which is one of the currently supported parameter types.

The definition of a {{SelectiveFoldingFilter}} would then look like:
{code:java}
      text:defineAnalyzers (
          [ text:defineFilter <#bar> ;
            text:filter [
                a text:GenericFilter ;
                text:class "org.apache.jena.query.text.filter.SelectiveFoldingFilter" ;
                text:params (
                     [ text:paramName "whitelisted" ;
                       text:paramType text:TypeSet ;
                       text:paramValue ("ç") ]
                     )
                ]
          ]
      )
{code}
The following is a more complete example of the full set of features:
{code:java}
:indexLucene
    a text:TextIndexLucene ;
    text:directory "mem" ;
    text:storeValues true ;
    text:analyzer [
         a text:DefinedAnalyzer ;
         text:useAnalyzer :configuredAnalyzer ] ;
    text:defineAnalyzers (
         [ text:defineAnalyzer :configuredAnalyzer ;
           text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer :ngram ;
                text:filters ( :asciiff text:LowerCaseFilter ) ] ]
         [ text:defineTokenizer :ngram ;
           text:tokenizer [
                a text:GenericTokenizer ;
                text:class "org.apache.lucene.analysis.ngram.NGramTokenizer" ;
                text:params (
                     [ text:paramName "minGram" ;
                       text:paramType text:TypeInt ;
                       text:paramValue 3 ]
                     [ text:paramName "maxGram" ;
                       text:paramType text:TypeInt ;
                       text:paramValue 7 ]
                     ) ] ]
         [ text:defineFilter :asciiff ;
           text:filter [
                a text:GenericFilter ;
                 text:class "org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter" ;
                text:params (
                     [ text:paramName "preserveOriginal" ;
                       text:paramType text:TypeBoolean ;
                       text:paramValue true ]
                     ) ] ]
         ) ;
    text:entityMap :entMap ;
    .
{code}
The example illustrates defining a {{ConfigurableAnalyzer}}, {{:configuredAnalyzer}}, that uses the definitions of an {{NGramTokenizer}}, {{:ngram}}, and an {{ASCIIFoldingFilter}}, {{:asciiff}}. In turn, {{:configuredAnalyzer}} is used as the default analyzer during indexing.
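
As a usage sketch only (it assumes the configuration above is saved as {{text-config.ttl}} and completed with a {{text:TextDataset}} resource, here called {{:text_dataset}}, plus an entity map over {{rdfs:label}}; those names are illustrative, not part of the PR), a dataset assembled this way could be queried along these lines:
{code:java}
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

public class TextQuerySketch {
    public static void main(String[] args) {
        // Assemble the dataset from the (hypothetical) full assembler file.
        Dataset ds = DatasetFactory.assemble("text-config.ttl",
                "http://example.org/test#text_dataset");

        // Search the Lucene index built with :configuredAnalyzer.
        String q = String.join("\n",
                "PREFIX text: <http://jena.apache.org/text#>",
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
                "SELECT ?s WHERE { ?s text:query (rdfs:label 'gram') }");

        try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
{code}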

The implementation ensures that tokenizer and filter definitions are processed before analyzer definitions, after which the analyzer and query analyzer statements are processed. This permits using analyzer definitions in the {{text:TextIndexLucene}} section as well as in the entity map.

For {{Filter}}s, the {{TokenStream}} parameter is supplied dynamically as the first constructor argument _behind the scenes_, so it is not included in the {{text:params}} list; this is why no type for {{TokenStream}} is defined.
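
A minimal sketch of this idea (not the actual assembler code) is to prepend the {{TokenStream}} to the declared parameter types and values before resolving the filter constructor reflectively:
{code:java}
import java.lang.reflect.Constructor;

import org.apache.lucene.analysis.TokenStream;

// Illustration only: prepend the TokenStream to the parameters declared in
// text:params, then look up and invoke the matching filter constructor.
public class FilterConstructionSketch {
    public static Object newFilter(Class<?> filterClass, TokenStream input,
            Class<?>[] paramTypes, Object[] paramValues) throws Exception {
        Class<?>[] types = new Class<?>[paramTypes.length + 1];
        Object[] args = new Object[paramValues.length + 1];
        types[0] = TokenStream.class;
        args[0] = input;
        System.arraycopy(paramTypes, 0, types, 1, paramTypes.length);
        System.arraycopy(paramValues, 0, args, 1, paramValues.length);

        Constructor<?> ctor = filterClass.getConstructor(types);
        return ctor.newInstance(args);
    }
}
{code}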

I have included a {{TestTextDefineAnalyzers}} that exercises the assembler machinery and ensures that a usable dataset results. There is already a suite of tests associated with {{TestDatasetWithConfigurableAnalyzer}} that exercises the new logic for identifying tokenizers and filters and instantiating them, so this part of the extension is tested there.

I'll update the documentation once the PR is completed.


> Add configurable filters and tokenizers
> ---------------------------------------
>
>                 Key: JENA-1506
>                 URL: https://issues.apache.org/jira/browse/JENA-1506
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Text
>    Affects Versions: Jena 3.7.0
>            Reporter: Code Ferret
>            Priority: Major
>
> In support of [JENA-1488|https://issues.apache.org/jira/browse/JENA-1488], 
> this issue proposes to add a feature to allow including defined filters and 
> tokenizers, similar to {{DefinedAnalyzer}}, for the {{ConfigurableAnalyzer}}, 
> allowing configurable arguments such as {{excludeChars}}. I've looked at 
> {{ConfigurableAnalyzer}} and its assembler and it should be straightforward.
> I would add tokenizer and filter definitions to {{TextIndexLucene}} similar 
> to the support for adding analyzers:
> {code:java}
>     text:defineFilters (
>         [ text:defineFilter <#foo> ; 
>           text:filter [ 
>             a text:GenericFilter ;
>             text:class "fi.finto.FoldingFilter" ;
>             text:params (
>                 [ text:paramName "excludeChars" ;
>                   text:paramType text:TypeString ; 
>                   text:paramValue "whatevercharstoexclude" ]
>                 )
>             ] ; 
>           ]
>       )
> {code}
> {{GenericFilterAssembler}} and {{GenericTokenizerAssembler}} would make use of 
> much of the code in {{GenericAnalyzerAssembler}}. The changes to 
> {{ConfigurableAnalyzer}} and {{ConfigurableAnalyzerAssembler}} are 
> straightforward and mostly involve retaining the resource URI rather than 
> extracting the localName.
> Such an addition will make it easy to create new tokenizers and filters that 
> can be dropped in simply by adding the classes to the Jena/Fuseki classpath, 
> or by referring to ones already included in Jena (via Lucene or otherwise), 
> and putting the appropriate assembler statements in the configuration.


