[jira] [Updated] (JENA-1326) Generic Lucene Analyzers

Code Ferret (JIRA) Sun, 23 Apr 2017 09:07:17 -0700

     [ https://issues.apache.org/jira/browse/JENA-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Code Ferret updated JENA-1326:
------------------------------
    Description: 
Some analyzers in the Lucene distribution bundled with jena-text cannot 
currently be referenced in assembler configurations. Also, many analyzers 
provide constructors that accept parameters, such as stop word sets and stem 
exclusion sets, that are not currently supported. Finally, analyzers that are 
not part of the Lucene distribution may be needed, and there is currently no 
way to refer to them without modifying the jena-text source.

This issue proposes the addition of a jena-text assembler configuration feature 
to permit the specification of generic Lucene Analyzers given a fully qualified 
Class name and a list of parameters for a constructor of the Class and to allow 
the naming of such specifications for use in the Multilingual feature and use 
in other {{text:analyzer}} specifications.

A {{text:GenericAnalyzer}} specification is similar to other {{text:analyzer}} 
specifications:

{noformat}
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
               text:params (
                    [ text:paramName "stopwords" ;
                      text:paramType text:TypeSet ;
                      text:paramValue ("the" "a" "an") ]
                    [ text:paramName "stemExclusionSet" ;
                      text:paramType text:TypeSet ;
                      text:paramValue ("ing" "ed") ]
                    )
           ] .
{noformat}

The {{text:class}} is the fully qualified class name of the desired analyzer.

The parameters may be of the following types:

|| Type || Description ||
|{{text:TypeAnalyzer}}|a subclass of {{org.apache.lucene.analysis.Analyzer}}|
|{{text:TypeBoolean}}|a Java {{boolean}}|
|{{text:TypeFile}}|the {{String}} path to a file materialized as a {{java.io.FileReader}}|
|{{text:TypeInt}}|a Java {{int}}|
|{{text:TypeString}}|a Java {{String}}|
|{{text:TypeSet}}|an {{org.apache.lucene.analysis.CharArraySet}}|
 
Although the list of types is not exhaustive, it is a simple matter to create a 
wrapper Analyzer that reads a file containing whatever information is needed 
to initialize the parameters of a given Analyzer. The provided types cover the 
most common cases.

For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a 
constructor with 4 parameters: a {{UserDictionary}}, a 
{{JapaneseTokenizer.Mode}}, a {{CharArraySet}}, and a {{Set<String>}}. A simple 
wrapper can extract the values for the parameters whose types are not 
available in this extension, construct the required instances, and instantiate 
the {{JapaneseAnalyzer}}.
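
As a rough sketch of how such a wrapper might be configured, assume a 
hypothetical class {{org.example.analysis.JapaneseAnalyzerWrapper}} (not part 
of Lucene or Jena) whose single-argument constructor reads its settings from a 
file. The class name, parameter name, and file path below are illustrative 
assumptions only:

{noformat}
           text:analyzer [
               a text:GenericAnalyzer ;
               # hypothetical wrapper class, assumed to be on the Jena classpath
               text:class "org.example.analysis.JapaneseAnalyzerWrapper" ;
               text:params (
                    # hypothetical settings file read by the wrapper
                    [ text:paramName "settings" ;
                      text:paramType text:TypeFile ;
                      text:paramValue "conf/japanese-analyzer.properties" ]
                    )
           ] .
{noformat}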

Adding custom Analyzers, such as the wrapper analyzer above, is a simple matter 
of adding the Analyzer class and any associated filters, tokenizers, and so on 
to the classpath for Jena. All of the Analyzers included in the Lucene 
distribution bundled with Jena are available as well.

Each parameter object is specified with:

- an optional {{text:paramName}} that may be used to document which parameter is represented
- a {{text:paramType}}, which is one of {{text:TypeAnalyzer}}, {{text:TypeBoolean}}, {{text:TypeFile}}, {{text:TypeInt}}, {{text:TypeSet}}, or {{text:TypeString}}
- a {{text:paramValue}}, which is an {{xsd:string}}, {{xsd:boolean}}, {{xsd:int}}, or {{Resource}}

A {{text:TypeSet}} parameter _may have_ zero or more {{text:paramValue}}.

All other parameter types _must have_ a single {{text:paramValue}} of the 
appropriate type.
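
To illustrate these rules, the following sketch passes one populated set and 
one empty set to the two-argument 
{{org.apache.lucene.analysis.de.GermanAnalyzer}} constructor (stop words, stem 
exclusion set). The stop words are illustrative, and writing the empty set as 
an empty Turtle list is an assumption about how "zero values" is expressed:

{noformat}
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.de.GermanAnalyzer" ;
               text:params (
                    # a TypeSet with several values
                    [ text:paramName "stopwords" ;
                      text:paramType text:TypeSet ;
                      text:paramValue ("der" "die" "das") ]
                    # a TypeSet with zero values (assumed syntax: empty list)
                    [ text:paramName "stemExclusionSet" ;
                      text:paramType text:TypeSet ;
                      text:paramValue () ]
                    )
           ] .
{noformat}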

An example configuration using the {{ShingleAnalyzerWrapper}} is:

{noformat}
    text:map (
         [ text:field "text" ;
           text:predicate rdfs:label ;
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
               text:params (
                    [ text:paramName "defaultAnalyzer" ;
                      text:paramType text:TypeAnalyzer ;
                      text:paramValue [ a text:SimpleAnalyzer ] ]
                    [ text:paramName "maxShingleSize" ;
                      text:paramType text:TypeInt ;
                      text:paramValue 3 ]
                    )
           ] ]
         ) .
{noformat}
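
{{ShingleAnalyzerWrapper}} also has a constructor that takes a delegate 
analyzer, a minimum shingle size, and a maximum shingle size. Assuming that the 
parameter list is matched in order against a constructor signature, a sketch 
that selects that constructor lists three parameters. The {{text:paramName}} 
values serve only as documentation:

{noformat}
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
               text:params (
                    [ text:paramName "delegate" ;
                      text:paramType text:TypeAnalyzer ;
                      text:paramValue [ a text:SimpleAnalyzer ] ]
                    # order matters here: minimum first, then maximum
                    [ text:paramName "minShingleSize" ;
                      text:paramType text:TypeInt ;
                      text:paramValue 2 ]
                    [ text:paramName "maxShingleSize" ;
                      text:paramType text:TypeInt ;
                      text:paramValue 3 ]
                    )
           ] .
{noformat}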

The {{text:defineAnalyzers}} feature allows the Multilingual support to be 
extended. It can also be used to name analyzers defined via 
{{text:GenericAnalyzer}} so that a single (perhaps complex) analyzer 
configuration can be used in several places.

{{text:defineAnalyzers}} is used with {{text:TextIndexLucene}} to provide a 
list of analyzer definitions:

{noformat}
    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:entityMap <#entMap> ;
        text:defineAnalyzers (
            [ text:addLang "sa-x-iast" ;
              text:analyzer [ . . . ] ]
            [ text:defineAnalyzer <#foo> ;
              text:analyzer [ . . . ] ]
        )
        .
{noformat}

A defined analyzer may be referenced in the entity map as follows:

{noformat}
    text:analyzer [
        a text:DefinedAnalyzer ;
        text:useAnalyzer <#foo> ]
{noformat}

Multilingual support currently allows a fixed set of ISO 2-letter codes to be 
used to select from among built-in analyzers, using the nullary constructor 
associated with each analyzer. If one wants to:

* use a language not included, e.g., Brazilian; or
* use additional constructors defining stop words, stem exclusions and so on; or
* refer to custom analyzers that might be associated with generalized BCP-47 language tags, such as {{sa-x-iast}} for Sanskrit in the IAST transliteration,

then {{text:defineAnalyzers}} with {{text:addLang}} will add the desired 
analyzers to the multilingual support so that fields with the appropriate 
language tags will use the appropriate custom analyzer.
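
As an illustration of the first two cases, the following sketch registers 
{{org.apache.lucene.analysis.br.BrazilianAnalyzer}} with a custom stop word 
set. The language tag {{pt-br}} and the stop words are illustrative 
assumptions, and the remaining index properties are omitted:

{noformat}
        text:defineAnalyzers (
            # illustrative language tag; use whatever tag the data carries
            [ text:addLang "pt-br" ;
              text:analyzer [
                  a text:GenericAnalyzer ;
                  text:class "org.apache.lucene.analysis.br.BrazilianAnalyzer" ;
                  text:params (
                       [ text:paramName "stopwords" ;
                         text:paramType text:TypeSet ;
                         text:paramValue ("de" "a" "o") ]
                       )
              ] ]
        )
{noformat}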

When {{text:defineAnalyzers}} is used with {{text:addLang}}, 
{{text:multilingualSupport}} is implicitly added if not already specified, and 
a warning is put in the log. For example:

{noformat}
        text:defineAnalyzers (
            [ text:addLang "sa-x-iast" ;
              text:analyzer [ . . . ] ]
        )
{noformat}

This adds an analyzer to be used when the {{text:langField}} has the value 
{{sa-x-iast}} during indexing and search.

Repeating a {{text:GenericAnalyzer}} specification for multiple fields in an 
entity map can be cumbersome. {{text:defineAnalyzer}} is used in an element of 
a {{text:defineAnalyzers}} list to associate a resource with an analyzer so 
that it may be referred to later in a {{text:analyzer}} object. Assuming that 
an analyzer definition such as the following appears in the 
{{text:defineAnalyzers}} list:

{noformat}
    [ text:defineAnalyzer <#foo> ;
      text:analyzer [ . . . ] ]
{noformat}
      
then a {{text:analyzer}} specification in an entity map, for example, can 
refer to analyzer {{<#foo>}} via:

{noformat}
    text:map (
         [ text:field "text" ;
           text:predicate rdfs:label ;
           text:analyzer [
               a text:DefinedAnalyzer ;
               text:useAnalyzer <#foo> ] ]
         ) .
{noformat}

This makes it straightforward to refer to the same (possibly complex) analyzer 
definition in multiple fields.
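
Putting the pieces together, a sketch of one definition reused by two fields 
might look like the following. The resource {{<#englishCustom>}}, the field 
names, and the choice of predicates are illustrative assumptions, and the 
usual prefix declarations are omitted:

{noformat}
    <#indexLucene> a text:TextIndexLucene ;
        text:directory <file:Lucene> ;
        text:entityMap <#entMap> ;
        text:defineAnalyzers (
            # define the analyzer once and give it a name
            [ text:defineAnalyzer <#englishCustom> ;
              text:analyzer [
                  a text:GenericAnalyzer ;
                  text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
                  text:params (
                       [ text:paramName "stopwords" ;
                         text:paramType text:TypeSet ;
                         text:paramValue ("the" "a" "an") ]
                       )
              ] ]
        )
        .

    <#entMap> a text:EntityMap ;
        text:entityField "uri" ;
        text:defaultField "label" ;
        text:map (
             # both fields refer to the same named definition
             [ text:field "label" ;
               text:predicate rdfs:label ;
               text:analyzer [
                   a text:DefinedAnalyzer ;
                   text:useAnalyzer <#englishCustom> ] ]
             [ text:field "comment" ;
               text:predicate rdfs:comment ;
               text:analyzer [
                   a text:DefinedAnalyzer ;
                   text:useAnalyzer <#englishCustom> ] ]
             ) .
{noformat}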


  was:
This issue proposes the addition of a jena-text assembler configuration feature 
to permit the specification of generic Lucene Analyzers given a fully qualified 
Class name and a list of parameters for a constructor of the Class.

The parameters may be of the following types:

{noformat}
    string        String
    set           org.apache.lucene.analysis.util.CharArraySet
    file          java.io.FileReader
    int           int
    boolean       boolean
{noformat}
 
Although the list of types is not exhaustive it is a simple matter to create a 
wrapper Analyzer that reads a file with information that can be used to 
initialize any sort of parameters that may be needed for a given Analyzer. The 
provided types cover the most common cases.

For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a 
constructor with 4 parameters: a {{UserDict}}, a {{CharArraySet}}, a 
{{JapaneseTokenizer.Mode}}, and a {{Set<String>}}. So a simple wrapper can 
extract the values needed for the various parameters with types not available 
in this extension, construct the required instances, and instantiate the 
{{JapaneseAnalyzer}}.

Adding custom Analyzers such as the above wrapper analyzer is a simple matter 
of adding the Analyzer class and any associated filters and tokenizer and so on 
to the classpath for Jena. Also, all of the Analyzers that are included in the 
Lucene distribution bundled with Jena are available as well.

Each parameter object is specified with:

- an optional {{text:paramName}} that may be used to document which parameter 
is represented
- a {{text:paramType}} which is one of: {{string, set, file, int, boolean}}.
- a {{text:paramValue}} which is an {{xsd:string}}, {{xsd:boolean}} or 
{{xsd:int}}.

A parameter of type {{set}} _may have_ zero or more {{text:paramValue}}.

A parameter of type {{string, file, boolean}}, or {{int}} _must have_ a single 
{{text:paramValue}}.

An example Analyzer configuration would look like:

{noformat}
    text:map (
         [ text:field "text" ; 
           text:predicate rdfs:label;
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
               text:params (
                    [ text:paramName "stopwords" ;
                      text:paramType "set" ;
                      text:paramValue ("the" "a" "an") ] ;
                    [ text:paramName "stemExclusionSet" ;
                      text:paramType "set" ;
                      text:paramValue ("ing" "ed") ]
                    )
           ]  .
. . .
{noformat}


> Generic Lucene Analyzers
> ------------------------
>
>                 Key: JENA-1326
>                 URL: https://issues.apache.org/jira/browse/JENA-1326
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Jena, Text
>    Affects Versions: Jena 3.2.0
>            Reporter: Code Ferret
>              Labels: Lucene, analyzers, jena-text
>             Fix For: Jena 3.3.0
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
