[jira] [Updated] (JENA-1326) Generic Lucene Analyzers

Code Ferret (JIRA) Thu, 20 Apr 2017 11:36:30 -0700

     [ 
https://issues.apache.org/jira/browse/JENA-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Code Ferret updated JENA-1326:
------------------------------
    Description: 
This issue proposes a jena-text assembler configuration feature that lets a 
generic Lucene Analyzer be specified by its fully qualified class name and a 
list of parameters for one of the class's constructors.

The parameters may be of the following types:

{noformat}
    string        String
    set           org.apache.lucene.analysis.util.CharArraySet
    file          java.io.FileReader
    int           int
    boolean       boolean
{noformat}
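
To make the mapping above concrete, here is a minimal sketch of how a class name and a list of typed parameters could be turned into an instance by reflection. It is not the jena-text implementation: {{GenericInstantiator}}, {{Param}}, {{toParam}} and {{instantiate}} are hypothetical names, and {{java.util.Set}} stands in for {{CharArraySet}} so that the sketch needs only the JDK.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.lang.reflect.Constructor;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not the jena-text implementation.
final class GenericInstantiator {

    // One constructor argument: the declared parameter type and the value.
    static final class Param {
        final Class<?> type;
        final Object value;

        Param(Class<?> type, Object value) {
            this.type = type;
            this.value = value;
        }
    }

    // Converts a paramType keyword plus its raw string values to a typed argument.
    static Param toParam(String paramType, List<String> values) {
        switch (paramType) {
            case "string":
                return new Param(String.class, values.get(0));
            case "int":
                return new Param(int.class, Integer.parseInt(values.get(0)));
            case "boolean":
                return new Param(boolean.class, Boolean.parseBoolean(values.get(0)));
            case "file":
                try {
                    return new Param(FileReader.class, new FileReader(values.get(0)));
                } catch (FileNotFoundException e) {
                    throw new IllegalArgumentException("Cannot open " + values.get(0), e);
                }
            case "set":
                // Assumption: java.util.Set stands in for Lucene's CharArraySet here.
                return new Param(Set.class, new LinkedHashSet<>(values));
            default:
                throw new IllegalArgumentException("Unknown paramType: " + paramType);
        }
    }

    // Finds the public constructor whose declared parameter types match exactly, then calls it.
    static Object instantiate(String className, List<Param> params) {
        Class<?>[] types = new Class<?>[params.size()];
        Object[] args = new Object[params.size()];
        for (int i = 0; i < params.size(); i++) {
            types[i] = params.get(i).type;
            args[i] = params.get(i).value;
        }
        try {
            Constructor<?> ctor = Class.forName(className).getConstructor(types);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }
}
```

Because {{getConstructor}} matches declared parameter types exactly, the order and types of the listed parameters have to line up with one public constructor of the target class. For instance, calling {{instantiate}} with {{java.lang.StringBuilder}} and a single {{string}} parameter selects the {{StringBuilder(String)}} constructor.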
 
Although the list of types is not exhaustive, it is straightforward to create 
a wrapper Analyzer that reads a file containing whatever information is needed 
to initialize the parameters of a given Analyzer. The provided types cover the 
most common cases.

For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a 
constructor with 4 parameters: a {{UserDictionary}}, a 
{{JapaneseTokenizer.Mode}}, a {{CharArraySet}}, and a {{Set<String>}}. A simple 
wrapper can extract the values for the parameters whose types are not available 
in this extension, construct the required instances, and instantiate the 
{{JapaneseAnalyzer}}.
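
As a rough sketch of the file-reading half of such a wrapper, the class below loads settings from a properties-style file. Everything here is an assumption made for illustration: the {{WrapperSettings}} class, the key names {{mode}}, {{stopwords}} and {{stoptags}}, the comma-separated value format, and the {{SEARCH}} default. A real wrapper would still have to convert these values into the Lucene types and call the {{JapaneseAnalyzer}} constructor, which is omitted so the sketch needs only the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical settings holder for a wrapper Analyzer; all key names are assumptions.
final class WrapperSettings {
    final String mode;            // intended to name a JapaneseTokenizer.Mode constant
    final Set<String> stopwords;  // intended to populate a CharArraySet
    final Set<String> stoptags;   // intended for the Set<String> parameter

    private WrapperSettings(String mode, Set<String> stopwords, Set<String> stoptags) {
        this.mode = mode;
        this.stopwords = stopwords;
        this.stoptags = stoptags;
    }

    // Reads key=value settings; missing keys fall back to a default mode and empty sets.
    static WrapperSettings load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new WrapperSettings(
                props.getProperty("mode", "SEARCH").trim(),
                splitToSet(props.getProperty("stopwords", "")),
                splitToSet(props.getProperty("stoptags", "")));
    }

    // Splits a comma-separated value into trimmed, non-empty entries, keeping order.
    private static Set<String> splitToSet(String csv) {
        Set<String> out = new LinkedHashSet<>();
        for (String item : csv.split(",")) {
            String trimmed = item.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

With this approach the wrapper itself would need only a constructor taking a single {{file}} parameter, which the proposed configuration can already express.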

Adding a custom Analyzer, such as the wrapper described above, only requires 
adding the Analyzer class and any associated filters, tokenizers, and so on to 
the classpath for Jena. All of the Analyzers included in the Lucene 
distribution bundled with Jena are available as well.

Each parameter object is specified with:

- an optional {{text:paramName}} that may be used to document which parameter 
is represented;
- a {{text:paramType}}, which is one of {{string}}, {{set}}, {{file}}, 
{{int}}, {{boolean}};
- a {{text:paramValue}}, which is an {{xsd:string}}, {{xsd:boolean}} or 
{{xsd:int}}.

A parameter of type {{set}} _may have_ zero or more {{text:paramValue}}.

A parameter of type {{string}}, {{file}}, {{boolean}}, or {{int}} _must have_ 
a single {{text:paramValue}}.
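
The cardinality rules above could be checked along the following lines. This is only a sketch under assumptions: {{ParamSpecCheck}} and {{isValid}} are hypothetical names, values are taken as plain strings, and apart from {{int}} parsing the lexical form of each value is not checked.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical validator for the parameter rules; not part of jena-text.
final class ParamSpecCheck {
    private static final Set<String> TYPES =
            new HashSet<>(Arrays.asList("string", "set", "file", "int", "boolean"));

    // Returns true when the values satisfy the rule for the given paramType.
    static boolean isValid(String paramType, List<String> values) {
        if (paramType == null || !TYPES.contains(paramType)) {
            return false; // not one of the five supported types
        }
        if (paramType.equals("set")) {
            return true; // a set may have zero or more values
        }
        if (values.size() != 1) {
            return false; // every other type needs exactly one value
        }
        if (paramType.equals("int")) {
            try {
                Integer.parseInt(values.get(0));
            } catch (NumberFormatException e) {
                return false; // the single value must parse as an int
            }
        }
        return true;
    }
}
```

Under these rules an empty {{set}} is accepted, while an {{int}} with no value or a {{string}} with two values is rejected.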

An example Analyzer configuration would look like:

{noformat}
    text:map (
         [ text:field "text" ; 
           text:predicate rdfs:label;
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
               text:params (
                    [ text:paramName "stopwords" ;
                      text:paramType "set" ;
                      text:paramValue ("the" "a" "an") ]
                    [ text:paramName "stemExclusionSet" ;
                      text:paramType "set" ;
                      text:paramValue ("ing" "ed") ]
                    )
           ]
         ]
. . .
{noformat}

  was:
This issue proposes the addition of a jena-text assembler configuration feature 
to permit the specification of generic Lucene Analyzers given a fully qualified 
Class name and a list of parameters for a constructor of the Class.

The parameters may be of the following types:

{noformat}
    string        String
    set           org.apache.lucene.analysis.util.CharArraySet
    file          java.io.FileReader
    int           int
    boolean       boolean
{noformat}
 
Although the list of types is not exhaustive it is a simple matter to create a 
wrapper Analyzer that reads a file with information that can be used to 
initialize any sort of parameters that may be needed for a given Analyzer. The 
provided types cover the most common cases.

For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a 
constructor with 4 parameters: a {{UserDict}}, a {{CharArraySet}}, a 
{{JapaneseTokenizer.Mode}}, and a {{Set<String>}}. So a simple wrapper can 
extract the values needed for the various parameters with types not available 
in this extension, construct the required instances, and instantiate the 
{{JapaneseAnalyzer}}.

Adding custom Analyzers such as the above wrapper analyzer is a simple matter 
of adding the Analyzer class and any associated filters and tokenizer and so on 
to the classpath for Jena. Also, all of the Analyzers that are included in the 
Lucene distribution bundled with Jena are available as well.

Each parameter object is specified with:

- an optional {{text:paramName}} that may be used to document which parameter 
is represented
- a {{text:paramType}} which is one of: {{string, set, file, int, boolean}}.
- a {{text:paramValue}} which is an {{xsd:string}}, {{xsd:boolean}} or 
{{xsd:int}}.

A parameter of type {{set}} _may have_ zero or more {{text:paramValue}}.

A parameter of type {{string, file, boolean}}, or {{int}} _must have_ a single 
{{text:paramValue}}.

An example Analyzer configuration would look like:

{noformat}
    text:map (
         [ text:field "text" ; 
           text:predicate rdfs:label;
           text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
               text:params [
                    a rdf:seq ;
                    rdf:_1 [
                        text:paramName "stopwords" ;
                        text:paramType "set" ;
                        text:paramValue ("the" "a" "an") ] ;
                    rdf:_2 [
                        text:paramName "stemExclusionSet" ;
                        text:paramType "set" ;
                        text:paramValue ("ing" "ed") ]
                    ]
                ]
          ] .
. . .
{noformat}


> Generic Lucene Analyzers
> ------------------------
>
>                 Key: JENA-1326
>                 URL: https://issues.apache.org/jira/browse/JENA-1326
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Jena, Text
>    Affects Versions: Jena 3.2.0
>            Reporter: Code Ferret
>              Labels: Lucene, analyzers, jena-text



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
