[jira] [Commented] (JENA-1326) Generic Lucene Analyzers

ASF GitHub Bot (JIRA) Sun, 23 Apr 2017 09:29:39 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980456#comment-15980456
 ]


ASF GitHub Bot commented on JENA-1326:
--------------------------------------

GitHub user xristy opened a pull request:

    https://github.com/apache/jena/pull/246

    Generic text analyzers

    This PR implements JENA-1326 and includes some tests.
    
    Updates to [text-query.mdtext.zip](https://github.com/apache/jena/files/949831/text-query.mdtext.zip) are attached.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BuddhistDigitalResourceCenter/jena generic-text-analyzers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #246
    
----
commit 1440e81d75ee01baf874c407d5f0017bc59c6787
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-17T19:53:41Z

    initial commit for generic analyzers

commit 8b3757bae52d08d4b308bd0f996ff452c60cc7c9
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-19T19:43:04Z

    initial documentation

commit 27ea30b73855d7a3cf0cd9561d2089295ec03353
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-20T20:37:00Z

    implement GenericAnalyzerAssembler. TO DO: Tests

commit c429c1d8e6fa7a7473001289985a9265fcc4ff37
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-21T13:56:48Z

    Merge remote-tracking branch 'apache/master' into generic-text-analyzers
    grabbing updates to jena-text

commit 8f1fa7ccbf2cb05f2eed121831c39e07260ec18b
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-21T21:02:20Z

    adding GenericAnalyzer tests

commit d2f0561b99c957658261b3693e4a89892369a65a
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-22T17:29:04Z

    added parameters of type org.apache.lucene.analysis.Analyzer

commit 94b41be7553a4f955c0e41c868d94662bdd7236e
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-22T17:29:47Z

    added more tests

commit 57ded6a9c1f7d275de4f8e6294611a869407534d
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-22T21:15:58Z

    ignore: organize imports

commit a3bb8e41aeaf9be3540cf0a6be84cd9dc9b43b28
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-22T21:31:54Z

    added analyzer definitions: 1) DefinedAnalyzers for use in text:map; 2)
    add analyzers to Multilingual support based on BCP47 codes

commit 311efab2fd26a58406b29b64d74b41039292d080
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-23T14:18:35Z

    represent parameter types as resources like text:TypeSet instead of
    literal string

commit 5edb6c8758124fe8dd5a96d7b92949fc3ac1f61f
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-23T15:13:09Z

    factor DefinedAnalyzerAssembler and DefineAnalyzersAssembler into
    separate classes; move defined analyzer cache to Utils along side the
    language tagged analyzers since both caches have the same lifetime and
    similar uses.

commit cc579790631d92499ef44717e540ceaefcfe1d89
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-23T16:17:20Z

    Merge remote-tracking branch 'apache/master' into generic-text-analyzers
    merging jena-tex-es changes

commit fef4d22faeda09159cc2523e477571d1d23a85e7
Author: Chris Tomlinson <[email protected]>
Date:   2017-04-23T16:20:53Z

    ignore extras

----


> Generic Lucene Analyzers
> ------------------------
>
>                 Key: JENA-1326
>                 URL: https://issues.apache.org/jira/browse/JENA-1326
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Jena, Text
>    Affects Versions: Jena 3.2.0
>            Reporter: Code Ferret
>              Labels: Lucene, analyzers, jena-text
>             Fix For: Jena 3.3.0
>
>
> There are analyzers in the Lucene distribution bundled with jena-text that 
> cannot currently be referenced in the assembler configurations. Also, many 
> analyzers provide constructors that accept parameters such as stop word sets 
> and sets of stem exclusions that are not currently supported. Finally, 
> analyzers that do not appear in the Lucene distribution may be needed, and 
> there is currently no way to refer to such analyzers without modifying the 
> jena-text source.
> This issue proposes the addition of a jena-text assembler configuration 
> feature to permit the specification of generic Lucene Analyzers given a fully 
> qualified Class name and a list of parameters for a constructor of the Class 
> and to allow the naming of such specifications for use in the Multilingual 
> feature and use in other {{text:analyzer}} specifications.
> A {{text:GenericAnalyzer}} specification is similar to other 
> {{text:analyzer}} specifications:
> {noformat}
>            text:analyzer [
>                a text:GenericAnalyzer ;
>                text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
>                text:params (
>                     [ text:paramName "stopwords" ;
>                       text:paramType text:TypeSet ;
>                       text:paramValue ("the" "a" "an") ]
>                     [ text:paramName "stemExclusionSet" ;
>                       text:paramType text:TypeSet ;
>                       text:paramValue ("ing" "ed") ]
>                     )
>            ] .
> {noformat}
> The {{text:class}} is the fully qualified class name of the desired analyzer.
> The parameters may be of the following types:
> ||Type||Description||
> |{{text:TypeAnalyzer}}|a subclass of {{org.apache.lucene.analysis.Analyzer}}|
> |{{text:TypeBoolean}}|a Java {{boolean}}|
> |{{text:TypeFile}}|the {{String}} path to a file materialized as a {{java.io.FileReader}}|
> |{{text:TypeInt}}|a Java {{int}}|
> |{{text:TypeString}}|a Java {{String}}|
> |{{text:TypeSet}}|an {{org.apache.lucene.analysis.CharArraySet}}|
>  
> Although the list of types is not exhaustive, it is a simple matter to create 
> a wrapper Analyzer that reads a file with information that can be used to 
> initialize any sort of parameters that may be needed for a given Analyzer. 
> The provided types cover the most common cases.
> For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a 
> constructor with 4 parameters: a {{UserDictionary}}, a 
> {{JapaneseTokenizer.Mode}}, a {{CharArraySet}}, and a {{Set<String>}}. So a 
> simple wrapper can 
> extract the values needed for the various parameters with types not available 
> in this extension, construct the required instances, and instantiate the 
> {{JapaneseAnalyzer}}.
> Adding custom Analyzers such as the above wrapper analyzer is a simple matter 
> of adding the Analyzer class and any associated filters and tokenizer and so 
> on to the classpath for Jena. Also, all of the Analyzers that are included in 
> the Lucene distribution bundled with Jena are available as well.
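> As a sketch of how such a wrapper might be configured, assume a hypothetical 
> class {{org.example.WrappedJapaneseAnalyzer}} (not part of Lucene or Jena) 
> on the classpath, with a constructor that accepts the reader produced for a 
> single {{text:TypeFile}} parameter. The class name, parameter name, and file 
> path are illustrative assumptions only:
> {noformat}
>            text:analyzer [
>                a text:GenericAnalyzer ;
>                # hypothetical wrapper class, not part of Lucene or Jena
>                text:class "org.example.WrappedJapaneseAnalyzer" ;
>                text:params (
>                     # hypothetical settings file read by the wrapper
>                     [ text:paramName "settings" ;
>                       text:paramType text:TypeFile ;
>                       text:paramValue "conf/japanese-analyzer.properties" ]
>                     )
>            ] .
> {noformat}
> The wrapper would then read the file, build the {{JapaneseAnalyzer}} 
> arguments itself, and delegate to the resulting analyzer.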
> Each parameter object is specified with:
> - an optional {{text:paramName}} that may be used to document which parameter 
> is represented
> - a {{text:paramType}} which is one of: {{text:TypeAnalyzer, 
> text:TypeBoolean, text:TypeFile, text:TypeInt, text:TypeSet, 
> text:TypeString}}.
> - a {{text:paramValue}} which is an {{xsd:string}}, {{xsd:boolean}}, 
> {{xsd:int}}, or {{Resource}}.
> A {{text:TypeSet}} parameter _may have_ zero or more {{text:paramValue}}.
> All other parameter types _must have_ a single {{text:paramValue}} of the 
> appropriate type.
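> To illustrate single-valued parameters, the following sketch configures 
> Lucene's {{LimitTokenCountAnalyzer}}, which has a constructor taking an 
> {{Analyzer}}, an {{int}}, and a {{boolean}}. It assumes the parameters are 
> listed in constructor order; the parameter names are documentation only, and 
> the values are arbitrary:
> {noformat}
>            text:analyzer [
>                a text:GenericAnalyzer ;
>                text:class "org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer" ;
>                text:params (
>                     # each non-set parameter carries exactly one text:paramValue
>                     [ text:paramName "delegate" ;
>                       text:paramType text:TypeAnalyzer ;
>                       text:paramValue [ a text:StandardAnalyzer ] ]
>                     [ text:paramName "maxTokenCount" ;
>                       text:paramType text:TypeInt ;
>                       text:paramValue 10000 ]
>                     [ text:paramName "consumeAllTokens" ;
>                       text:paramType text:TypeBoolean ;
>                       text:paramValue false ]
>                     )
>            ] .
> {noformat}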
> An example configuration using the {{ShingleAnalyzerWrapper}} is:
> {noformat}
>     text:map (
>          [ text:field "text" ; 
>            text:predicate rdfs:label;
>            text:analyzer [
>                a text:GenericAnalyzer ;
>                text:class "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
>                text:params (
>                     [ text:paramName "defaultAnalyzer" ;
>                       text:paramType text:TypeAnalyzer ;
>                       text:paramValue [ a text:SimpleAnalyzer ]  ]
>                     [ text:paramName "maxShingleSize" ;
>                       text:paramType text:TypeInt ;
>                       text:paramValue 3 ]
>                     )
>            ] ]
>          ) .
> {noformat}
> The {{text:defineAnalyzers}} feature allows extending the Multilingual 
> support. Further, this feature can also be used to name analyzers defined via 
> {{text:GenericAnalyzer}} so that a single (perhaps complex) analyzer 
> configuration can be used in several places.
> The {{text:defineAnalyzers}} property is used with {{text:TextIndexLucene}} 
> to provide a list of analyzer definitions:
> {noformat}
>     <#indexLucene> a text:TextIndexLucene ;
>         text:directory <file:Lucene> ;
>         text:entityMap <#entMap> ;
>         text:defineAnalyzers (
>             [ text:addLang "sa-x-iast" ;
>               text:analyzer [ . . . ] ]
>             [ text:defineAnalyzer <#foo> ;
>               text:analyzer [ . . . ] ]
>         )
>         .
> {noformat}
> References to a defined analyzer may be made in the entity map like:
> {noformat}
>     text:analyzer [
>         a text:DefinedAnalyzer ;
>         text:useAnalyzer <#foo> ]
> {noformat}
> Multilingual support currently allows for a fixed set of ISO 2-letter codes 
> to be used to select from among built-in analyzers using the nullary 
> constructor associated with each analyzer. So if one wants to:
> * use a language not included, e.g., Brazilian; or 
> * use additional constructors defining stop words, stem exclusions and so on; 
> or 
> * refer to custom analyzers that might be associated with generalized BCP-47 
> language tags, such as {{sa-x-iast}} for Sanskrit in the IAST 
> transliteration, 
> then {{text:defineAnalyzers}} with {{text:addLang}} will add the desired 
> analyzers to the multilingual support so that fields with the appropriate 
> language tags will use the appropriate custom analyzer.
> When {{text:defineAnalyzers}} is used with {{text:addLang}}, 
> {{text:multilingualSupport}} is implicitly added if not already specified, 
> and a warning is put in the log. For example:
> {noformat}
>         text:defineAnalyzers (
>             [ text:addLang "sa-x-iast" ;
>               text:analyzer [ . . . ] ]
> {noformat}
> This adds an analyzer to be used when the {{text:langField}} has the value 
> {{sa-x-iast}} during indexing and search.
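> As a further sketch, the Brazilian case mentioned above could be covered 
> with Lucene's {{BrazilianAnalyzer}} and its constructor that takes a stop 
> word set. The language tag {{pt-BR}} and the stop words are illustrative 
> assumptions, not values taken from the PR:
> {noformat}
>         text:defineAnalyzers (
>             # "pt-BR" is an assumed tag for Brazilian Portuguese
>             [ text:addLang "pt-BR" ;
>               text:analyzer [
>                   a text:GenericAnalyzer ;
>                   text:class "org.apache.lucene.analysis.br.BrazilianAnalyzer" ;
>                   text:params (
>                        [ text:paramName "stopwords" ;
>                          text:paramType text:TypeSet ;
>                          text:paramValue ("o" "a" "de") ]
>                        )
>               ] ]
>         )
> {noformat}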
> Repeating a {{text:GenericAnalyzer}} specification for use with multiple 
> fields in an entity map
> may be cumbersome. The {{text:defineAnalyzer}} is used in an element of a 
> {{text:defineAnalyzers}} list to associate a resource with an analyzer so 
> that it may be referred to later in a {{text:analyzer}}
> object. Assuming that an analyzer definition such as the following appears 
> in the {{text:defineAnalyzers}} list:
> {noformat}
>     [ text:defineAnalyzer <#foo> ;
>       text:analyzer [ . . . ] ]
> {noformat}
>       
> then in a {{text:analyzer}} specification in an entity map, for example, a 
> reference to analyzer {{<#foo>}}
> is made via:
> {noformat}
>     text:map (
>          [ text:field "text" ; 
>            text:predicate rdfs:label;
>            text:analyzer [
>                a text:DefinedAnalyzer ;
>                text:useAnalyzer <#foo> ]
> {noformat}
> This makes it straightforward to refer to the same (possibly complex) 
> analyzer definition in multiple fields.
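> Putting the pieces together, a sketch of one definition shared by two fields 
> might look as follows. The field names, the choice of {{rdfs:comment}}, and 
> the {{EnglishAnalyzer}} settings are assumptions made for illustration:
> {noformat}
>     <#indexLucene> a text:TextIndexLucene ;
>         text:directory <file:Lucene> ;
>         text:entityMap <#entMap> ;
>         text:defineAnalyzers (
>             # define the analyzer once under the name <#foo>
>             [ text:defineAnalyzer <#foo> ;
>               text:analyzer [
>                   a text:GenericAnalyzer ;
>                   text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
>                   text:params (
>                        [ text:paramName "stopwords" ;
>                          text:paramType text:TypeSet ;
>                          text:paramValue ("the" "a" "an") ]
>                        )
>               ] ]
>         )
>         .
>
>     <#entMap> a text:EntityMap ;
>         text:entityField "uri" ;
>         text:defaultField "label" ;
>         text:map (
>              # both fields refer to the same defined analyzer
>              [ text:field "label" ;
>                text:predicate rdfs:label ;
>                text:analyzer [
>                    a text:DefinedAnalyzer ;
>                    text:useAnalyzer <#foo> ] ]
>              [ text:field "comment" ;
>                text:predicate rdfs:comment ;
>                text:analyzer [
>                    a text:DefinedAnalyzer ;
>                    text:useAnalyzer <#foo> ] ]
>         )
>         .
> {noformat}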



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
