[
https://issues.apache.org/jira/browse/JENA-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067963#comment-16067963
]
ASF subversion and git services commented on JENA-1326:
-------------------------------------------------------
Commit 1800246 from [~andy.seaborne] in branch 'site/trunk'
[ https://svn.apache.org/r1800246 ]
JENA-1326: Revised text search documentation
> Generic Lucene Analyzers
> ------------------------
>
> Key: JENA-1326
> URL: https://issues.apache.org/jira/browse/JENA-1326
> Project: Apache Jena
> Issue Type: New Feature
> Components: Jena, Text
> Affects Versions: Jena 3.2.0
> Reporter: Code Ferret
> Assignee: Andy Seaborne
> Labels: Lucene, analyzers, jena-text
> Fix For: Jena 3.4.0
>
>
> There are analyzers in the Lucene distribution bundled with jena-text that
> cannot currently be referenced in the assembler configurations. Also, many
> analyzers provide constructors that accept parameters such as stop word sets
> and sets of stem exclusions that are not currently supported. Finally, there
> are analyzers outside the Lucene distribution that may be needed, and there
> is currently no way to refer to such analyzers without modifying the
> jena-text source.
> This issue proposes the addition of a jena-text assembler configuration
> feature to permit the specification of generic Lucene Analyzers given a fully
> qualified Class name and a list of parameters for a constructor of the Class
> and to allow the naming of such specifications for use in the Multilingual
> feature and use in other {{text:analyzer}} specifications.
> A {{text:GenericAnalyzer}} specification is similar to other
> {{text:analyzer}} specifications:
> {noformat}
> text:analyzer [
> a text:GenericAnalyzer ;
> text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
> text:params (
> [ text:paramName "stopwords" ;
> text:paramType text:TypeSet ;
> text:paramValue ("the" "a" "an") ]
> [ text:paramName "stemExclusionSet" ;
> text:paramType text:TypeSet ;
> text:paramValue ("ing" "ed") ]
> )
> ] .
> {noformat}
> The {{text:class}} is the fully qualified class name of the desired analyzer.
> The parameters may be of the following types:
> || Type || Description ||
> |{{text:TypeAnalyzer}}|a subclass of {{org.apache.lucene.analysis.Analyzer}}|
> |{{text:TypeBoolean}}|a java {{boolean}}|
> |{{text:TypeFile}}|the {{String}} path to a file materialized as a {{java.io.FileReader}}|
> |{{text:TypeInt}}|a java {{int}}|
> |{{text:TypeString}}|a java {{String}}|
> |{{text:TypeSet}}|an {{org.apache.lucene.analysis.CharArraySet}}|
>
> Although the list of types is not exhaustive, it is a simple matter to create
> a wrapper Analyzer that reads a file with information that can be used to
> initialize any sort of parameters that may be needed for a given Analyzer.
> The provided types cover the most common cases.
> For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a
> constructor with 4 parameters: a {{UserDictionary}}, a
> {{JapaneseTokenizer.Mode}}, a {{CharArraySet}}, and a {{Set<String>}}. So a
> simple wrapper can
> extract the values needed for the various parameters with types not available
> in this extension, construct the required instances, and instantiate the
> {{JapaneseAnalyzer}}.
> Adding custom Analyzers such as the above wrapper analyzer is a simple matter
> of adding the Analyzer class, along with any associated filters and
> tokenizers, to the classpath for Jena. All of the Analyzers that are included
> in the Lucene distribution bundled with Jena are available as well.
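> As a sketch of how such a wrapper might then be referenced, assume a
> hypothetical class {{org.example.JapaneseAnalyzerWrapper}} (not part of Jena
> or Lucene) whose single constructor parameter is a {{java.io.FileReader}}
> over a settings file. The class name, parameter name, and path are all
> placeholders:
> {noformat}
> text:analyzer [
>     a text:GenericAnalyzer ;
>     # hypothetical wrapper class, assumed to be on the Jena classpath
>     text:class "org.example.JapaneseAnalyzerWrapper" ;
>     text:params (
>         # text:TypeFile hands the constructor a java.io.FileReader for this path
>         [ text:paramName "settings" ;
>           text:paramType text:TypeFile ;
>           text:paramValue "/path/to/japanese-analyzer.conf" ]
>     )
> ] .
> {noformat}
> The wrapper itself would be responsible for parsing the file and building the
> {{JapaneseAnalyzer}} arguments.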
> Each parameter object is specified with:
> - an optional {{text:paramName}} that may be used to document which parameter
> is represented
> - a {{text:paramType}} which is one of: {{text:TypeAnalyzer,
> text:TypeBoolean, text:TypeFile, text:TypeInt, text:TypeSet,
> text:TypeString}}.
> - a {{text:paramValue}} which is an {{xsd:string}}, {{xsd:boolean}},
> {{xsd:int}}, or {{Resource}}.
> A {{text:TypeSet}} parameter _may have_ zero or more {{text:paramValue}}.
> All other parameter types _must have_ a single {{text:paramValue}} of the
> appropriate type.
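> These rules can be illustrated with a sketch that uses
> {{org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer}}, assuming
> its three-argument constructor (delegate analyzer, maximum token count,
> consume-all-tokens flag) is present in the bundled Lucene version. The
> parameters are listed in the constructor's declared order, the
> {{text:paramName}} values serve only as documentation, and the count of 1000
> is an arbitrary illustration:
> {noformat}
> text:analyzer [
>     a text:GenericAnalyzer ;
>     text:class "org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer" ;
>     text:params (
>         # text:TypeAnalyzer: exactly one value, itself an analyzer specification
>         [ text:paramName "delegate" ;
>           text:paramType text:TypeAnalyzer ;
>           text:paramValue [ a text:StandardAnalyzer ] ]
>         # text:TypeInt: exactly one integer value
>         [ text:paramName "maxTokenCount" ;
>           text:paramType text:TypeInt ;
>           text:paramValue 1000 ]
>         # text:TypeBoolean: exactly one boolean value
>         [ text:paramName "consumeAllTokens" ;
>           text:paramType text:TypeBoolean ;
>           text:paramValue false ]
>     )
> ] .
> {noformat}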
> An example configuration using the {{ShingleAnalyzerWrapper}} is:
> {noformat}
> text:map (
> [ text:field "text" ;
> text:predicate rdfs:label;
> text:analyzer [
> a text:GenericAnalyzer ;
> text:class
> "org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper" ;
> text:params (
> [ text:paramName "defaultAnalyzer" ;
> text:paramType text:TypeAnalyzer ;
> text:paramValue [ a text:SimpleAnalyzer ] ]
> [ text:paramName "maxShingleSize" ;
> text:paramType text:TypeInt ;
> text:paramValue 3 ]
> )
> ] ]
> ) .
> {noformat}
> The {{text:defineAnalyzers}} feature allows the Multilingual support to be
> extended. It can also be used to name analyzers defined via
> {{text:GenericAnalyzer}} so that a single (perhaps complex) analyzer
> configuration can be used in several places.
> {{text:defineAnalyzers}} is used with {{text:TextIndexLucene}} to provide a
> list of analyzer definitions:
> {noformat}
> <#indexLucene> a text:TextIndexLucene ;
> text:directory <file:Lucene> ;
> text:entityMap <#entMap> ;
> text:defineAnalyzers (
> [ text:addLang "sa-x-iast" ;
> text:analyzer [ . . . ] ]
> [ text:defineAnalyzer <#foo> ;
> text:analyzer [ . . . ] ]
> )
> .
> {noformat}
> A defined analyzer may be referenced in the entity map as follows:
> {noformat}
> text:analyzer [
> a text:DefinedAnalyzer ;
> text:useAnalyzer <#foo> ]
> {noformat}
> Multilingual support currently allows a fixed set of ISO 2-letter codes to be
> used to select from among built-in analyzers, using the nullary constructor
> associated with each analyzer. To
> * use a language not included, e.g., Brazilian; or
> * use additional constructors defining stop words, stem exclusions and so on;
> or
> * refer to custom analyzers that might be associated with generalized BCP-47
> language tags, such as {{sa-x-iast}} for Sanskrit in the IAST
> transliteration,
> use {{text:defineAnalyzers}} with {{text:addLang}}, which adds the desired
> analyzers to the multilingual support so that fields with the appropriate
> language tags use the appropriate custom analyzer.
> When {{text:defineAnalyzers}} is used with {{text:addLang}},
> {{text:multilingualSupport}} is implicitly enabled if not already specified,
> and a warning is written to the log. For example:
> {noformat}
> text:defineAnalyzers (
> [ text:addLang "sa-x-iast" ;
> text:analyzer [ . . . ] ]
> {noformat}
> This adds an analyzer to be used when the {{text:langField}} has the value
> {{sa-x-iast}} during indexing and search.
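> For the case of a language not included, a sketch might look as follows. It
> assumes that {{org.apache.lucene.analysis.br.BrazilianAnalyzer}} is available
> in the bundled Lucene distribution with a constructor taking a stop word set.
> The tag {{pt-BR}} and the stop words are illustrative choices, not taken from
> this issue:
> {noformat}
> text:defineAnalyzers (
>     [ text:addLang "pt-BR" ;   # assumed language tag for Brazilian Portuguese
>       text:analyzer [
>           a text:GenericAnalyzer ;
>           text:class "org.apache.lucene.analysis.br.BrazilianAnalyzer" ;
>           text:params (
>               # illustrative stop words only
>               [ text:paramName "stopwords" ;
>                 text:paramType text:TypeSet ;
>                 text:paramValue ("de" "a" "o" "que") ]
>           )
>       ] ]
> )
> {noformat}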
> Repeating a {{text:GenericAnalyzer}} specification for use with multiple
> fields in an entity map may be cumbersome. {{text:defineAnalyzer}} is used in
> an element of a {{text:defineAnalyzers}} list to associate a resource with an
> analyzer so that it may be referred to later in a {{text:analyzer}} object.
> Assuming that an analyzer definition such as the following appears in the
> {{text:defineAnalyzers}} list:
> {noformat}
> [ text:defineAnalyzer <#foo> ;
> text:analyzer [ . . . ] ]
> {noformat}
>
> then in a {{text:analyzer}} specification in an entity map, for example, a
> reference to analyzer {{<#foo>}} is made via:
> {noformat}
> text:map (
> [ text:field "text" ;
> text:predicate rdfs:label;
> text:analyzer [
> a text:DefinedAnalyzer ;
> text:useAnalyzer <#foo> ]
> {noformat}
> This makes it straightforward to refer to the same (possibly complex)
> analyzer definition in multiple fields.
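> A sketch of such reuse across two fields follows. The field names and the
> choice of {{rdfs:comment}} as the second predicate are illustrative
> assumptions, and {{<#foo>}} is assumed to have been defined in the
> {{text:defineAnalyzers}} list as above:
> {noformat}
> text:map (
>     [ text:field "label" ;
>       text:predicate rdfs:label ;
>       text:analyzer [
>           a text:DefinedAnalyzer ;
>           text:useAnalyzer <#foo> ] ]
>     # the second field reuses the same definition instead of repeating it
>     [ text:field "comment" ;
>       text:predicate rdfs:comment ;
>       text:analyzer [
>           a text:DefinedAnalyzer ;
>           text:useAnalyzer <#foo> ] ]
> ) .
> {noformat}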