[jira] [Updated] (OAK-2177) Configurable Analyzer in Lucene index

Chetan Mehrotra (JIRA) Sun, 07 Dec 2014 23:31:33 -0800

     [ https://issues.apache.org/jira/browse/OAK-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chetan Mehrotra updated OAK-2177:
---------------------------------
    Attachment: OAK-2177.patch

h2. NodeState content based analyzer creation

Analyzers can be configured as part of the index definition via the {{analyzers}} node. 
The default analyzer is configured via the {{analyzers/default}} node.

{noformat}
+ sampleIndex
    - jcr:primaryType = "oak:QueryIndexDefinition"
    + analyzers
        + default
        + pathText
        ...
{noformat}

h3. Specify analyzer class directly

To use one of the out-of-the-box analyzers, configure its class directly:

{noformat}
+ analyzers
        + default
            - class = "org.apache.lucene.analysis.standard.StandardAnalyzer"
            - luceneMatchVersion = "LUCENE_47" (optional)
{noformat}

To conform to a specific Lucene version, specify it via {{luceneMatchVersion}}; 
otherwise Oak uses a default version that depends on the version of Lucene it 
ships with.

A stopword file can also be provided via a {{stopwords}} {{nt:file}} node under 
the analyzer node:

{noformat}
+ analyzers
        + default
            - class = "org.apache.lucene.analysis.standard.StandardAnalyzer"
            - luceneMatchVersion = "LUCENE_47" (optional)
            + stopwords (nt:file)
{noformat}

h3. Create analyzer via composition

Analyzers can also be composed from {{Tokenizers}}, {{TokenFilters}} and 
{{CharFilters}}. This is similar to the support in Solr, where analyzers can be 
configured in XML [1].

{noformat}
+ analyzers
        + default
            + charFilters (nt:unstructured) //The charFilters need to be ordered
                + HTMLStrip
                + Mapping
            + tokenizer
                - name = "Standard"
            + filters (nt:unstructured) //The filters need to be ordered
                + LowerCase
                + Stop
                    - stopWordFiles = "stop1.txt, stop2.txt"
                    + stop1.txt (nt:file)
                    + stop2.txt (nt:file)
                + PorterStem
{noformat}
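
In the example above, {{stopWordFiles}} lists two file names and each has a matching {{nt:file}} child node under the {{Stop}} filter. A minimal sketch of how such a property value could be turned into the child node names to look up is given here. The class and method names are hypothetical, and splitting on commas with surrounding whitespace trimmed is an assumption about the format, not Oak's actual parsing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Oak or Lucene.
final class StopWordFiles {

    // Splits a comma-separated property value such as "stop1.txt, stop2.txt"
    // into the names of the child nt:file nodes expected under the filter node.
    // Assumption: names are separated by commas and may be padded with spaces.
    static List<String> childNodeNames(String propertyValue) {
        List<String> names = new ArrayList<>();
        for (String part : propertyValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints the node names derived from the example configuration.
        for (String name : childNodeNames("stop1.txt, stop2.txt")) {
            System.out.println(name);
        }
    }
}
```

With the value from the example, this yields the two names {{stop1.txt}} and {{stop2.txt}}, which correspond to the two child nodes shown under {{Stop}}.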

Points to note:
* The names of filters, charFilters and the tokenizer are formed by removing 
the factory suffix from the class name. So
** org.apache.lucene.analysis.standard.StandardTokenizerFactory -> Standard
** org.apache.lucene.analysis.charfilter.MappingCharFilterFactory -> Mapping
** org.apache.lucene.analysis.core.StopFilterFactory -> Stop
* Any config parameter required by the factory is specified as a property of 
that node
* If the factory needs to load a file, e.g. stop words from a file, the file 
content can be provided by creating a child {{nt:file}} node named after the 
file
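
The naming rule from the first point can be sketched in plain Java as follows. This is only an illustration under assumptions: the class name {{FactoryNames}}, the method {{shortName}} and the exact list of suffixes are hypothetical and are not taken from the patch. The sketch drops the package, then strips the first matching factory suffix, which gives {{Standard}} for the tokenizer factory, matching the tokenizer {{name}} used in the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Oak or Lucene.
final class FactoryNames {

    // "CharFilterFactory" is checked before "FilterFactory" so that
    // MappingCharFilterFactory becomes "Mapping" and not "MappingChar".
    private static final List<String> SUFFIXES =
            Arrays.asList("CharFilterFactory", "TokenizerFactory", "FilterFactory");

    // Derives the short config name from a fully qualified factory class name.
    // Returns the simple class name unchanged if no known suffix matches.
    static String shortName(String className) {
        String simple = className.substring(className.lastIndexOf('.') + 1);
        for (String suffix : SUFFIXES) {
            if (simple.endsWith(suffix) && simple.length() > suffix.length()) {
                return simple.substring(0, simple.length() - suffix.length());
            }
        }
        return simple;
    }

    public static void main(String[] args) {
        // prints "Standard"
        System.out.println(shortName("org.apache.lucene.analysis.standard.StandardTokenizerFactory"));
        // prints "Mapping"
        System.out.println(shortName("org.apache.lucene.analysis.charfilter.MappingCharFilterFactory"));
        // prints "Stop"
        System.out.println(shortName("org.apache.lucene.analysis.core.StopFilterFactory"));
    }
}
```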

[1] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
 


> Configurable Analyzer in Lucene index
> -------------------------------------
>
>                 Key: OAK-2177
>                 URL: https://issues.apache.org/jira/browse/OAK-2177
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: oak-lucene
>    Affects Versions: 1.1.0
>            Reporter: Tommaso Teofili
>            Assignee: Chetan Mehrotra
>         Attachments: OAK-2177.patch
>
>
> Currently the _OakAnalyzer_ is used by default for each Lucene field, but 
> sometimes a different analyzer is needed.
> It should be possible to make that configurable to support things like 
> multiple languages, stopword filtering, synonym expansion, stemming, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
