[jira] [Commented] (OAK-5692) Oak Lucene analyzers docs unclear on viable configurations

Chetan Mehrotra (JIRA) Thu, 16 Feb 2017 22:40:32 -0800

    [ https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871271#comment-15871271 ]


Chetan Mehrotra commented on OAK-5692:
--------------------------------------

bq. What are all the CharFilters/Filters available? Is there a concise list w/ 
their params? (Ex. I think the PorterStem might support an ignoreCase param?)

See the Javadocs for the factory base classes below and look at their 
subclasses:
* CharFilters - 
https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/util/CharFilterFactory.html
* Filters - 
https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html

Per the PorterStemFilterFactory Javadoc [1], it does not appear to support an 
ignoreCase parameter.
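
When browsing those Javadocs, it helps to know how a factory class name relates 
to the short name used for the node (e.g. PatternReplaceCharFilterFactory and 
.../charFilters/PatternReplace from the issue description). The following is a 
minimal sketch of that naming convention only, assuming the short name is the 
simple class name with its factory suffix removed. FactoryNames and shortName 
are hypothetical helpers, not part of Oak or Lucene, and unlike Lucene this 
sketch guesses the factory kind from the suffix alone.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not an Oak or Lucene API): derives the short name
// from a factory's simple class name by stripping the factory suffix.
final class FactoryNames {

    // Most specific suffixes first, so "CharFilterFactory" is tried
    // before the shorter "FilterFactory".
    private static final List<String> SUFFIXES = Arrays.asList(
            "TokenFilterFactory",
            "CharFilterFactory",
            "TokenizerFactory",
            "FilterFactory");

    static String shortName(String simpleClassName) {
        for (String suffix : SUFFIXES) {
            if (simpleClassName.endsWith(suffix)
                    && simpleClassName.length() > suffix.length()) {
                return simpleClassName.substring(
                        0, simpleClassName.length() - suffix.length());
            }
        }
        // Not a recognised factory name: return it unchanged.
        return simpleClassName;
    }

    public static void main(String[] args) {
        List<String> factories = Arrays.asList(
                "PorterStemFilterFactory",
                "PatternReplaceCharFilterFactory",
                "StandardTokenizerFactory");
        for (String factory : factories) {
            System.out.println(factory + " -> " + shortName(factory));
        }
    }
}
```

The parameters each factory accepts are not derivable this way; they still 
have to be read from the Javadoc of the specific factory class.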

bq. Are all the options in the link [2] supported? It's unclear if there is a 
1:1 between oak lucene and solr's capabilities or if [2] is a loose example of 
the "types" of supported analyzers.

Support for the analysis factories was moved from Solr to Lucene, so it is 
roughly a 1:1 mapping here.

[1] 
https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilterFactory.html

> Oak Lucene analyzers docs unclear on viable configurations
> ----------------------------------------------------------
>
>                 Key: OAK-5692
>                 URL: https://issues.apache.org/jira/browse/OAK-5692
>             Project: Jackrabbit Oak
>          Issue Type: Documentation
>            Reporter: David Gonzalez
>
> The Oak lucene docs [1] > Analyzers section would benefit from clarification:
> Combining analyzer-based topics into a single ticket
> * If no analyzer is specified, what analyzer setup is used (at the very least 
> some tokenizer must be used)
> * The docs mention the "default" analyzer 
> ([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be 
> defined? How are they selected for use? is the selection configurable?
> * By default is the analyzer index AND query time, unless specified by 
> `type=index|query` property?
> * What is the naming for multiple analyzer nodes? Are all children of 
> analyzers assumed to be an analyzer? Ex. If I want a special configuration for 
> index and another for query, could I create:
> {noformat}
> ../myIndex/analyzers/indexAnalyzer@type=index
> .. define the index-time analyzer ...
> ../myIndex/analyzers/queryAnalyzer@type=query
> .. define the query-time analyzer ...
> {noformat}
> * How are languages handled? Ex. language specific stop words, synonyms, char 
> mapping,  and Stemming.
> * If 
> [oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
>  it appears the Standard Tokenizer and Standard Lowercase and Stop Filters 
> are used. The Stop filter can be augmented w the well-named stopwords file.
> ** Can other charFilters/filters be layered on top of this "named" Analyzer 
> (it seems not).
> * When the Stop Filter is used it provides the OOTB language-based stop 
> words. If a custom stopwords file is provided, that list replaces the OOTB 
> lang-based list, requiring the developer to provide their own language-based 
> stop words. Is this correct? This should be called out, with a link to the 
> catalog of OOTB stopword txt files for easy inclusion.
> * The Stop filters words property must be a String not String[] and the value 
> is a comma delimited String value. Would be good to call this out.
> * What are all the CharFilters/Filters available? Is there a concise list w/ 
> their params? (Ex. I think the PorterStem might support an ignoreCase param?)
> * Synonym Filter syntax is unclear; it seems like there are 2 formats: 
> directional x -> y and bi-directional (comma delimited); I could only get the 
> latter to work.
> * Are all the options in the link [2] supported? It's unclear if there is a 
> 1:1 between oak lucene and solr's capabilities or if [2] is a loose example 
> of the "types" of supported analyzers.
> * For something like the PatternReplaceCharFilterFactory [3], how do 
> you define multiple pattern mappings, as IIUC the charFilter node MUST be 
> named:
> {noformat}.../charFilters/PatternReplace{noformat} so you can't have multiple 
> "PatternReplace" named nodes, each with its own "@pattern" and "@replace" 
> properties.  It seems like there is only support for a single object for each 
> Factory type?
> Generally this seems like the handiest resource: 
> https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters
> [1]  http://jackrabbit.apache.org/oak/docs/query/lucene.html
> [2] 
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
> [3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
