[ https://issues.apache.org/jira/browse/OAK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871271#comment-15871271 ]
Chetan Mehrotra commented on OAK-5692:
--------------------------------------

bq. What are all the CharFilters/Filters available? Is there a concise list w/ their params? (Ex. I think the PorterStem might support an ignoreCase param?)

See the javadocs below and look for subclasses:
* https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html
* CharFilters - https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/util/CharFilterFactory.html
* Filters - https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html

Per PorterStemFilterFactory, it does not appear to support ignoreCase [1].

bq. Are all the options in the link [2] supported? It's unclear if there is a 1:1 between oak lucene and solr's capabilities or if [2] is a loose example of the "types" of supported analyzers.

Support for factories was moved from Solr to Lucene, so it is roughly a 1:1 mapping here.

[1] https://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilterFactory.html

> Oak Lucene analyzers docs unclear on viable configurations
> ----------------------------------------------------------
>
>                 Key: OAK-5692
>                 URL: https://issues.apache.org/jira/browse/OAK-5692
>             Project: Jackrabbit Oak
>          Issue Type: Documentation
>            Reporter: David Gonzalez
>
> The Oak lucene docs [1] Analyzers section would benefit from clarification:
> Combining analyzer-based topics into a single ticket
> * If no analyzer is specified, what analyzer setup is used (at the very least
> some tokenizer must be used)
> * The docs mention the "default" analyzer
> ([oak:queryIndexDefinition]/analyzers/default). Can other analyzers be
> defined? How are they selected for use? Is the selection configurable?
> * By default is the analyzer index AND query time, unless specified by
> `type=index|query` property?
> * What is the naming for multiple analyzer nodes?
> Are all children of analyzers assumed to be an analyzer? Ex. If I want a
> special configuration for index and another for query, could I create:
> {noformat}
> ../myIndex/analyzers/indexAnalyzer@type=index
> .. define the index-time analyzer ...
> ../myIndex/analyzers/queryAnalyzer@type=query
> .. define the query-time analyzer ...
> {noformat}
> * How are languages handled? Ex. language specific stop words, synonyms, char
> mapping, and Stemming.
> * If
> [oak:queryIndexDefinition]/analyzers/default@class=org.apache.lucene.analysis.standard.StandardAnalyzer
> it appears the Standard Tokenizer and Standard Lowercase and Stop Filters
> are used. The Stop filter can be augmented with the well-named stopwords file.
> ** Can other charFilters/filters be layered on top of this "named" Analyzer
> (it seems not).
> * When the Stop Filter is used it provides the OOTB language-based stop
> words. If a custom stopwords file is provided, that list replaces the OOTB
> language-based list, requiring the developer to provide their own
> language-based stop words. Is this correct? This should be called out, with a
> link out to the catalog of OOTB stopword txt files for easy inclusion.
> * The Stop filter's words property must be a String, not String[], and the
> value is a comma-delimited String. Would be good to call this out.
> * What are all the CharFilters/Filters available? Is there a concise list w/
> their params? (Ex. I think the PorterStem might support an ignoreCase param?)
> * Synonym Filter syntax is unclear; it seems like there are 2 formats:
> directional x -> y and bi-directional (comma delimited); I could only get the
> latter to work.
> * Are all the options in the link [2] supported? It's unclear if there is a
> 1:1 between oak lucene and solr's capabilities or if [2] is a loose example
> of the "types" of supported analyzers.
> * For something like the PatternReplaceCharFilterFactory [3], how do
> you define multiple pattern mappings, as IIUC the charFilter node MUST be
> named:
> {noformat}.../charFilters/PatternReplace{noformat}
> so you can't have multiple "PatternReplace" named nodes, each with its own
> "@pattern" and "@replace" properties. It seems like there is only support for
> a single object for each Factory type?
>
> Generally this seems like the handiest resource:
> https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters
>
> [1] http://jackrabbit.apache.org/oak/docs/query/lucene.html
> [2] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Specifying_an_Analyzer_in_the_schema
> [3] https://cwiki.apache.org/confluence/display/solr/CharFilterFactories

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
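
On the Synonym Filter syntax question in the issue: in the Solr synonyms file format, directional rules are written with `=>` (not `->`), and comma-separated terms on a line without `=>` are treated as equivalents. The sketch below illustrates only those two rule shapes. It assumes Oak hands the synonyms file to Lucene's SynonymFilterFactory in its default `solr` format; the function name `parse_synonym_rules` is hypothetical, and this is not Lucene's actual parser (no escaping, no tokenization).

```python
from typing import Dict, List


def parse_synonym_rules(text: str, expand: bool = True) -> Dict[str, List[str]]:
    """Parse a simplified subset of the Solr synonyms file format.

    Illustration only (hypothetical helper, not Lucene's SolrSynonymParser):
    terms are treated as plain strings, with no escaping or analysis.
    """
    mapping: Dict[str, List[str]] = {}

    def add(source: str, targets: List[str]) -> None:
        # Append targets for a source term, skipping duplicates.
        bucket = mapping.setdefault(source, [])
        for target in targets:
            if target not in bucket:
                bucket.append(target)

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Directional rule: each left-hand term maps to the right-hand terms.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for source in (s.strip() for s in left.split(",")):
                if source:
                    add(source, targets)
        else:
            # Equivalence rule: comma-separated terms on one line.
            terms = [t.strip() for t in line.split(",") if t.strip()]
            if not terms:
                continue
            # expand=True: every term maps to all terms;
            # expand=False: every term collapses to the first term.
            targets = terms if expand else [terms[0]]
            for term in terms:
                add(term, targets)
    return mapping


rules = "# sample rules\ni-pod, i pod => ipod\ncouch, sofa\n"
print(parse_synonym_rules(rules))
```

Note that a line such as `x -> y` contains no `=>`, so under this reading it would be taken as one literal term rather than a directional rule, which may be why only the comma-delimited form appeared to work.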