Github user osma commented on the pull request:
https://github.com/apache/jena/pull/97#issuecomment-153981033
Good questions @rvesse !
Right now (before this PR) one can use a few generic,
non-language-specific Analyzers: StandardAnalyzer, SimpleAnalyzer,
KeywordAnalyzer and LowerCaseKeywordAnalyzer.
Then there is MultilingualAnalyzer, which looks at the language tag of
literals and picks a language-specific Analyzer based on the language tag
(falling back on StandardAnalyzer in case there's no suitable Analyzer
implementation for the language). The list of language-specific Analyzers is
hardwired in the implementation though.
What this PR adds is a non-language-specific Analyzer that can be configured
in somewhat more detail: it is possible to select a Tokenizer and zero or
more TokenFilters. However, it does not look at language tags at all, and it
is also limited to a few recognized Tokenizers and TokenFilters, none of which
require any special parameters.
Things that were possible before:
* use StandardAnalyzer/SimpleAnalyzer/KeywordAnalyzer for everything
* use EnglishAnalyzer for "en" and FrenchAnalyzer for "fr" literals
(MultilingualAnalyzer does this)
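For reference, the per-language behaviour above is enabled with a flag in the jena-text assembler configuration. A sketch (the `text:multilingualSupport` property name and surrounding syntax are taken from the jena-text documentation and should be treated as assumptions here, since this PR does not touch them):

```turtle
# Sketch: a jena-text Lucene index with multilingual support enabled.
# With this flag, "en"-tagged literals go through EnglishAnalyzer,
# "fr"-tagged ones through FrenchAnalyzer, and so on, falling back
# to StandardAnalyzer for languages with no specific Analyzer.
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    text:multilingualSupport true ;   # assumed property name
    text:entityMap <#entMap> .
```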
Things that become possible with this PR:
* use KeywordTokenizer (i.e. don't split into tokens), but drop accents
with ASCIIFoldingFilter and make everything lowercase with LowerCaseFilter (my
original use case for JENA-1058)
* use WhitespaceTokenizer without filters (perhaps good for handling e.g. a
whitespace-separated list of product codes or URIs)
* dozens of other combinations of the non-language-specific Tokenizers and
TokenFilters, though probably only some combinations make any sense
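As an illustration of the first bullet, a sketch of the ConfigurableAnalyzer assembler syntax for the JENA-1058 use case (the exact vocabulary is an assumption based on the jena-text documentation; details may differ from the PR as reviewed):

```turtle
# Sketch: ConfigurableAnalyzer keeping each literal as a single token
# (KeywordTokenizer), then stripping accents and lowercasing it.
text:analyzer [
    a text:ConfigurableAnalyzer ;
    text:tokenizer text:KeywordTokenizer ;
    text:filters ( text:ASCIIFoldingFilter text:LowerCaseFilter )
] ;
```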
Things that are still not possible:
* use EnglishAnalyzer for "en" language but StandardAnalyzer for everything
else (in MultilingualAnalyzer the analyzers are hardwired)
* use language-specific analyzers when available but fall back on
SimpleAnalyzer (ditto, the fallback to StandardAnalyzer is hardwired in
MultilingualAnalyzer)
* use StandardAnalyzer with LengthFilter to remove excessively short or
long words (LengthFilter requires `min` and `max` parameters and there is no
way to pass those parameters to ConfigurableAnalyzer, so it doesn't support
LengthFilter)
In short, the universe of Analyzers (Tokenizer + TokenFilter combinations,
with or without special treatment for language tags) is potentially huge and
this PR tackles only one rather small part of it, but it expands the options in
a way that I think is useful.