Github user osma commented on the pull request:
https://github.com/apache/jena/pull/97#issuecomment-153981033
Good questions @rvesse !
Right now (before this PR) one can use a few generic,
non-language-specific Analyzers: StandardAnalyzer, SimpleAnalyzer,
KeywordAnalyzer and LowerCaseKeywordAnalyzer.
Then there is MultilingualAnalyzer, which looks at the language tag of
literals and picks a language-specific Analyzer based on the language tag
(falling back on StandardAnalyzer in case there's no suitable Analyzer
implementation for the language). The list of language-specific Analyzers is
hardwired in the implementation though.
What this PR adds is a non-language-specific Analyzer that can be configured
in somewhat more detail: it is possible to select a Tokenizer and zero or
more TokenFilters. However, it does not look at language tags at all, and it
is also limited to a few recognized Tokenizers and TokenFilters, none of which
require any special parameters.
Things that were possible before:
* use StandardAnalyzer/SimpleAnalyzer/KeywordAnalyzer for everything
* use EnglishAnalyzer for "en" and FrenchAnalyzer for "fr" literals
(MultilingualAnalyzer does this)
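For reference, the per-language behaviour above is enabled with a flag in the jena-text assembler configuration. A sketch (the `text:multilingualSupport` property name and surrounding syntax are taken from the jena-text documentation and should be treated as assumptions here, since this PR does not touch them):

```turtle
# Sketch: a jena-text Lucene index with multilingual support enabled.
# With this flag, "en"-tagged literals go through EnglishAnalyzer,
# "fr"-tagged ones through FrenchAnalyzer, and so on, falling back
# to StandardAnalyzer for languages with no specific Analyzer.
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    text:multilingualSupport true ;   # assumed property name
    text:entityMap <#entMap> .
```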
Things that become possible with this PR:
* use KeywordTokenizer (i.e. don't split into tokens), but drop accents
with ASCIIFoldingFilter and make everything lowercase with LowerCaseFilter (my
original use case for JENA-1058)
* use WhitespaceTokenizer without filters (perhaps good for handling e.g. a
whitespace-separated list of product codes or URIs)
* dozens of other combinations of the non-language-specific Tokenizers and
TokenFilters, though probably only some combinations make any sense
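As an illustration of the first bullet, a sketch of the ConfigurableAnalyzer assembler syntax for the JENA-1058 use case (the exact vocabulary is an assumption based on the jena-text documentation; details may differ from the PR as reviewed):

```turtle
# Sketch: ConfigurableAnalyzer keeping each literal as a single token
# (KeywordTokenizer), then stripping accents and lowercasing it.
text:analyzer [
    a text:ConfigurableAnalyzer ;
    text:tokenizer text:KeywordTokenizer ;
    text:filters ( text:ASCIIFoldingFilter text:LowerCaseFilter )
] ;
```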
Things that are still not possible:
* use EnglishAnalyzer for "en" language but StandardAnalyzer for everything
else (in MultilingualAnalyzer the analyzers are hardwired)
* use language-specific analyzers when available but fall back on
SimpleAnalyzer (ditto, the fallback to StandardAnalyzer is hardwired in
MultilingualAnalyzer)
* use StandardAnalyzer with LengthFilter to remove excessively short or
long words (LengthFilter requires `min` and `max` parameters and there is no
way to pass those parameters to ConfigurableAnalyzer, so it doesn't support
LengthFilter)
In short, the universe of Analyzers (Tokenizer + TokenFilter combinations,
with or without special treatment for language tags) is potentially huge and
this PR tackles only one rather small part of it, but it expands the options in
a way that I think is useful.