[jira] [Created] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Erick Erickson (Created) (JIRA) Sun, 27 Nov 2011 08:55:04 -0800

Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent 
if they should
---------------------------------------------------------------------------------------------


                 Key: SOLR-2921
                 URL: https://issues.apache.org/jira/browse/SOLR-2921
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 3.6, 4.0
         Environment: All
            Reporter: Erick Erickson
            Assignee: Erick Erickson
            Priority: Minor


SOLR-2918, which drastically improves on the approach of SOLR-2438, creates a 
new MultiTermAwareComponent interface. This allows Solr to automatically 
assemble a "multiterm" analyzer that does the right thing vis-a-vis 
transforming the individual terms of a multi-term query (a wildcard query, for 
example) at query time. Examples are lower-casing, folding accents, etc. 
Currently (27-Nov-2011), the following classes implement 
MultiTermAwareComponent:

 * ASCIIFoldingFilterFactory
 * LowerCaseFilterFactory
 * LowerCaseTokenizerFactory
 * MappingCharFilterFactory
 * PersianCharFilterFactory

When users put any of the above in their query analyzer, Solr will "do the 
right thing" at query time, and the perennial question users have, "why didn't 
my wildcard query automatically lower-case (or accent fold or....) my terms?" 
will be gone. Die, question, die!
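
To make the idea concrete, here is a minimal sketch of how such a "multiterm" 
chain could be derived from a query analyzer. It assumes the assembly simply 
keeps the components that declare themselves multi-term aware and skips the 
rest. Component, MultiTermAware, ToyStemmer and MultiTermSketch are 
hypothetical stand-ins invented for this illustration; they are not Solr 
classes, and the real assembly code may well differ.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for one analysis step; not Solr's real API. */
interface Component {
    String apply(String text);
}

/** Hypothetical marker that mirrors the idea behind MultiTermAwareComponent. */
interface MultiTermAware {
    Component multiTermComponent();
}

/** Lower-casing is safe to apply to a wildcard term, so it is "aware". */
class LowerCase implements Component, MultiTermAware {
    public String apply(String text) { return text.toLowerCase(Locale.ROOT); }
    public Component multiTermComponent() { return this; }
}

/** Accent folding: decompose, then drop the combining marks. */
class AccentFold implements Component, MultiTermAware {
    public String apply(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
    public Component multiTermComponent() { return this; }
}

/** A crude plural stripper standing in for a stemmer; NOT multi-term aware. */
class ToyStemmer implements Component {
    public String apply(String text) {
        return text.endsWith("s") ? text.substring(0, text.length() - 1) : text;
    }
}

class MultiTermSketch {
    /** Keep only the components that say they are safe for multi-term queries. */
    static List<Component> multiTermChain(List<Component> queryChain) {
        List<Component> result = new ArrayList<>();
        for (Component c : queryChain) {
            if (c instanceof MultiTermAware) {
                result.add(((MultiTermAware) c).multiTermComponent());
            }
        }
        return result;
    }

    /** Run the text through every component in order. */
    static String analyze(List<Component> chain, String text) {
        for (Component c : chain) {
            text = c.apply(text);
        }
        return text;
    }
}
```

With a query chain of LowerCase, ToyStemmer and AccentFold, the derived chain 
contains only the two aware components, so the text of a wildcard term is 
lower-cased and accent-folded but never stemmed.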

Taking a quick look at the various FilterFactories that exist, though, there 
are a number that *might* be good candidates for implementing 
MultiTermAwareComponent. I really don't understand the correct behavior here 
well enough to know whether these should implement the interface or not, and 
this doesn't include other CharFilters or Tokenizers.

Actually implementing the interface is often trivial; see the classes above for 
examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which is the 
right thing in this case.
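
As a rough illustration of the two shapes an implementation can take, the 
sketch below shows a filter factory that returns itself and a tokenizer factory 
that returns a filter factory instead. The interface declared here is a local 
stand-in whose single method, getMultiTermComponent(), is assumed to match the 
one introduced by SOLR-2918; the Sketch* classes are hypothetical 
simplifications that work on plain strings, not the real Solr factories.

```java
import java.util.Locale;

/** Local stand-in; the method shape is an assumption about the Solr interface. */
interface MultiTermAwareComponent {
    Object getMultiTermComponent();
}

/** Hypothetical, string-based simplification of a lower-casing filter factory. */
class SketchLowerCaseFilterFactory implements MultiTermAwareComponent {
    String normalize(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    @Override
    public Object getMultiTermComponent() {
        // The filter is already safe to apply to a single term: return it as-is.
        return this;
    }
}

/** Hypothetical tokenizer factory that both splits on non-letters and lower-cases. */
class SketchLowerCaseTokenizerFactory implements MultiTermAwareComponent {
    String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
    }

    @Override
    public Object getMultiTermComponent() {
        // Splitting would break up a wildcard term, so hand back only the
        // lower-casing behavior, packaged as a filter factory.
        return new SketchLowerCaseFilterFactory();
    }
}
```

The design point is that the returned component only needs to reproduce the 
part of the behavior that makes sense for a single, untokenized term.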

Here is a quick cull of the Filters that, just from their names, might be 
candidates. If anyone wants to take any of them on, that would be great. If all 
you can do is provide test cases, I could probably do the code part; just let 
me know.

 * ArabicNormalizationFilterFactory
 * GreekLowerCaseFilterFactory
 * HindiNormalizationFilterFactory
 * ICUFoldingFilterFactory
 * ICUNormalizer2FilterFactory
 * ICUTransformFilterFactory
 * IndicNormalizationFilterFactory
 * ISOLatin1AccentFilterFactory
 * PersianNormalizationFilterFactory
 * RussianLowerCaseFilterFactory
 * TurkishLowerCaseFilterFactory


