[jira] [Updated] (SOLR-2921) Make any Filters, Tokenizers and CharFilters implement MultiTermAwareComponent if they should

Erick Erickson (Updated) (JIRA) Wed, 21 Mar 2012 08:36:05 -0700

     [ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Erick Erickson updated SOLR-2921:
---------------------------------

    Attachment: SOLR-2921-3x.patch

Here's a first cut at these. The tests in TestFoldingMultitermExtrasQuery are 
especially weak; any help here would be extremely welcome.

Basically, I stole the patterns from the associated filters and removed the 
ones that failed for reasons I didn't understand. I haven't checked the 
remaining ones all that carefully. I have some stuff coming up for most of the 
rest of today and wanted to get the first cut out in front of people.

The attached patch applies against 3x. I'll need to tweak it for trunk, but 
won't bother until after we finalize this.

I also haven't run the full test suite, so this patch should NOT be committed 
yet.

I'm not even going to try the following, since I don't know what to expect as 
proper results. If nobody steps up, I'll split these out into another JIRA, and 
hopefully someone with the appropriate knowledge (and keyboard) can volunteer:
   ArabicNormalizationFilterFactory
   HindiNormalizationFilterFactory
   IndicNormalizationFilterFactory
   PersianNormalizationFilterFactory
   ICUTransformFilterFactory  
                
> Make any Filters, Tokenizers and CharFilters implement 
> MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2921
>                 URL: https://issues.apache.org/jira/browse/SOLR-2921
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>         Environment: All
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: SOLR-2921-3x.patch
>
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr 
> to automatically assemble a "multiterm" analyzer that does the right thing 
> vis-a-vis transforming the individual terms of a multi-term query at query 
> time. Examples are: lower casing, folding accents, etc. Currently 
> (27-Nov-2011), the following classes implement MultiTermAwareComponent:
>  * ASCIIFoldingFilterFactory
>  * LowerCaseFilterFactory
>  * LowerCaseTokenizerFactory
>  * MappingCharFilterFactory
>  * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the 
> right thing" at query time and the perennial question users have, "why didn't 
> my wildcard query automatically lower-case (or accent fold or....) my terms?" 
> will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that 
> exist, there are a number of possibilities that *might* be good candidates 
> for implementing MultiTermAwareComponent. But I really don't understand the 
> correct behavior here well enough to know whether these should implement the 
> interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial; see the classes above 
> for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which 
> is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be 
> candidates. If anyone wants to take any of them on, that would be great. If 
> all you can do is provide test cases, I could probably do the code part, just 
> let me know.
> ArabicNormalizationFilterFactory
> GreekLowerCaseFilterFactory
> HindiNormalizationFilterFactory
> ICUFoldingFilterFactory
> ICUNormalizer2FilterFactory
> ICUTransformFilterFactory
> IndicNormalizationFilterFactory
> ISOLatin1AccentFilterFactory
> PersianNormalizationFilterFactory
> RussianLowerCaseFilterFactory
> TurkishLowerCaseFilterFactory
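
For anyone picking up one of the candidates above, here is a minimal Java sketch of the idea in the description. It deliberately does not use the real Solr classes: MultiTermAware, TermFilterFactory and every class ending in "Sketch" are hypothetical stand-ins declared locally so the example compiles on its own, and the real MultiTermAwareComponent signature may differ. It only shows the shape of the mechanism: components that opt in are collected into a "multiterm" chain, a tokenizer factory hands back a filter rather than itself, and components that do not opt in are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MultiTermSketch {

    /** Hypothetical stand-in for Solr's MultiTermAwareComponent (assumed shape, not the real interface). */
    interface MultiTermAware {
        Object getMultiTermComponent();
    }

    /** Hypothetical, heavily simplified filter factory: transforms a single term. */
    interface TermFilterFactory {
        String apply(String term);
    }

    /** Lower-casing is safe for wildcard/prefix terms, so this factory opts in and returns itself. */
    static class LowerCaseFilterFactorySketch implements TermFilterFactory, MultiTermAware {
        public String apply(String term) {
            return term.toLowerCase(Locale.ROOT);
        }
        public Object getMultiTermComponent() {
            return this;
        }
    }

    /** A lower-casing tokenizer must not tokenize a multi-term query, so it returns a *filter*. */
    static class LowerCaseTokenizerFactorySketch implements MultiTermAware {
        public Object getMultiTermComponent() {
            return new LowerCaseFilterFactorySketch();
        }
    }

    /** A crude suffix stripper that does NOT opt in, so it is left out of the multiterm chain. */
    static class SuffixStripFilterFactorySketch implements TermFilterFactory {
        public String apply(String term) {
            return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
        }
    }

    /** Collect only the components that declare themselves multi-term aware. */
    static List<TermFilterFactory> buildMultiTermChain(List<Object> queryChain) {
        List<TermFilterFactory> chain = new ArrayList<>();
        for (Object component : queryChain) {
            if (component instanceof MultiTermAware) {
                Object mt = ((MultiTermAware) component).getMultiTermComponent();
                if (mt instanceof TermFilterFactory) {
                    chain.add((TermFilterFactory) mt);
                }
            }
        }
        return chain;
    }

    /** Run a single wildcard/prefix term through the multiterm chain. */
    static String analyzeMultiTerm(List<TermFilterFactory> chain, String term) {
        String result = term;
        for (TermFilterFactory f : chain) {
            result = f.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Object> queryChain = Arrays.<Object>asList(
                new LowerCaseTokenizerFactorySketch(),
                new SuffixStripFilterFactorySketch());
        List<TermFilterFactory> multiTerm = buildMultiTermChain(queryChain);
        System.out.println(analyzeMultiTerm(multiTerm, "SoLR*"));
    }
}
```

In this sketch the hard part is not the code but the decision each factory has to make: whether applying it to a partial term such as a prefix or wildcard pattern still gives sensible results. That is exactly the question left open for the normalization filters listed above.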

