[ https://issues.apache.org/jira/browse/SOLR-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160612#comment-13160612 ]
Erick Erickson commented on SOLR-2921:
--------------------------------------
Mike:
Stemmers: not going to make them MultiTermAware. No way. No how. Not on my
watch. One succinct example and I'm convinced.
The beauty of the way Yonik and Robert directed this is that we can take care
of the 80% case, not provide things that are *that* surprising and still have
all the flexibility available to those who really need it. As Robert says, if
they really want some "interesting" behavior, they can specify the complete
chain.
Robert:
I guess I'm at a loss as to how to write tests for the various filters and
tokenizers I listed, which is why I'm reluctant to just make them
MultiTermAwareComponents. Do you have any suggestions as to how I could get
tests? I had enough surprises when I ran the tests in English that I'm
reluctant to just plow ahead. As far as I understand, Arabic is caseless for
instance.
I totally agree with your point that making the analysis components cope with
syntax is evil. Not going there either.
Maybe the right action is to wait for someone to volunteer to be the guinea pig
for the various filters. I suppose we could advertise for volunteers...
> Make any Filters, Tokenizers and CharFilters implement
> MultiTermAwareComponent if they should
> ---------------------------------------------------------------------------------------------
>
> Key: SOLR-2921
> URL: https://issues.apache.org/jira/browse/SOLR-2921
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Affects Versions: 3.6, 4.0
> Environment: All
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Minor
>
> SOLR-2438 creates a new MultiTermAwareComponent interface. This allows Solr
> to automatically assemble a "multiterm" analyzer that does the right thing
> vis-a-vis transforming the individual terms of a multi-term query at query
> time. Examples are: lower casing, folding accents, etc. Currently
> (27-Nov-2011), the following classes implement MultiTermAwareComponent:
> * ASCIIFoldingFilterFactory
> * LowerCaseFilterFactory
> * LowerCaseTokenizerFactory
> * MappingCharFilterFactory
> * PersianCharFilterFactory
> When users put any of the above in their query analyzer, Solr will "do the
> right thing" at query time and the perennial question users have, "why didn't
> my wildcard query automatically lower-case (or accent fold or....) my terms?"
> will be gone. Die question die!
> But taking a quick look, for instance, at the various FilterFactories that
> exist, there are a number of possibilities that *might* be good candidates
> for implementing MultiTermAwareComponent. But I really don't understand the
> correct behavior here well enough to know whether these should implement the
> interface or not. And this doesn't include other CharFilters or Tokenizers.
> Actually implementing the interface is often trivial, see the classes above
> for examples. Note that LowerCaseTokenizerFactory returns a *Filter*, which
> is the right thing in this case.
> Here is a quick cull of the Filters that, just from their names, might be
> candidates. If anyone wants to take any of them on, that would be great. If
> all you can do is provide test cases, I could probably do the code part, just
> let me know.
> * ArabicNormalizationFilterFactory
> * GreekLowerCaseFilterFactory
> * HindiNormalizationFilterFactory
> * ICUFoldingFilterFactory
> * ICUNormalizer2FilterFactory
> * ICUTransformFilterFactory
> * IndicNormalizationFilterFactory
> * ISOLatin1AccentFilterFactory
> * PersianNormalizationFilterFactory
> * RussianLowerCaseFilterFactory
> * TurkishLowerCaseFilterFactory
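
For anyone weighing one of the candidates above, here is a rough, self-contained sketch of the idea in the issue description: components that opt in are collected into a separate "multiterm" chain, and anything that does not opt in (a stemmer, say) is simply left out. Nothing below is Solr code. TermTransform, MultiTermAware, LowerCaseSketch and ToyStemmerSketch are hypothetical stand-ins invented for illustration, and the getMultiTermComponent() method name is an assumption modeled on the interface from SOLR-2438.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class MultiTermSketch {

    /** Hypothetical stand-in for a filter factory: maps one term to another. */
    interface TermTransform {
        String apply(String term);
    }

    /** Hypothetical stand-in for MultiTermAwareComponent (method name assumed). */
    interface MultiTermAware {
        /** Returns the component to apply to a multi-term (e.g. wildcard) query term. */
        Object getMultiTermComponent();
    }

    /** Lower-casing is safe on a partial term, so this component returns itself. */
    static class LowerCaseSketch implements TermTransform, MultiTermAware {
        public String apply(String term) { return term.toLowerCase(Locale.ROOT); }
        public Object getMultiTermComponent() { return this; }
    }

    /** Toy "stemmer" that strips a trailing 's'; deliberately NOT multi-term aware. */
    static class ToyStemmerSketch implements TermTransform {
        public String apply(String term) {
            return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
        }
    }

    /** Keeps only the components of the query chain that opted in. */
    static List<TermTransform> buildMultiTermChain(List<TermTransform> queryChain) {
        List<TermTransform> result = new ArrayList<>();
        for (TermTransform t : queryChain) {
            if (t instanceof MultiTermAware) {
                Object c = ((MultiTermAware) t).getMultiTermComponent();
                if (c instanceof TermTransform) {
                    result.add((TermTransform) c);
                }
            }
        }
        return result;
    }

    /** Runs a term through every component of a chain, in order. */
    static String analyze(List<TermTransform> chain, String term) {
        String out = term;
        for (TermTransform t : chain) {
            out = t.apply(out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<TermTransform> queryChain = new ArrayList<>();
        queryChain.add(new LowerCaseSketch());
        queryChain.add(new ToyStemmerSketch());

        List<TermTransform> multiTermChain = buildMultiTermChain(queryChain);

        // Full query chain: lower-cased, then "stemmed".
        System.out.println(analyze(queryChain, "Bus"));      // prints "bu"
        // Multiterm chain: lower-cased only, the stemmer is skipped.
        System.out.println(analyze(multiTermChain, "Bus"));  // prints "bus"
    }
}
```

In this toy model, making a filter multi-term aware amounts to implementing the one extra method, which is why the code side is usually the easy part. The hard part is deciding, per language, whether applying the filter to a partial term is actually safe.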