Re: Are docs updated based on comparing the id before analysis?

Shawn Heisey Thu, 05 Feb 2015 16:55:05 -0800

On 2/5/2015 5:24 PM, Erick Erickson wrote:
> Hmmm, driving away from my client, I got to wondering about routing in
> SolrCloud. You'd have to apply the analysis chain _before_ you routed
> on ID, and I have no clue what would happen with things like the !
> operator in the id field.


I didn't even think about SolrCloud.  Fun.

> So to handle my "rule of thumb", which is that anything that a human
> could possibly enter should _not_ be case sensitive, the <uniqueKey>
> field needs to be
> 1> normalized as far as case is concerned at index time
> 2> have a query-time transformation done to match <1>. So something
> like this should do it assuming that
>     the indexer took care to uppercase the <uniqueKey>:
>     <fieldType name="eoe_test" class="solr.TextField">
>       <analyzer type="index">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.UpperCaseFilterFactory" />
>       </analyzer>
>     </fieldType>

I realize that what I'm saying below is outside "typical user" land, but
it might work.  For an advanced user it wouldn't even be all that
messy.  Proceeding into "thinking out loud" territory:

A custom UpdateRequestProcessor could do all the normalization on the
uniqueKey field at index time.  If we used that processor in combination
with a fieldType like the one you outlined above, I think it would
work.  The simple version of that processor would just be a
case-changing filter.
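
As a rough sketch of that simple version, here is only the normalization
step, written as plain Java so it runs without Solr on the classpath.  The
class name UniqueKeyCaseNormalizer and the use of a Map as a stand-in for
the input document are assumptions for illustration; a real processor
would extend UpdateRequestProcessor and do this work in processAdd.  It
uppercases so that it lines up with the query-side UpperCaseFilterFactory
in the fieldType quoted above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical core of a case-normalizing update processor (not a Solr class). */
final class UniqueKeyCaseNormalizer {
    private final String uniqueKeyField;

    UniqueKeyCaseNormalizer(String uniqueKeyField) {
        this.uniqueKeyField = uniqueKeyField;
    }

    /**
     * Returns a copy of the document with the uniqueKey value uppercased.
     * Other fields are left untouched. The Map stands in for a Solr input document.
     */
    Map<String, Object> normalize(Map<String, Object> doc) {
        Map<String, Object> copy = new HashMap<>(doc);
        Object id = copy.get(uniqueKeyField);
        if (id != null) {
            // Locale.ROOT keeps the case mapping independent of the JVM's default locale.
            copy.put(uniqueKeyField, id.toString().toUpperCase(Locale.ROOT));
        }
        return copy;
    }
}
```

With this in place, documents sent as "abc-123" and "ABC-123" end up with
the same uniqueKey value, so the second one would replace the first.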

Getting back to what a typical user wants to happen ... an update
processor could be included in Solr that figures out the configured
uniqueKey field and lowercases the input on that field.  We could
provide documentation showing how to insert it into the default update
chain to allow case-insensitive unique IDs.  If somebody needs more
complicated normalization (perhaps they want to use the ICU folding
class instead of Java's built-in lowercase capability, or do some really
wild stuff that's domain-specific), they can write their own processor,
and maybe even their own analysis component for the query side.
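
To illustrate the "simple" versus "more complicated" split, here is a
sketch with two interchangeable normalizers.  The names IdNormalizers,
LOWERCASE, NFKC_LOWERCASE and sameKey are made up for this example.  The
second normalizer uses the JDK's java.text.Normalizer as a stand-in and
is not ICU folding; it is only there to show a heavier rule plugged into
the same slot.  Whatever rule is chosen has to be applied identically on
the index side and the query side.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical normalizers for uniqueKey values (not part of Solr). */
final class IdNormalizers {
    private IdNormalizers() {}

    /** Simple version: Java's built-in lowercasing, independent of the default locale. */
    static final UnaryOperator<String> LOWERCASE =
        id -> id.toLowerCase(Locale.ROOT);

    /** Heavier stand-in: NFKC compatibility normalization, then lowercasing. Not ICU folding. */
    static final UnaryOperator<String> NFKC_LOWERCASE =
        id -> Normalizer.normalize(id, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);

    /** True when the indexed id and the queried id normalize to the same key. */
    static boolean sameKey(UnaryOperator<String> normalizer, String indexed, String queried) {
        return normalizer.apply(indexed).equals(normalizer.apply(queried));
    }
}
```

For example, IdNormalizers.sameKey(IdNormalizers.LOWERCASE, "Doc-42",
"DOC-42") is true, while full-width letters only match their ASCII
counterparts under NFKC_LOWERCASE.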

Thanks,
Shawn

