On 2/5/2015 6:40 AM, Erick Erickson wrote:
> And is this intended behavior?
>
> Either this is something we need to document better (or I've just
> missed it) or I'll file a JIRA.
>
> I have a <uniqueKey> defined as "lowercase", which is just a
> KeywordTokenizer followed by a LowercaseFilter. This definition does
> not detect duplicate IDs.

I was using this exact fieldType as my uniqueKey for quite a while.  I
never had a problem with it, but I read something saying that using a
TextField type for a uniqueKey was a potential recipe for disaster, even
if it would reliably produce a single token from the input, which that
analysis chain does.  I changed it to StrField and reindexed based on that.
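
For anyone comparing the two setups, here is a rough sketch of what they
might look like in schema.xml.  The field name "id" and the attribute
values are assumptions modeled on the stock example schema, not copied
from either of our configurations:

```xml
<!-- Before: a TextField-based type that lowercases a single token.
     KeywordTokenizer keeps the whole input as one token. -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- After: a plain StrField type, which applies no analysis at all. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- "id" is a hypothetical field name; the type was changed from
     "lowercase" to "string", followed by a full reindex. -->
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>

<uniqueKey>id</uniqueKey>
```

With the StrField version, "ABC" and "abc" are distinct keys, so any case
folding has to happen before the document is sent to Solr.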

For many reasons other than potential problems with Solr, it's a good
idea to fully normalize the unique identifier before it ever reaches
your source repository, rather than relying on Solr analysis to do it.

It looks like you are correct about what happens with analysis on the
uniqueKey field:

https://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document

IMHO a couple of things need to happen:

1) The documentation needs to be a lot clearer ... this limitation needs
to be mentioned in more places.  A note in the various schema.xml
examples would be excellent.  The reference guide may not have this
information ... I haven't been able to check thoroughly.
2) We should consider throwing a fatal error during core startup if the
uniqueKey is potentially ambiguous.  For instance, if it is a TextField,
it might have analysis that will be ignored.  Refusing to start the core
would bring the administrator's attention to a configuration mistake
that can lead to unexpected behavior.  Is a Trie type with a nonzero
precisionStep OK?  Internally that will produce multiple tokens, so I'm
not sure.

Thanks,
Shawn

