Are docs updated based on comparing the id before analysis?

Erick Erickson Thu, 05 Feb 2015 05:41:35 -0800

And is this intended behavior?

Either this is something we need to document better (or I've just missed
it) or I'll file a JIRA.


I have a <uniqueKey> field whose type is "lowercase", which is just a
KeywordTokenizer followed by a LowercaseFilter. With this definition,
duplicate IDs are not detected.

I'm guessing (I'm at a client and can't dig too much this morning) that the
check for whether to replace a document is happening _before_ the id goes
through the analysis chain, which is a surprise to me.

So if the ID contains upper-case letters, it is not replaced and we have N
live docs with the same ID.
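
To make the guess concrete, here is a toy Python model. It is not Solr code:
ToyIndex, analyze, and demo are hypothetical names, and the sketch only
assumes that the term used to find the older document is the raw id, while
the term stored in the index has gone through the analysis chain.

```python
def analyze(raw_id):
    # Stand-in for KeywordTokenizer + LowercaseFilter: one token, lowercased.
    return raw_id.lower()


class ToyIndex:
    """Hypothetical model of replace-by-id; not how Solr is implemented."""

    def __init__(self, analyze_before_check):
        self.analyze_before_check = analyze_before_check
        self.docs = []  # live (indexed_term, doc) pairs

    def add(self, doc):
        raw = doc["id"]
        # Term used to look up the older version of the document.
        lookup = analyze(raw) if self.analyze_before_check else raw
        # Drop any live doc whose indexed term equals the lookup term.
        self.docs = [(t, d) for (t, d) in self.docs if t != lookup]
        # The stored term always goes through the analysis chain.
        self.docs.append((analyze(raw), doc))

    def live_count(self, raw_id):
        term = analyze(raw_id)
        return sum(1 for (t, _) in self.docs if t == term)


def demo(analyze_before_check, raw_id="ABC-1", times=3):
    # Add the same id several times and count the surviving live docs.
    index = ToyIndex(analyze_before_check)
    for _ in range(times):
        index.add({"id": raw_id})
    return index.live_count(raw_id)
```

Under that assumption, demo(False) leaves three live docs for "ABC-1",
demo(True) leaves one, and an all-lowercase id such as "abc-1" is replaced
correctly either way.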

I'd argue this is a case that should be supported on the basis of my "rule
of thumb" that anything a human might enter should at least not be
case-sensitive on search. Since the <uniqueKey> is very often something
like a catalog number or similar, at least lowercasing should be supported.

Of course what that means if/when an analysis chain is more complex
is...er...interesting.
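
One way it could get interesting, sketched in Python under an assumed chain
(whitespace tokenizing plus lowercasing; analyze_multi and shared_tokens are
hypothetical helpers, not Solr APIs): once an id yields more than one token,
two different ids can share a token, so "same id" is no longer a single-term
comparison.

```python
def analyze_multi(raw_id):
    # Hypothetical chain: split on whitespace, then lowercase each token.
    return [tok.lower() for tok in raw_id.split()]


def shared_tokens(id_a, id_b):
    # Tokens that both ids produce after analysis.
    return sorted(set(analyze_multi(id_a)) & set(analyze_multi(id_b)))
```

Here "Catalog 42" and "catalog 43" are clearly different ids, yet they share
the token "catalog", so a replace check keyed on any single token could match
the wrong document.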

The question, of course, is whether this is expected behavior and I have to,
you know, remember it, or whether I should file a JIRA.

Thanks!
Erick
