Re: Are docs updated based on comparing the id before analysis?

Erick Erickson Thu, 05 Feb 2015 09:58:44 -0800

Shawn:

Thanks for confirming I'm not completely crazy.


I don't think it's A Good Thing to _require_ that all ID normalization be
done on the client. It would have to be done at both index and query time,
which leaves too much chance for things to get out of sync. Although I guess
this is _actually_ what happens with the string type. Hmmmm. So I'm -1 on
your point 2), as it would require this.

And having <uniqueKey>s that are text fields _is_ fraught with danger if
you tokenize them, but KeywordTokenizer doesn't split its input. In this
particular case, the following works, but only because this data happens to
have all the alpha characters uppercase at index time:

<fieldType name="special_id" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>

or even, with a single analyzer used at both index and query time:

<fieldType name="special_id" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
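
To illustrate why the index-time casing matters, here is a toy Python model
of the behavior discussed in this thread. It assumes the overwrite check
compares the ID exactly as the client sent it against the terms already in
the index, while the term that gets stored is whatever the index analyzer
emits. The names (add_document, keyword_lowercase, keyword_only) are made up
for the sketch and are not Solr APIs; this is not how Solr is implemented,
only a picture of the mismatch.

```python
def add_document(index, raw_id, index_analyzer):
    """Toy model: overwrite by the raw ID, then store the analyzed ID."""
    # Assumption: the "delete the old version" step uses the ID as sent.
    index[:] = [term for term in index if term != raw_id]
    # The stored term has been through the index-time analysis chain.
    index.append(index_analyzer(raw_id))


def keyword_lowercase(value):
    # Stand-in for KeywordTokenizer + LowerCaseFilter: one token, lowercased.
    return value.lower()


def keyword_only(value):
    # Stand-in for KeywordTokenizer alone: one token, unchanged.
    return value


# Analysis changes the ID, so the raw ID never matches the stored term
# and the "same" document ends up in the index twice.
lowercased = []
add_document(lowercased, "ABC123", keyword_lowercase)
add_document(lowercased, "ABC123", keyword_lowercase)

# Analysis leaves the ID alone, so the second add replaces the first.
verbatim = []
add_document(verbatim, "ABC123", keyword_only)
add_document(verbatim, "ABC123", keyword_only)
```

In this model the configs above only behave because the incoming IDs are
already uppercase, so the stored term and the raw ID happen to be identical.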

Personally I feel like this is a JIRA, but I can see arguments the other
way as I'm not entirely sure what you'd do if multiple tokens came out of
the analysis chain. Maybe fail the document at index time?
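
A rough Python sketch of what "fail the document at index time" could mean,
purely as an illustration. The function and analyzer names are hypothetical
and the analyzers are plain callables returning a list of tokens, not
anything from Solr or Lucene.

```python
def unique_key_term(raw_id, analyzer):
    """Return the single term for an ID, or reject the document."""
    tokens = analyzer(raw_id)
    if len(tokens) != 1:
        # More (or fewer) than one token means the key is ambiguous.
        raise ValueError(
            "uniqueKey %r analyzed to %d tokens, expected exactly 1"
            % (raw_id, len(tokens))
        )
    return tokens[0]


def keyword_upper(value):
    # Stand-in for KeywordTokenizer + UpperCaseFilter: always one token.
    return [value.upper()]


def whitespace_upper(value):
    # Stand-in for a tokenizing chain: splits on whitespace.
    return [token.upper() for token in value.split()]


ok_term = unique_key_term("abc 123", keyword_upper)

try:
    unique_key_term("abc 123", whitespace_upper)
    rejected = False
except ValueError:
    rejected = True
```

The check is per document, so a chain that usually emits one token would
still be caught on the first ID that splits.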

What _is_ unreasonable IMO is that we allow this surprising behavior, so
regardless of the above I'm +1 on anything that keeps users from being
caught out by it.

Thanks!
Erick


On Thu, Feb 5, 2015 at 11:42 AM, Shawn Heisey <[email protected]> wrote:

> On 2/5/2015 6:40 AM, Erick Erickson wrote:
> > And is this intended behavior?
> >
> > Either this is something we need to document better (or I've just
> > missed it) or I'll file a JIRA.
> >
> > I have a <uniqueKey> defined as "lowercase", which is just a
> > KeywordTokenizer followed by a LowercaseFilter. This definition does
> > not detect duplicate IDs.
>
> I was using this exact fieldType as my uniqueKey for quite a while.  I
> never had a problem with it, but I read something saying that using a
> TextField type for a uniqueKey was a potential recipe for disaster, even
> if it would reliably produce a single token from the input, which that
> analysis chain does.  I changed it to StrField and reindexed based on that.
>
> For many reasons other than potential problems with Solr, it's a good
> idea to ensure the unique identifier field is completely normalized
> before it makes it into your source repository.
>
> It looks like you are correct about what happens with analysis on the
> uniqueKey field:
>
> https://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document
>
> IMHO a couple of things need to happen:
>
> 1) The documentation needs to be a lot clearer ... this needs mention in
> more places.  A note in various schema.xml examples would be excellent.
> The reference guide may not have this information ... I haven't been
> able to check thoroughly.
> 2) We should consider throwing a fatal error during core startup if the
> uniqueKey is potentially ambiguous.  For instance if it is a TextField,
> it might have analysis that will be ignored, so refusing to start the
> core will bring the administrator's attention to a configuration mistake
> that can lead to unexpected behavior.  Is a Trie type with a nonzero
> precisionStep OK?  Internally that will produce multiple tokens, so I'm
> not sure.
>
> Thanks,
> Shawn
