Re: Are docs updated based on comparing the id before analysis?

Erick Erickson Thu, 05 Feb 2015 16:25:32 -0800

Hmmm, driving away from my client, I got to wondering about routing in
SolrCloud. You'd have to apply the analysis chain _before_ you routed
on ID, and I have no clue what would happen with things like the !
operator in the id field.


So I think this is a documentation issue. I wrote a small program (see
below) that produces fantastic results. It creates a <uniqueKey> from
the letters "abcd", randomly uppercasing each letter. I tried this
on a 4 shard setup (trunk). The "id" field type is a KeywordTokenizer
followed by an UpperCaseFilter (I assume LowerCaseFilter would have the
same problem).

At the end of indexing 1,000 documents as above, the numDocs/maxDoc were:
shard1 - 316/316
shard2 - 5/320
shard3 - 297/297
shard4 - 67/67

Which indicates that the routing is sensitive to case, which is not at
all surprising when I finally stopped and _thought_.
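
A toy sketch of the effect, for illustration only. It is _not_ Solr's
router: as far as I know the compositeId router uses MurmurHash3 over
hash ranges, whereas this stand-in uses CRC32 modulo the shard count.
The class and method names (RoutingSketch, shardFor, normalize) are
made up for this example. The point is just that the hash is computed
over the raw id, so case variants can land on different shards unless
they are normalized first:

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

class RoutingSketch {
  // Hypothetical stand-in for the router: hash the raw id bytes, pick a shard.
  // Assumption: CRC32 modulo numShards replaces Solr's real MurmurHash3 ranges.
  static int shardFor(String id, int numShards) {
    CRC32 crc = new CRC32();
    crc.update(id.getBytes(StandardCharsets.UTF_8));
    return (int) (crc.getValue() % numShards);
  }

  // Case normalization applied on the client, before the id is hashed.
  static String normalize(String id) {
    return id.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String[] variants = {"abcd", "Abcd", "aBcd", "abCd", "abcD", "ABCD"};
    for (String id : variants) {
      System.out.println(id + " raw -> shard " + shardFor(id, 4)
          + ", normalized -> shard " + shardFor(normalize(id), 4));
    }
  }
}
```

Whatever the raw ids do, the normalized column always shows a single
shard, since every variant hashes as "ABCD".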

So to satisfy my "rule of thumb", which is that anything a human
could possibly enter should _not_ be case sensitive, the <uniqueKey>
field needs to
1> be normalized for case by the indexer, before the document is sent
   (and therefore before it is routed)
2> have a query-time transformation done to match <1>. Something
   like this should do it, assuming that the indexer took care to
   uppercase the <uniqueKey>:
    <fieldType name="eoe_test" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.UpperCaseFilterFactory"/>
      </analyzer>
    </fieldType>
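
For the indexer side of <1>, a minimal sketch of what "took care to
uppercase" might look like. IdNormalizer and normalizeId are
hypothetical names, not anything in SolrJ. I'd only trust this to
match the query-side UpperCaseFilter for plain ASCII ids; anything
more exotic should be checked against the filter's actual behavior.

```java
import java.util.Locale;

class IdNormalizer {
  // Hypothetical helper: uppercase the id the same way on every client.
  // Locale.ROOT keeps the result independent of the JVM's default locale.
  static String normalizeId(String rawId) {
    return rawId.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(normalizeId("aBcD")); // prints ABCD
  }
}
```

In the test program below, that would mean something like
doc.addField("id", IdNormalizer.normalizeId(sb.toString())) in place
of the raw sb.toString().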


FWIW......

*****************

package problem;


import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;


public class Test {
  private CloudSolrClient _server;
  private long _start = System.currentTimeMillis();
  private int _total = 0;


  public static void main(String[] args) {
    try {
      Test idxer = new Test("localhost:2181");
      idxer.doIt();
      idxer.finish();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public Test(String zkUrl) throws IOException, SolrServerException {
    _server = new CloudSolrClient(zkUrl);
    _server.setDefaultCollection("eoe");
  }

  private void finish() throws IOException, SolrServerException {
    _server.commit();
  }

  // Source of randomness for choosing the case of each letter.
  private final Random rand = new Random();

  private void doIt() throws IOException, SolrServerException {
    List<SolrInputDocument> list = new ArrayList<>(1000);

    for (int idx = 0; idx < 1000; ++idx) {
      SolrInputDocument doc = new SolrInputDocument();

      // Build an id from the letters "abcd", each randomly upper- or lowercased.
      StringBuilder sb = new StringBuilder();
      addOne("a", sb);
      addOne("b", sb);
      addOne("c", sb);
      addOne("d", sb);

      doc.addField("id", sb.toString());
      list.add(doc);

    }
    _server.add(list);

  }

  // Appends str as-is or uppercased, each with probability 1/2.
  void addOne(String str, StringBuilder sb) {
    if (rand.nextBoolean()) {
      sb.append(str);
      return;
    }
    sb.append(str.toUpperCase());
  }
}

On Thu, Feb 5, 2015 at 1:21 PM, Shawn Heisey <[email protected]> wrote:
> On 2/5/2015 10:57 AM, Erick Erickson wrote:
>> Thanks for confirming I'm not completely crazy.
>>
>> I don't think it's A Good Thing to _require_ that all ID normalization
>> be done on the client, it'd have to be done both at index and query
>> time, too much chance for things to get out of sync. Although I guess
>> this is _actually_ what happens with the string type. Hmmmm.  So I'm
>> -1 on <2> above as it would require this.
>>
>> And having <uniqueKey>s that are text fields _is_ fraught with danger
>> if you tokenize it, but KeywordTokenizer doesn't.
>
> <snip>
>
>> Personally I feel like this is a JIRA, but I can see arguments the
>> other way as I'm not entirely sure what you'd do if multiple tokens
>> came out of the analysis chain. Maybe fail the document at index time?
>>
>> What _is_ unreasonable IMO is that we allow this surprising behavior,
>> so regardless of the above I'm +1 on keeping users from being
>> surprised by this behavior....
>
> My earlier statements were written with the assumption that the current
> behavior exists because it is difficult to allow the desired behavior.
> I believe that if it were easy to do, it would have already been done.
>
> If it's possible to allow what we both think is rational user
> expectation (case-insensitive uniqueKey values), I agree that we need to
> allow it.  Whether or not it's readily achievable is the question.
>
> Thanks,
> Shawn
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>