Hmmm, driving away from my client, I got to wondering about routing in
SolrCloud. You'd have to apply the analysis chain _before_ you routed
on ID, and I have no clue what would happen with things like the !
operator in the id field.
So I think this is a documentation issue. I wrote a small program (see
below) that produces fantastic results. It creates a <uniqueKey> from
the letters "abcd" and randomly uppercases each letter. I tried this
on a 4-shard setup (trunk). The "id" field type uses a KeywordTokenizer
followed by an UpperCaseFilter (I assume LowerCase would have the same problem).
At the end of indexing 1,000 documents as above, the numDocs/maxDoc were:
shard1 - 316/316
shard2 - 5/320
shard3 - 297/297
shard4 - 67/67
This indicates that the routing is sensitive to case, which is not at
all surprising now that I've finally stopped and _thought_ about it.
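To make the case sensitivity concrete, here is a minimal sketch, not Solr's actual routing code. It hashes the raw ID string, before any analysis, and picks a shard from the hash. CRC32 and the modulo step are stand-ins I chose for illustration (as I understand it, the real router uses a different hash and assigns hash ranges to shards); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.CRC32;

public class RoutingCaseDemo {
    // Stand-in hash over the raw ID bytes (assumption: NOT the hash Solr uses).
    static long hash(String id) {
        CRC32 crc = new CRC32();
        crc.update(id.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // always in the range 0 .. 2^32-1
    }

    // Simplified shard choice by modulo (assumption: Solr uses hash ranges).
    static int shardFor(String id, int numShards) {
        return (int) (hash(id) % numShards);
    }

    // Every upper/lower-case variant of the given letters, e.g. 16 for "abcd".
    static List<String> caseVariants(String letters) {
        List<String> out = new ArrayList<>();
        for (int mask = 0; mask < (1 << letters.length()); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < letters.length(); i++) {
                char c = letters.charAt(i);
                sb.append((mask & (1 << i)) != 0 ? Character.toUpperCase(c) : c);
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> hashes = new HashSet<>();
        for (String id : caseVariants("abcd")) {
            hashes.add(hash(id));
            System.out.println(id + " -> shard " + shardFor(id, 4));
        }
        System.out.println(hashes.size() + " distinct hashes for one logical ID");
    }
}
```

Because the hash sees the raw bytes, every case variant of the same logical ID gets its own hash and can land on a different shard, regardless of what the field's analysis chain does afterwards.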
So to handle my "rule of thumb", which is that anything that a human
could possibly enter should _not_ be case sensitive, the <uniqueKey>
field needs to be
1> normalized as far as case is concerned at index time
2> have a query-time transformation done to match <1>.
So something like this should do it, assuming that the indexer took
care to uppercase the <uniqueKey>:
<fieldType name="eoe_test" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
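For the indexer side of that assumption, a minimal sketch of normalizing the ID on the client before the document is sent, so that routing and the index both see a single form. The class and method names are hypothetical, and a plain map stands in for the index here so the snippet runs without SolrJ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class IdNormalizer {
    // Hypothetical helper: uppercase with a fixed locale so the result does
    // not depend on the JVM's default locale.
    static String normalize(String rawId) {
        return rawId.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Stand-in for an index keyed on <uniqueKey>: a later put with the
        // same key overwrites the earlier one.
        Map<String, String> docsById = new LinkedHashMap<>();
        for (String raw : new String[] {"abcd", "AbCd", "ABCD", "aBcD"}) {
            docsById.put(normalize(raw), raw);
        }
        System.out.println(docsById.keySet()); // prints [ABCD]
    }
}
```

In the test program this would amount to passing normalize(sb.toString()) to doc.addField("id", ...). Uppercasing leaves a character such as "!" unchanged, though I have not checked how this interacts with composite IDs.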
FWIW......
*****************
package problem;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Test {
  private CloudSolrClient _server;
  private long _start = System.currentTimeMillis();
  private int _total = 0;

  public static void main(String[] args) {
    try {
      Test idxer = new Test("localhost:2181");
      idxer.doIt();
      idxer.finish();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public Test(String zkUrl) throws IOException, SolrServerException {
    _server = new CloudSolrClient(zkUrl);
    _server.setDefaultCollection("eoe");
  }

  private void finish() throws IOException, SolrServerException {
    _server.commit();
  }

  Random rand = new Random();

  // Index 1,000 docs whose ids are "abcd" with each letter randomly
  // upper- or lowercased, so there are only 16 distinct raw ids.
  private void doIt() throws IOException, SolrServerException {
    List<SolrInputDocument> list = new ArrayList<>(1000);
    for (int idx = 0; idx < 1000; ++idx) {
      SolrInputDocument doc = new SolrInputDocument();
      StringBuilder sb = new StringBuilder();
      addOne("a", sb);
      addOne("b", sb);
      addOne("c", sb);
      addOne("d", sb);
      doc.addField("id", sb.toString());
      list.add(doc);
    }
    _server.add(list);
  }

  // Append str as-is or uppercased, with equal probability.
  void addOne(String str, StringBuilder sb) {
    if (rand.nextBoolean()) {
      sb.append(str);
      return;
    }
    sb.append(str.toUpperCase());
  }
}

On Thu, Feb 5, 2015 at 1:21 PM, Shawn Heisey <[email protected]> wrote:
> On 2/5/2015 10:57 AM, Erick Erickson wrote:
>> Thanks for confirming I'm not completely crazy.
>>
>> I don't think it's A Good Thing to _require_ that all ID normalization
>> be done on the client, it'd have to be done both at index and query
>> time, too much chance for things to get out of sync. Although I guess
>> this is _actually_ what happens with the string type. Hmmmm. So I'm
>> -1 on <2> above as it would require this.
>>
>> And having <uniqueKey>s that are text fields _is_ fraught with danger
>> if you tokenize it, but KeywordTokenizer doesn't.
>
> <snip>
>
>> Personally I feel like this is a JIRA, but I can see arguments the
>> other way as I'm not entirely sure what you'd do if multiple tokens
>> came out of the analysis chain. Maybe fail the document at index time?
>>
>> What _is_ unreasonable IMO is that we allow this surprising behavior,
>> so regardless of the above I'm +1 on keeping users from being
>> surprised by this behavior....
>
> My earlier statements were written with the assumption that the current
> behavior exists because it is difficult to allow the desired behavior.
> I believe that if it were easy to do, it would have already been done.
>
> If it's possible to allow what we both think is rational user
> expectation (case-insensitive uniqueKey values), I agree that we need to
> allow it. Whether or not it's readily achievable is the question.
>
> Thanks,
> Shawn
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>