On 18/11/14 14:24, Ferenczi, Jim | EURHQ wrote:
> Your third factoid: a high number of hits/shard suggests the possibility that
> all of the final top-1000 hits originate from a single shard.
> In fact, if you ask for 1000 hits in a distributed SolrCloud, each shard has to
> retrieve 1000 hits to get the unique key of each match and send it back to the
> shard responsible for the merge.
Yes, at least if each shard has 1000 hits. It is when each shard has a lot of actual hits that this "issue" becomes a problem.
> This means that even if your data is fairly distributed among the 1000 shards,
> they still have to decompress 1000 documents during the first phase of the
> search.
Exactly!
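The cost quoted above can be put into a back-of-envelope sketch. This is a simplified model of the two-phase search, not Solr's actual code; the function name is illustrative:

```python
# Back-of-envelope sketch of the two-phase distributed search cost described
# above. Simplified model; names are illustrative, not Solr APIs.

def stored_field_reads(num_shards: int, rows: int) -> tuple:
    # GET_TOP_IDS: every shard decompresses `rows` stored documents just to
    # read the unique key of each of its local top hits.
    get_top_ids = num_shards * rows
    # GET_FIELDS: only the merged, global top `rows` documents are fetched.
    get_fields = rows
    return get_top_ids, get_fields

print(stored_field_reads(1000, 1000))  # (1000000, 1000)
```

So with 1000 shards and rows=1000, the first phase touches a million stored documents while the second touches only a thousand.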
> There are ways to avoid this; for instance, you can check this JIRA where the
> idea is discussed:
> https://issues.apache.org/jira/browse/SOLR-5478
Guess our solution (described in my previous mail) is kind of an alternative to SOLR-5478.
> Bottom line is that if you have 1000 shards, the GET_FIELDS stage should be fast
> (if your data is fairly distributed) but the GET_TOP_IDS stage is not.
Exactly!
> You could avoid a lot of decompression/reads by using the field cache to
> retrieve the unique key in the first stage.
We have a lot of data and relatively little RAM, so we cannot use the field cache: it requires an amount of memory that grows linearly with the number of documents in the store, and we can never meet that requirement. Doc-values would be a valid approach for us, but our id field is unfortunately not a doc-values field at the moment, and it is not easy for us to simply re-index all documents with id as doc-values. Besides that, our solution is orthogonal to a field-cache/doc-values solution: one does not prevent the other, and if you implement one of them, you can still benefit from also implementing the other.
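For reference, declaring the id field with doc-values in Solr's schema.xml would look roughly like this (a sketch only; the exact field type depends on the schema, and, as noted above, it requires re-indexing all documents):

```xml
<!-- Sketch: unique key stored with docValues so it can be read in the first
     phase without decompressing stored fields. Requires a full re-index. -->
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
```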

> Cheers,
> Jim
Thanks, Jim
