Re: Decision on Number of shards and collection

Denis Bazhenov Wed, 11 Apr 2018 18:12:33 -0700

Hello.

The answer will depend on the context of the system. I'll give you my point of 
view from the perspective of developing and supporting medium- to large-scale 
search systems (400M documents, 40 shards, about 20 collections, 30-40+ physical 
servers).

Basically, I'd recommend:

1. If there are distinct sets of documents that will never be queried together 
(whether heterogeneous or not), always split them into different collections. 
This will reduce search time, index time and so forth. It may be a little more 
complex from an operational point of view, though.

2. With sharding (splitting a logically homogeneous index into parts) I have no 
easy answer, but it's kind of the opposite... Basically, an inverted index is a 
very efficient data structure, and the most efficient implementation of a search 
system (in terms of CPU time spent per query) is a single-index system (no 
sharding). Sadly, such a system will suffer from low search throughput. When 
you split the index you increase search throughput, but you also increase the 
cost of processing a single query. The more hardware you have, the more 
important efficiency will be.
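
To make the trade-off in point 2 concrete, here is a toy cost model in Python. 
It is only a sketch: the function name, the fixed per-shard overhead and the 
linear per-document cost are assumptions made up for illustration, not 
measurements of Solr or Lucene.

```python
def query_cost(total_docs_m, shards, overhead_ms=2.0, ms_per_million_docs=0.5):
    """Return (latency_ms, cpu_ms) for one query under a toy cost model.

    Assumptions (hypothetical, for illustration only):
    - every shard pays a fixed overhead per query (overhead_ms);
    - the remaining cost grows linearly with the documents in the shard;
    - all shards are searched in parallel and the index is split evenly.
    """
    per_shard_ms = overhead_ms + ms_per_million_docs * total_docs_m / shards
    latency_ms = per_shard_ms        # shards work in parallel
    cpu_ms = per_shard_ms * shards   # but every shard burns CPU
    return latency_ms, cpu_ms


if __name__ == "__main__":
    for n in (1, 10, 40):
        latency, cpu = query_cost(400, n)
        print(f"{n:>2} shards: latency {latency:.1f} ms, CPU {cpu:.1f} ms")
```

Under these assumed numbers each query finishes sooner as shards are added, 
while the total CPU spent per query goes up, because the fixed overhead is 
paid once per shard.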

This implies that if you have room to increase search throughput using replicas 
instead of sharding, you should do it. It's the more efficient and simpler way, 
but only if:

1. the index is small enough to fit inside the RAM of a single box;
2. your search queries/algorithms are efficient enough in terms of GC pressure 
for one box to handle a reasonable number of requests.
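
The two conditions above can be written down as a small check. This is a 
minimal sketch in Python, not a sizing tool: the function name, its parameters 
and the min_qps_per_box threshold are hypothetical, and qps_per_box is assumed 
to come from your own load test of a single box holding the whole index.

```python
import math


def scaling_advice(index_gb, usable_ram_gb, qps_per_box, target_qps,
                   min_qps_per_box=10.0):
    """Return ("replicate", replica_count) or ("shard", None).

    Condition 1: the whole index fits in the RAM of one box.
    Condition 2: one box sustains a reasonable request rate
                 (min_qps_per_box is an arbitrary, assumed threshold).
    """
    fits_in_ram = index_gb <= usable_ram_gb
    box_copes = qps_per_box >= min_qps_per_box
    if fits_in_ram and box_copes:
        # Scale throughput with full copies of the single index.
        return ("replicate", math.ceil(target_qps / qps_per_box))
    # Otherwise the index has to be split; how many shards is a
    # separate benchmarking question this sketch does not answer.
    return ("shard", None)


if __name__ == "__main__":
    print(scaling_advice(index_gb=20, usable_ram_gb=64,
                         qps_per_box=50, target_qps=400))
    print(scaling_advice(index_gb=200, usable_ram_gb=64,
                         qps_per_box=50, target_qps=400))
```

The first call satisfies both conditions and suggests replicas; the second 
fails the RAM condition and falls back to sharding.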

> On Apr 11, 2018, at 20:16, neotorand <neotor...@gmail.com> wrote:
> 
> Hi Team 
> First of all I take this opportunity to thank you all for creating a
> beautiful place where people can explore, learn and debate. 
> 
> I have been on my knees for a couple of days trying to decide on this. 
> 
> When I am creating a SolrCloud ecosystem I need to decide on the number of
> shards and collections. 
> What are the best practices for making these decisions? 
> 
> I believe heterogeneous data can be indexed into the same collection, and I
> can have multiple shards for the index to be partitioned. So what's the need
> for a second collection? Yes, when the collection size grows I should look at
> more collections. What exactly is that size? What KPI drives the decision to
> have more collections? Any pointers or links to best practices? 
> 
> When should I go for multiple shards? 
> Yes, when the shard size grows, right? What's the size, and how do I benchmark? 
> 
> I am sorry if my question has already been asked, but I googled the whole
> ecospace: Quora, Stack Overflow, Lucid. 
> 
> Regards 
> Neo
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

---
Denis Bazhenov <dot...@gmail.com>

