Any hints?
On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote:
> Hi,
>
> *SCENARIO*
>
> Our Elasticsearch database has ~2.5 million entries. Each entry has the
> three analyzed fields "match", "sec_match" and "thi_match" (each contains
> 3-20 words), which are used in this query:
> https://gist.github.com/anonymous/a8d1142512e5625e4e91
>
> ES runs on two *types of servers*:
> (1) Real servers (the system has direct access to physical CPUs, no
> virtualization) of the newest generation - very performant!
> (2) Cloud servers with virtualized CPUs - weak CPUs, but that is typical
> for cloud services.
>
> See https://gist.github.com/anonymous/3098b142c2bab51feecc for the CPU
> details of (1) and (2).
>
> *ES settings:*
> ES version 1.2.0 (jdk1.8.0_05)
> ES_HEAP_SIZE = 512m (we also tested with 1024m, with the same results)
> vm.max_map_count = 262144
> ulimit -n 64000
> ulimit -l unlimited
> index.number_of_shards: 1
> index.number_of_replicas: 0
> index.store.type: mmapfs
> threadpool.search.type: fixed
> threadpool.search.size: 75
> threadpool.search.queue_size: 5000
>
> *Infrastructure*:
> As you can see above, we don't use the cluster feature of ES (1 shard, 0
> replicas). The reason is that our hosting infrastructure is spread across
> different providers.
> Upside: we aren't dependent on a single hosting provider. Downside: our
> servers aren't in the same LAN.
>
> This means:
> - We cannot use ES sharding, because synchronisation via WAN (internet)
> doesn't seem to be a useful solution.
> - So every ES server holds the complete dataset, and we configured only
> one shard and no replicas for higher performance.
> - We have a distribution process that updates the ES data on every host
> frequently. This process is fine for us, because updates aren't very
> frequent and perfect just-in-time ES synchronisation isn't necessary for
> our business case.
> - If a server goes down/crashes, the central load balancer removes it
> (the resulting minimal packet loss is acceptable).
>
> *PROBLEM*
>
> For long query terms (6 or more keywords), we see very high CPU load,
> even on the high-performance server (1), and this leads to high response
> times: 1-4 sec on server (1), 8-20 sec on server (2). The system
> parameters while querying:
> - Very high load (usually 100%) on the CPU running the responsible search
> thread (the other CPUs are idle in our test scenario)
> - No I/O load (the hard disks are fine)
> - No RAM bottlenecks
>
> So we think the file caching is working fine, because we have no I/O
> problems, and the garbage collector seems to be happy (jstat shows very
> few GCs). The CPU is the problem, and the ES hot threads output points to
> the Scorer module:
> https://gist.github.com/anonymous/9cecfd512cb533114b7d
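>
> (For reference: the hot threads dump can be reproduced with the standard
> nodes hot threads API; a minimal example, assuming a local node on the
> default port:
>
>     curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3'
>
> Host, port and the number of reported threads are only placeholders.)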
>
> *SUMMARY/ASSUMPTIONS*
>
> - Our database isn't very big and the query isn't very complex.
> - ES is designed for huge amounts of data, but the key is
> clustering/sharding: distributing the data to many servers means smaller
> indices, and smaller indices lead to lower CPU load and shorter response
> times.
> - So our database isn't big, but it is too big for a single CPU, which
> means that especially low-performance (virtual) CPUs can only be used in
> sharded environments.
>
> If we don't want to lose the provider independence, we only have the
> following two options:
>
> 1) A simpler query (I think this is not possible in our case)
> 2) A smaller database
>
> *QUESTIONS*
>
> Are our assumptions correct? Especially:
>
> - Is clustering/sharding (i.e. small indices) the main key to
> performance, that is, the only way to prevent overloaded (virtual) CPUs?
> - Is it right that clustering is only useful/possible in LANs?
> - Do you have any ES configuration or architecture hints regarding our
> preference for using multiple hosting providers?
>
> Thank you. Rgds
> Fin
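
To make the sharding question above more concrete: clustering/sharding here
means splitting the index across several shards (and thus servers) at index
creation time, for example (index name and shard count are only an
illustration, not the real setup):

    curl -XPUT 'http://localhost:9200/example_index' -d '{
      "settings": {
        "number_of_shards": 4,
        "number_of_replicas": 0
      }
    }'

Each shard then holds only a fraction of the ~2.5 million entries, which is
the assumption behind "smaller indices lead to lower CPU load" above.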
