Thanks for your answers. @Jörg: Of course sharding is primarly curcial for scalability, but it seems (and nobody disagreed in this posts) that big indices lead to high CPU load, so sharding is curcial for performance too.
@smonasco: - The query will be put together by an application that sets always a filter field. I would be suprised if ES doesn't efficiently ignore this part. - There is no sort field, so ES sorts by score. - Offsets etc. are default for alle analyzed fields. @Kireet: Hm, I think only one CPU was being used, but I'm not sure anymore, the test was a long time ago. As soon as I have time I will test with number of shards smaller than number of CPUs. On Monday, July 21, 2014 5:46:38 PM UTC+2, Kireet Reddy wrote: > > My working assumption had been that elasticsearch executes queries across > all shards in parallel and then merges the results. So maybe shards <= cpu > cores would help in this case where there is only one concurrent query. But > I have never tested this assumption, out of curiosity during the 20 shard > test did you still only see 1 cpu being used? Did you try 2 shards and get > the same results? > > On Jul 20, 2014, at 1:01 AM, 'Fin Sekun' via elasticsearch < > [email protected] <javascript:>> wrote: > > Hi Kireet, thanks for your answer and sorry for the late response. More > shards doesn't help. It will slow down the system because each shard takes > quite some overhead to maintain a Lucene index and, the smaller the shards, > the bigger the overhead. Having more shards enhances the indexing > performance and allows to distribute a big index across machines, but I > don't have a cluster with a lot of machines. I could observe this negative > effects while testing with 20 shards. > > It would be very cool if somebody could answer/comment to the question > summarized at the end of my post. Thanks again. > > > > > > On Friday, July 11, 2014 3:02:50 AM UTC+2, Kireet Reddy wrote: >> >> I would test using multiple primary shards on a single machine. Since >> your dataset seems to fit into RAM, this could help for these longer >> latency queries. >> >> On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote: >>> >>> Any hints? >>> >>> >>> >>> On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote: >>>> >>>> >>>> Hi, >>>> >>>> >>>> *SCENARIO* >>>> >>>> Our Elasticsearch database has ~2.5 million entries. Each entry has the >>>> three analyzed fields "match", "sec_match" and "thi_match" (all contains >>>> 3-20 words) that will be used in this query: >>>> https://gist.github.com/anonymous/a8d1142512e5625e4e91 >>>> >>>> >>>> ES runs on two *types of servers*: >>>> (1) Real servers (system has direct access to real CPUs, no >>>> virtualization) of newest generation - Very performant! >>>> (2) Cloud servers with virtualized CPUs - Poor CPUs, but this is >>>> generic for cloud services. >>>> >>>> See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and >>>> (2) CPU details. >>>> >>>> >>>> *ES settings:* >>>> ES version 1.2.0 (jdk1.8.0_05) >>>> ES_HEAP_SIZE = 512m (we also tested with 1024m with same results) >>>> vm.max_map_count = 262144 >>>> ulimit -n 64000 >>>> ulimit -l unlimited >>>> index.number_of_shards: 1 >>>> index.number_of_replicas: 0 >>>> index.store.type: mmapfs >>>> threadpool.search.type: fixed >>>> threadpool.search.size: 75 >>>> threadpool.search.queue_size: 5000 >>>> >>>> >>>> *Infrastructure*: >>>> As you can see above, we don't use the cluster feature of ES (1 shard, >>>> 0 replicas). The reason is that our hosting infrastructure is based on >>>> different providers. >>>> Upside: We aren't dependent on a single hosting provider. Downside: Our >>>> servers aren't in the same LAN. >>>> >>>> This means: >>>> - We cannot use ES sharding, because synchronisation via WAN (internet) >>>> seems not a useful solution. >>>> - So, every ES-server has the complete dataset and we configured only >>>> one shard and no replicas for higher performance. >>>> - We have a distribution process that updates the ES data on every host >>>> frequently. This process is fine for us, because updates aren't very often >>>> and perfect just-in-time ES synchronisation isn't necessary for our >>>> business case. >>>> - If a server goes down/crashs, the central loadbalancer removes it >>>> (the resulting minimal packet lost is acceptable). >>>> >>>> >>>> >>>> >>>> *PROBLEM* >>>> >>>> For long query terms (6 and more keywords), we have very high CPU >>>> loads, even on the high performance server (1), and this leads to high >>>> response times: 1-4sec on server (1), 8-20sec on server (2). The system >>>> parameters while querying: >>>> - Very high load (usually 100%) for the thread responsible CPU (the >>>> other CPUs are idle in our test scenario) >>>> - No I/O load (the harddisks are fine) >>>> - No RAM bottlenecks >>>> >>>> So, we think the file caching is working fine, because we have no I/O >>>> problems and the garbage collector seams to be happy (jstat shows very few >>>> GCs). The CPU is the problem, and ES hot-threads point to the Scorer >>>> module: >>>> https://gist.github.com/anonymous/9cecfd512cb533114b7d >>>> >>>> >>>> >>>> >>>> *SUMMARY/ASSUMPTIONS* >>>> >>>> - Our database size isn't very big and the query not very complex. >>>> - ES is designed for huge amount of data, but the key is >>>> clustering/sharding: Data distribution to many servers means smaller >>>> indices, smaller indices leads to fewer CPU load and short response times. >>>> - So, our database isn't big, but to big for a single CPU and this >>>> means especially low performance (virtual) CPUs can only be used in >>>> sharding environments. >>>> >>>> If we don't want to lost the provider independency, we have only the >>>> following two options: >>>> >>>> 1) Simpler query (I think not possible in our case) >>>> 2) Smaller database >>>> >>>> >>>> >>>> >>>> *QUESTIONS* >>>> >>>> Are our assumptions correct? Especially: >>>> >>>> - Is clustering/sharding (also small indices) the main key to >>>> performance, that means the only possibility to prevent overloaded >>>> (virtual) CPUs? >>>> - Is it right that clustering is only useful/possible in LANs? >>>> - Do you have any ES configuration or architecture hints regarding our >>>> preference for using multiple hosting providers? >>>> >>>> >>>> >>>> Thank you. Rgds >>>> Fin >>>> >>>> > -- > You received this message because you are subscribed to a topic in the > Google Groups "elasticsearch" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/elasticsearch/TpEboTt_8FY/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > [email protected] <javascript:>. > To view this discussion on the web visit > https://groups.google.com/d/msgid/elasticsearch/7e1b3a52-23b4-4a22-8433-985e07ae7904%40googlegroups.com > > <https://groups.google.com/d/msgid/elasticsearch/7e1b3a52-23b4-4a22-8433-985e07ae7904%40googlegroups.com?utm_medium=email&utm_source=footer> > . > For more options, visit https://groups.google.com/d/optout. > > > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/fac66eb6-8b1f-4aaa-848b-fe54654dfdc5%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
