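I would test using multiple primary shards on a single machine. A search 
runs on one thread per shard, so with a single shard each query is limited 
to one CPU core; with several shards the same query can use several cores 
in parallel. Since your dataset seems to fit into RAM, this could help with 
these high-latency queries.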
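
As a rough sketch of what I mean (not tested against your setup): the shard 
count cannot be changed on an existing index, so you would create a new 
index with more primary shards and reindex into it. The URL, the index name 
"entries_sharded" and the shard count of 4 below are assumptions for 
illustration only; a count close to the number of cores is one starting 
point to try.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: single local ES node
INDEX_NAME = "entries_sharded"     # hypothetical name for the new index


def build_index_settings(num_shards, num_replicas=0):
    """Build the create-index body for the given number of primary shards."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,
                "number_of_replicas": num_replicas,
            }
        }
    }


def create_index(num_shards):
    """Send the create-index request; the data must be reindexed afterwards."""
    body = json.dumps(build_index_settings(num_shards)).encode("utf-8")
    req = urllib.request.Request(
        "%s/%s" % (ES_URL, INDEX_NAME),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only print the request body here; call create_index(4) against a test node.
    print(json.dumps(build_index_settings(4), indent=2))
```

After reindexing, run the same long queries against both indices and 
compare response times and per-core CPU usage.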

On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote:
>
> Any hints?
>
>
>
> On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote:
>>
>>
>> Hi,
>>
>>
>> *SCENARIO*
>>
>> Our Elasticsearch database has ~2.5 million entries. Each entry has the 
>> three analyzed fields "match", "sec_match" and "thi_match" (each 
>> containing 3-20 words) that are used in this query:
>> https://gist.github.com/anonymous/a8d1142512e5625e4e91
>>
>>
>> ES runs on two *types of servers*:
>> (1) Real servers (system has direct access to real CPUs, no 
>> virtualization) of the newest generation - Very performant!
>> (2) Cloud servers with virtualized CPUs - Poor CPUs, but this is typical 
>> of cloud services.
>>
>> See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and 
>> (2) CPU details.
>>
>>
>> *ES settings:*
>> ES version 1.2.0 (jdk1.8.0_05)
>> ES_HEAP_SIZE = 512m (we also tested with 1024m with same results)
>> vm.max_map_count = 262144
>> ulimit -n 64000
>> ulimit -l unlimited
>> index.number_of_shards: 1
>> index.number_of_replicas: 0
>> index.store.type: mmapfs
>> threadpool.search.type: fixed
>> threadpool.search.size: 75
>> threadpool.search.queue_size: 5000
>>
>>
>> *Infrastructure*:
>> As you can see above, we don't use the cluster feature of ES (1 shard, 0 
>> replicas). The reason is that our hosting infrastructure is based on 
>> different providers.
>> Upside: We aren't dependent on a single hosting provider. Downside: Our 
>> servers aren't in the same LAN.
>>
>> This means:
>> - We cannot use ES sharding, because synchronisation via WAN (internet) 
>> does not seem to be a useful solution.
>> - So, every ES-server has the complete dataset and we configured only one 
>> shard and no replicas for higher performance.
>> - We have a distribution process that updates the ES data on every host 
>> frequently. This process is fine for us, because updates aren't very 
>> frequent and perfect just-in-time ES synchronisation isn't necessary for 
>> our business case.
>> - If a server goes down/crashes, the central loadbalancer removes it (the 
>> resulting minimal packet loss is acceptable).
>>  
>>
>>
>>
>> *PROBLEM*
>>
>> For long queries (6 or more keywords), we have very high CPU loads, 
>> even on the high performance server (1), and this leads to high response 
>> times: 1-4sec on server (1), 8-20sec on server (2). The system parameters 
>> while querying:
>> - Very high load (usually 100%) on the CPU running the search thread (the 
>> other CPUs are idle in our test scenario)
>> - No I/O load (the hard disks are fine)
>> - No RAM bottlenecks
>>
>> So, we think the file caching is working fine, because we have no I/O 
>> problems and the garbage collector seems to be happy (jstat shows very few 
>> GCs). The CPU is the problem, and ES hot-threads point to the Scorer module:
>> https://gist.github.com/anonymous/9cecfd512cb533114b7d 
>>
>>
>>
>>
>> *SUMMARY/ASSUMPTIONS*
>>
>> - Our database isn't very big and the query is not very complex.
>> - ES is designed for huge amounts of data, but the key is 
>> clustering/sharding: Data distribution to many servers means smaller 
>> indices, and smaller indices lead to lower CPU load and shorter response 
>> times.
>> - So, our database isn't big, but too big for a single CPU, and this means 
>> especially low performance (virtual) CPUs can only be used in sharding 
>> environments.
>>
>> If we don't want to lose our provider independence, we have only the 
>> following two options:
>>
>> 1) Simpler query (I think not possible in our case)
>> 2) Smaller database
>>
>>
>>
>>
>> *QUESTIONS*
>>
>> Are our assumptions correct? Especially:
>>
>> - Is clustering/sharding (and thus small indices) the main key to 
>> performance, i.e. the only way to prevent overloaded (virtual) CPUs?
>> - Is it right that clustering is only useful/possible in LANs?
>> - Do you have any ES configuration or architecture hints regarding our 
>> preference for using multiple hosting providers?
>>
>>
>>
>> Thank you. Rgds
>> Fin
>>
>>
