Hi all,

At bol.com we use the ELK stack for a log-search platform, running on 3 machines.

We need fast indexing (so we do not lose events) and we want fast, near-realtime 
search. The search is currently not fast enough. A simple "give me the last 
50 events from the last 15 minutes, from any type, from today's indices, 
without any terms" query (sketched below) can take 1.0 second, and sometimes 
even exceeds 30 seconds.
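
For reference, that query looks roughly like the sketch below when written in 
the 1.0 query DSL. This is a simplification of what Kibana sends, not our 
literal check; the host, index name and @timestamp field are placeholders for 
our setup, and it assumes the Python requests library:

    # Rough sketch of the "last 50 events, last 15 minutes" query (ES 1.0 DSL).
    # Host, index name and timestamp field are placeholders for our setup.
    import json
    import requests

    query = {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"range": {"@timestamp": {"gte": "now-15m"}}},
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 50,
    }

    r = requests.post("http://es-node1:9200/logstash-2014.04.01/_search",
                      data=json.dumps(query))
    print("took %d ms, %d hits" % (r.json()["took"], r.json()["hits"]["total"]))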

We currently index about 3k docs per second, and we expect 8k/sec by the end 
of this year.

I have included lots of specs/config at the bottom of this e-mail.


We found 2 reliable knobs to turn:

   1. index.refresh_interval. At 1 sec, fast search seems impossible. When 
   raising the refresh interval to 5 sec, search gets faster; at 10 sec it is 
   faster still. But when you search during the refresh (wouldn't a splay be 
   nice?) it is slow again, and a refresh every 10 seconds is not near 
   realtime anymore. No obvious bottleneck is present: CPU, network, memory 
   and disk I/O all look OK. (A sketch of changing this setting on live 
   indices follows after this list.)
   2. deleting old indices. We have no clue why this improves things. And we 
   really do not want to delete old data, since we want to keep at least 60 
   days of data online. But after deleting old data, the search speed slowly 
   crawls back up again...
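
For anyone wanting to reproduce the refresh experiment: refresh_interval can 
be changed on live indices through the update-settings API, so no reindex or 
restart is needed. A minimal sketch, again assuming Python with the requests 
library and placeholder host/index names:

    # Sketch: raise the refresh interval on one of today's indices at runtime.
    # Host and index name are placeholders for our setup.
    import json
    import requests

    r = requests.put("http://es-node1:9200/logstash-2014.04.01/_settings",
                     data=json.dumps({"index": {"refresh_interval": "5s"}}))
    print(r.json())  # expect {"acknowledged": true}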


We have zillions of metrics ("measure everything") on the OS, ES and the JVM, 
collected with Diamond and Graphite. Too much to include here.
We use a Nagios check that simulates Kibana queries to monitor the search 
speed every 5 minutes (a sketch of that check is below).
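
In case it is useful, the check boils down to something like this. This is a 
simplified sketch, not our actual plugin; the host, index and thresholds are 
placeholders:

    #!/usr/bin/env python
    # Simplified sketch of the Nagios check: time one Kibana-style query and
    # map the latency onto Nagios exit codes. Host, index and thresholds are
    # placeholders, not our real values.
    import json
    import sys
    import time
    import requests

    query = {"query": {"filtered": {
                 "query": {"match_all": {}},
                 "filter": {"range": {"@timestamp": {"gte": "now-15m"}}}}},
             "sort": [{"@timestamp": {"order": "desc"}}],
             "size": 50}

    start = time.time()
    r = requests.post("http://es-node1:9200/logstash-2014.04.01/_search",
                      data=json.dumps(query))
    elapsed = time.time() - start

    if r.status_code != 200 or elapsed > 5.0:
        print("CRITICAL: search took %.2fs" % elapsed)
        sys.exit(2)
    elif elapsed > 1.0:
        print("WARNING: search took %.2fs" % elapsed)
        sys.exit(1)
    print("OK: search took %.2fs" % elapsed)
    sys.exit(0)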


When comparing behaviour at refresh_interval 1s vs 5s we see:

   - system CPU load (%): differs per server: 150 vs 80, 100 vs 50, 40 vs 25 
   == lower
   - ParNew GC run frequency: 1 vs 0.6 (per second) == less
   - CMS GC run frequency: 1 vs 4 (per hour) == more
   - avg index time: 8 vs 2.5 (ms) == lower
   - refresh frequency: 22 vs 12 (per second) -- still high numbers at 5 
   sec because we have 17 active indices every day == less (see the stats 
   sketch after this list for how we derive these rates)
   - merge frequency: 12 vs 7 (per second) == less
   - flush frequency: no difference
   - search speed: at 1s way too slow; at 5s (with tests timed between the 
   refresh bursts) search calls take ~50ms.
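
The refresh and merge frequencies above come from sampling the cumulative 
counters in the indices stats API (Diamond does that for us); a hand-rolled 
sketch of the same idea, with a placeholder host:

    # Sketch: derive cluster-wide refresh and merge rates by sampling the
    # cumulative counters in /_stats twice. Host is a placeholder.
    import time
    import requests

    def counters(host="http://es-node1:9200"):
        total = requests.get(host + "/_stats").json()["_all"]["total"]
        return total["refresh"]["total"], total["merges"]["total"]

    r1, m1 = counters()
    time.sleep(60)
    r2, m2 = counters()
    print("refreshes/sec: %.1f, merges/sec: %.1f"
          % ((r2 - r1) / 60.0, (m2 - m1) / 60.0))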


We already looked at the thread pools (see the _cat sketch after this list):

   - we increased the bulk pool
   - we currently do not have any rejections in any pool
   - the only pool that shows queueing (a spike every 1 or 2 hours) is the 
   'management' pool (but that is probably Diamond)
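
The check itself is nothing more than the _cat thread pool endpoint; a sketch 
with a placeholder host (the column names are what I believe the 1.0 _cat API 
accepts):

    # Sketch: dump the bulk/search queue and rejection counters per node.
    # Host is a placeholder; column names assumed from the 1.0 _cat docs.
    import requests

    cols = "host,bulk.active,bulk.queue,bulk.rejected,search.queue,search.rejected"
    print(requests.get("http://es-node1:9200/_cat/thread_pool?v&h=" + cols).text)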


We have a feeling something blocks/locks under high indexing and high search 
load. But what? I have looked at nearly all metrics and _cat output.


Our current list of untested/wild ideas:

   - Is setting index.codec.bloom.load=false on yesterday's indices really 
   the magic bullet? We haven't tried it yet (see the sketch after this list).
   - Adding a 2nd JVM per machine is an option, but as long as we do not 
   know the real cause it is not a real option (yet).
   - Lowering the heap from 48GB to 30GB, so the JVM can use compressed oops 
   and we avoid the 64-bit pointer overhead.
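
If the bloom idea turns out to be worth it, applying it would look something 
like this. A sketch only: we have not tested it, the host/index are 
placeholders, and I am assuming index.codec.bloom.load can be changed through 
the update-settings API on 1.0.1:

    # Sketch (untested by us): stop loading the bloom filter of an index that
    # is no longer being written to. Host and index name are placeholders.
    import json
    import requests

    r = requests.put("http://es-node1:9200/logstash-2014.03.31/_settings",
                     data=json.dumps({"index.codec.bloom.load": False}))
    print(r.json())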


What knobs do you suggest we start turning?

Any help is much appreciated!


A little present from me in return: I suggest you read 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
and decide whether you need dynamic scripting enabled (the default in this 
version; it can be switched off with script.disable_dynamic: true), as it 
allows remote code execution via the REST API. Credits go to Byron at Trifork!



More details:

Versions:

   - ES 1.0.1 on: java version "1.7.0_17", Java(TM) SE Runtime Environment 
   (build 1.7.0_17-b02), Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, 
   mixed mode)
   - Logstash 1.1.13 (with a backported elasticsearch_http plugin, for 
   idle_flush_time support)
   - Kibana 2
   

Setup:

   - we use several types of shippers/feeders, all sending logging to a set 
   of redis servers (the log4j and accesslog shippers/feeders use the logstash 
   json format to avoid grokking at logstash side)
   - several logstash instances consume the redis list, process and store 
   in ES using the bulk API (we use bulk because we dislike the version lockin 
   using the native transport)
   - we use bulk async (we thought it would speed up indexing, which it 
   doesn't)
   - we use bulk batch size of 1000 and idle flush of 1.0 second
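
For reference, each batch is the usual newline-delimited action/source format 
of the bulk API; a minimal sketch, where the host, index, type and fields are 
placeholders and the real batches hold 1000 events rather than 2:

    # Sketch: the newline-delimited bulk format that the elasticsearch_http
    # output sends. Host, index, type and fields are placeholders; our real
    # batches contain 1000 events instead of 2.
    import json
    import requests

    events = [
        {"@timestamp": "2014-04-01T12:00:00Z", "message": "hello"},
        {"@timestamp": "2014-04-01T12:00:01Z", "message": "world"},
    ]

    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": "logstash-2014.04.01",
                                           "_type": "syslog"}}))
        lines.append(json.dumps(event))
    body = "\n".join(lines) + "\n"   # a bulk body must end with a newline

    r = requests.post("http://es-node1:9200/_bulk", data=body)
    print("indexed %d docs in %d ms" % (len(events), r.json()["took"]))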
   

Hardware for ES:

   - 3x HP 360G8, 24 cores each
   - each machine has 256GB RAM (1 ES JVM running per machine with a 48GB 
   heap, so lots of free RAM left for the filesystem cache)
   - each machine has 8x 1TB SAS disks (1 for the OS and 7 as separate data 
   disks listed in ES' -Des.path.data=....)
   

Logstash integration:

   - using the bulk API, to avoid the version lock-in (maybe slower, which 
   we can fix by scaling out / adding more logstash instances)
   - 17 new indices every day (e.g. syslog, access logging, log4j + 
   stacktraces)
   

ES configuration (see the index-template sketch after this list):

   - ES_HEAP_SIZE: 48gb
   - index.number_of_shards: 5
   - index.number_of_replicas: 1
   - index.refresh_interval: 1s
   - index.store.compress.stored: true
   - index.translog.flush_threshold_ops: 50000
   - indices.memory.index_buffer_size: 50%
   - default index mapping
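
The index-level settings above can be pushed to every new daily index with an 
index template; a trimmed sketch of what that looks like (template name and 
pattern are placeholders, this mirrors the list rather than our literal 
config, and the node-level indices.memory.index_buffer_size stays in 
elasticsearch.yml):

    # Sketch: the index-level settings from the list above expressed as an
    # index template so each new daily index picks them up. Template name and
    # pattern are placeholders.
    import json
    import requests

    template = {
        "template": "logstash-*",
        "settings": {
            "index.number_of_shards": 5,
            "index.number_of_replicas": 1,
            "index.refresh_interval": "1s",
            "index.store.compress.stored": True,
            "index.translog.flush_threshold_ops": 50000,
        },
    }
    r = requests.put("http://es-node1:9200/_template/logstash",
                     data=json.dumps(template))
    print(r.json())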


Regards,
Renzo Toma
Bol.com


p.s. we are hiring! :-)

