If you want unlimited retention, you're going to have to keep adding more nodes to the cluster to deal with it.
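For keeping an eye on how full each node is getting as retention (and data volume) grows, the cat API gives a quick per-node view. A minimal sketch, assuming a node listening on localhost:9200:

# Disk used and shard counts per node (cat API is available from ES 1.0).
curl -s 'http://localhost:9200/_cat/allocation?v'

# Size and document count per index, handy when deciding what to close or delete.
curl -s 'http://localhost:9200/_cat/indices?v'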
Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: [email protected]
web: www.campaignmonitor.com


On 17 April 2014 22:48, R. Toma <[email protected]> wrote:

> Hi Mark,
>
> Thank you for your comments.
>
> Regarding the monitoring: we use the Diamond ES collector, which saves metrics every 30 seconds in Graphite. ElasticHQ is nice, but it does its diagnostics calculations over the whole runtime of the cluster instead of the last X minutes. It does have nice diagnostic rules, though, so I created Graphite dashboards for them. Marvel is surely nice, but with the exception of Sense it does not offer me anything I do not already have with Graphite.
>
> New finds:
> * Setting index.codec.bloom.load=false on yesterday's/older indices frees up memory from the fielddata pool. This stays released even when searching.
> * Closing older indices speeds up indexing & refreshing.
>
> Regarding the closing benefit: the impact on refreshing is great! But from a functional point of view it's bad. I know about the 'overhead per index', but cannot find a solution to this.
>
> Does anyone know how to get an ELK stack with "unlimited" retention?
>
> Regards,
> Renzo
>
>
> On Wednesday, 16 April 2014 11:15:32 UTC+2, Mark Walkom wrote:
>>
>> Well, once you go over 31-32GB of heap you lose pointer compression, which can actually slow you down. You might be better off reducing that and running multiple instances per physical machine.
>>
>> From >0.90.4 or so, compression is on by default, so there is no need to specify it. You might also want to change shards to a factor of your nodes, e.g. 3, 6, 9, for more even allocation.
>>
>> Also try moving to Java 1.7u25, as that is the generally agreed version to run. We run u51 with no issues though, so that might be worth trialling if you can.
>>
>> Finally, what are you using to monitor the actual cluster? Something like ElasticHQ or Marvel will probably provide greater insight into what is happening and what you can do to improve performance.
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: [email protected]
>> web: www.campaignmonitor.com
>>
>>
>> On 16 April 2014 19:06, R. Toma <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> At bol.com we use ELK for a logsearch platform, using 3 machines.
>>>
>>> We need fast indexing (so we don't lose events) and want fast & near-realtime search. The search is currently not fast enough. Simple "give me the last 50 events from the last 15 minutes, from any type, from today's indices, without any terms" search queries may take 1.0 sec, sometimes even exceeding 30 seconds.
>>>
>>> It currently does 3k docs added per second, but we expect 8k/sec by the end of this year.
>>>
>>> I have included lots of specs/config at the bottom of this e-mail.
>>>
>>> We found 2 reliable knobs to turn (a sketch of both follows right after this list):
>>>
>>> 1. index.refresh_interval. At 1 sec, fast search seems impossible. When upping the refresh to 5 sec, search gets faster; at 10 sec it's even faster. But if you search during the refresh (wouldn't a splay be nice?) it's slow again. And a refresh every 10 seconds is not near-realtime anymore. No obvious bottlenecks are present: cpu, network, memory and disk i/o all look OK.
>>> 2. Deleting old indices. No clue why this improves things. And we really do not want to delete old data, since we want to keep at least 60 days of data online. But after deleting old data the search speed slowly crawls back up again...
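Both knobs can be turned at runtime through the index settings API. A minimal sketch, assuming daily indices named like logstash-YYYY.MM.DD and a node on localhost:9200 (the index names below are examples only):

# Knob 1: relax the refresh interval on today's hot index (a dynamic setting).
curl -s -XPUT 'http://localhost:9200/logstash-2014.04.17/_settings' -d '
{ "index.refresh_interval" : "5s" }'

# Knob 2, without deleting data: stop loading bloom filters on an index that is
# no longer written to, or close it entirely (a closed index can be reopened later).
curl -s -XPUT 'http://localhost:9200/logstash-2014.04.16/_settings' -d '
{ "index.codec.bloom.load" : false }'
curl -s -XPOST 'http://localhost:9200/logstash-2014.04.15/_close'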
>>>
>>> We have zillions of metrics ("measure everything") for the OS, ES and the JVM, using Diamond and Graphite. Too much to include here.
>>> We use a Nagios check that simulates Kibana queries to monitor the search speed every 5 minutes.
>>>
>>> When comparing behaviour at refresh_interval 1s vs 5s we see:
>>>
>>> - system% cpu load: depends per server: 150 vs 80, 100 vs 50, 40 vs 25 == lower
>>> - ParNew GC run frequency: 1 vs 0.6 (per second) == less
>>> - CMS GC run frequency: 1 vs 4 (per hour) == more
>>> - avg index time: 8 vs 2.5 (ms) == lower
>>> - refresh frequency: 22 vs 12 (per second) -- still high numbers at 5 sec because we have 17 active indices every day == less
>>> - merge frequency: 12 vs 7 (per second) == less
>>> - flush frequency: no difference
>>> - search speed: at 1s way too slow; at 5s (with tests timed between the refresh bursts) search calls take ~50ms
>>>
>>> We already looked at the threadpools:
>>>
>>> - we increased the bulk pool
>>> - we currently do not have any rejects in any pools
>>> - the only pool that has queueing (a spike every 1 or 2 hours) is the 'management' pool (but that's probably Diamond)
>>>
>>> We have a feeling something blocks/locks under high index and high search frequency. But what? I have looked at nearly all metrics and _cat output.
>>>
>>> Our current list of untested/wild ideas:
>>>
>>> - Is index.codec.bloom.load=false on yesterday's indices really the magic bullet? We haven't tried it.
>>> - Adding a 2nd JVM per machine is an option, but as long as we do not know the real cause it's not a real option (yet).
>>> - Lowering the heap from 48GB to 30GB, to avoid the overhead of uncompressed 64-bit pointers.
>>>
>>> What knobs do you suggest we start turning?
>>>
>>> Any help is much appreciated!
>>>
>>> A little present from me in return: I suggest you read http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html and decide whether you need dynamic scripting enabled (the default), as it allows for remote code execution via the REST API. Credits go to Byron at Trifork!
>>>
>>>
>>> More details:
>>>
>>> Versions:
>>>
>>> - ES 1.0.1 on: java version "1.7.0_17", Java(TM) SE Runtime Environment (build 1.7.0_17-b02), Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
>>> - Logstash 1.1.13 (with a backported elasticsearch_http plugin, for idle_flush_time support)
>>> - Kibana 2
>>>
>>> Setup:
>>>
>>> - we use several types of shippers/feeders, all sending logging to a set of redis servers (the log4j and accesslog shippers/feeders use the Logstash JSON format to avoid grokking at the Logstash side)
>>> - several Logstash instances consume the redis list, process events and store them in ES using the bulk API (we use bulk because we dislike the version lock-in of the native transport)
>>> - we use bulk async (we thought it would speed up indexing, which it doesn't)
>>> - we use a bulk batch size of 1000 and an idle flush of 1.0 second
>>>
>>> Hardware for ES:
>>>
>>> - 3x HP 360G8, 24 cores each
>>> - each machine has 256GB RAM (1 ES JVM running per machine with a 48GB heap, so lots of free RAM for caching)
>>> - each machine has 8x 1TB SAS (1 for the OS and 7 as separate disks for use in ES' -Des.path.data=....)
>>>
>>> Logstash integration:
>>>
>>> - using the bulk API, to avoid the version lock-in (maybe slower, which we can fix by scaling out / adding more Logstash instances); a minimal bulk request sketch follows this list
>>> - 17 new indices every day (e.g. syslog, accesslogging, log4j + stacktraces)
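For reference, this is roughly what a bulk request from the elasticsearch_http output looks like on the wire: newline-delimited JSON sent to the _bulk endpoint. A minimal sketch (the index and type names are examples, and the trailing newline is required):

# One action line followed by one source line per event.
curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary '{ "index" : { "_index" : "logstash-2014.04.17", "_type" : "syslog" } }
{ "@timestamp" : "2014-04-17T12:48:00Z", "message" : "hello world" }
'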
>>>
>>> ES configuration (a template sketch follows this list):
>>>
>>> - ES_HEAP_SIZE: 48gb
>>> - index.number_of_shards: 5
>>> - index.number_of_replicas: 1
>>> - index.refresh_interval: 1s
>>> - index.store.compress.stored: true
>>> - index.translog.flush_threshold_ops: 50000
>>> - indices.memory.index_buffer_size: 50%
>>> - default index mapping
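Since a fresh set of indices is created every day, per-index settings like these are usually carried in an index template so that each day's new indices pick them up automatically. A minimal sketch, assuming a logstash-* index pattern and a template name of 'logsearch' (both names are assumptions):

# Applies to every newly created index whose name matches the pattern.
curl -s -XPUT 'http://localhost:9200/_template/logsearch' -d '
{
  "template" : "logstash-*",
  "settings" : {
    "index.number_of_shards" : 5,
    "index.number_of_replicas" : 1,
    "index.refresh_interval" : "1s",
    "index.translog.flush_threshold_ops" : 50000
  }
}'
# ES_HEAP_SIZE and indices.memory.index_buffer_size are node-level settings and stay
# in the environment / elasticsearch.yml; stored-field compression is on by default
# from ~0.90.4 (as noted above), so index.store.compress.stored need not be set.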
>>>
>>> Regards,
>>> Renzo Toma
>>> Bol.com
>>>
>>> p.s. we are hiring! :-)