If you want unlimited retention, you're going to have to keep adding more nodes to the cluster to deal with it.
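For keeping an eye on how full each node is getting as retention (and data volume) grows, the cat API gives a quick per-node view. A minimal sketch, assuming a node listening on localhost:9200:

# Disk used and shard counts per node (cat API is available from ES 1.0).
curl -s 'http://localhost:9200/_cat/allocation?v'

# Size and document count per index, handy when deciding what to close or delete.
curl -s 'http://localhost:9200/_cat/indices?v'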
Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: [email protected]
web: www.campaignmonitor.com


On 17 April 2014 22:48, R. Toma <[email protected]> wrote:

> Hi Mark,
>
> Thank you for your comments.
>
> Regarding the monitoring: we use the Diamond ES collector, which saves metrics every 30 seconds in Graphite. ElasticHQ is nice, but it does its diagnostics calculations over the whole runtime of the cluster instead of the last X minutes. It does have nice diagnostic rules, though, so I created Graphite dashboards for them. Marvel is surely nice, but with the exception of Sense it does not offer me anything I do not already have with Graphite.
>
> New finds:
> * Setting index.codec.bloom.load=false on yesterday's/older indices frees up memory from the fielddata pool. This stays released even when searching.
> * Closing older indices speeds up indexing & refreshing.
>
> Regarding the closing benefit: the impact on refreshing is great! But from a functional point of view it's bad. I know about the 'overhead per index', but cannot find a solution to this.
>
> Does anyone know how to get an ELK stack with "unlimited" retention?
>
> Regards,
> Renzo
>
>
> On Wednesday, 16 April 2014 11:15:32 UTC+2, Mark Walkom wrote:
>>
>> Well, once you go over 31-32GB of heap you lose pointer compression, which can actually slow you down. You might be better off reducing that and running multiple instances per physical machine.
>>
>> From >0.90.4 or so, compression is on by default, so there is no need to specify it. You might also want to change shards to a factor of your nodes, e.g. 3, 6, 9, for more even allocation.
>>
>> Also try moving to Java 1.7u25, as that is the generally agreed version to run. We run u51 with no issues though, so that might be worth trialling if you can.
>>
>> Finally, what are you using to monitor the actual cluster? Something like ElasticHQ or Marvel will probably provide greater insight into what is happening and what you can do to improve performance.
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: [email protected]
>> web: www.campaignmonitor.com
>>
>>
>> On 16 April 2014 19:06, R. Toma <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> At bol.com we use ELK for a logsearch platform, using 3 machines.
>>>
>>> We need fast indexing (so we don't lose events) and want fast & near-realtime search. The search is currently not fast enough. Simple "give me the last 50 events from the last 15 minutes, from any type, from today's indices, without any terms" search queries may take 1.0 sec, sometimes even exceeding 30 seconds.
>>>
>>> It currently does 3k docs added per second, but we expect 8k/sec by the end of this year.
>>>
>>> I have included lots of specs/config at the bottom of this e-mail.
>>>
>>> We found 2 reliable knobs to turn (a sketch of both follows right after this list):
>>>
>>> 1. index.refresh_interval. At 1 sec, fast search seems impossible. When upping the refresh to 5 sec, search gets faster; at 10 sec it's even faster. But if you search during the refresh (wouldn't a splay be nice?) it's slow again. And a refresh every 10 seconds is not near-realtime anymore. No obvious bottlenecks are present: cpu, network, memory and disk i/o all look OK.
>>> 2. Deleting old indices. No clue why this improves things. And we really do not want to delete old data, since we want to keep at least 60 days of data online. But after deleting old data the search speed slowly crawls back up again...
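Both knobs can be turned at runtime through the index settings API. A minimal sketch, assuming daily indices named like logstash-YYYY.MM.DD and a node on localhost:9200 (the index names below are examples only):

# Knob 1: relax the refresh interval on today's hot index (a dynamic setting).
curl -s -XPUT 'http://localhost:9200/logstash-2014.04.17/_settings' -d '
{ "index.refresh_interval" : "5s" }'

# Knob 2, without deleting data: stop loading bloom filters on an index that is
# no longer written to, or close it entirely (a closed index can be reopened later).
curl -s -XPUT 'http://localhost:9200/logstash-2014.04.16/_settings' -d '
{ "index.codec.bloom.load" : false }'
curl -s -XPOST 'http://localhost:9200/logstash-2014.04.15/_close'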
>>>
>>> We have zillions of metrics ("measure everything") for the OS, ES and the JVM, using Diamond and Graphite. Too much to include here.
>>> We use a Nagios check that simulates Kibana queries to monitor the search speed every 5 minutes.
>>>
>>> When comparing behaviour at refresh_interval 1s vs 5s we see:
>>>
>>> - system% cpu load: depends per server: 150 vs 80, 100 vs 50, 40 vs 25 == lower
>>> - ParNew GC run frequency: 1 vs 0.6 (per second) == less
>>> - CMS GC run frequency: 1 vs 4 (per hour) == more
>>> - avg index time: 8 vs 2.5 (ms) == lower
>>> - refresh frequency: 22 vs 12 (per second) -- still high numbers at 5 sec because we have 17 active indices every day == less
>>> - merge frequency: 12 vs 7 (per second) == less
>>> - flush frequency: no difference
>>> - search speed: at 1s way too slow; at 5s (with tests timed between the refresh bursts) search calls take ~50ms
>>>
>>> We already looked at the threadpools:
>>>
>>> - we increased the bulk pool
>>> - we currently do not have any rejects in any pools
>>> - the only pool that has queueing (a spike every 1 or 2 hours) is the 'management' pool (but that's probably Diamond)
>>>
>>> We have a feeling something blocks/locks under high index and high search frequency. But what? I have looked at nearly all metrics and _cat output.
>>>
>>> Our current list of untested/wild ideas:
>>>
>>> - Is index.codec.bloom.load=false on yesterday's indices really the magic bullet? We haven't tried it.
>>> - Adding a 2nd JVM per machine is an option, but as long as we do not know the real cause it's not a real option (yet).
>>> - Lowering the heap from 48GB to 30GB, to avoid the overhead of uncompressed 64-bit pointers.
>>>
>>> What knobs do you suggest we start turning?
>>>
>>> Any help is much appreciated!
>>>
>>> A little present from me in return: I suggest you read http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html and decide whether you need dynamic scripting enabled (the default), as it allows for remote code execution via the REST API. Credits go to Byron at Trifork!
>>>
>>>
>>> More details:
>>>
>>> Versions:
>>>
>>> - ES 1.0.1 on: java version "1.7.0_17", Java(TM) SE Runtime Environment (build 1.7.0_17-b02), Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
>>> - Logstash 1.1.13 (with a backported elasticsearch_http plugin, for idle_flush_time support)
>>> - Kibana 2
>>>
>>> Setup:
>>>
>>> - we use several types of shippers/feeders, all sending logging to a set of redis servers (the log4j and accesslog shippers/feeders use the Logstash JSON format to avoid grokking at the Logstash side)
>>> - several Logstash instances consume the redis list, process events and store them in ES using the bulk API (we use bulk because we dislike the version lock-in of the native transport)
>>> - we use bulk async (we thought it would speed up indexing, which it doesn't)
>>> - we use a bulk batch size of 1000 and an idle flush of 1.0 second
>>>
>>> Hardware for ES:
>>>
>>> - 3x HP 360G8, 24 cores each
>>> - each machine has 256GB RAM (1 ES JVM running per machine with a 48GB heap, so lots of free RAM for caching)
>>> - each machine has 8x 1TB SAS (1 for the OS and 7 as separate disks for use in ES' -Des.path.data=....)
>>>
>>> Logstash integration:
>>>
>>> - using the bulk API, to avoid the version lock-in (maybe slower, which we can fix by scaling out / adding more Logstash instances); a minimal bulk request sketch follows this list
>>> - 17 new indices every day (e.g. syslog, accesslogging, log4j + stacktraces)
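For reference, this is roughly what a bulk request from the elasticsearch_http output looks like on the wire: newline-delimited JSON sent to the _bulk endpoint. A minimal sketch (the index and type names are examples, and the trailing newline is required):

# One action line followed by one source line per event.
curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary '{ "index" : { "_index" : "logstash-2014.04.17", "_type" : "syslog" } }
{ "@timestamp" : "2014-04-17T12:48:00Z", "message" : "hello world" }
'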
>>>
>>> ES configuration (a template sketch follows this list):
>>>
>>> - ES_HEAP_SIZE: 48gb
>>> - index.number_of_shards: 5
>>> - index.number_of_replicas: 1
>>> - index.refresh_interval: 1s
>>> - index.store.compress.stored: true
>>> - index.translog.flush_threshold_ops: 50000
>>> - indices.memory.index_buffer_size: 50%
>>> - default index mapping
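Since a fresh set of indices is created every day, per-index settings like these are usually carried in an index template so that each day's new indices pick them up automatically. A minimal sketch, assuming a logstash-* index pattern and a template name of 'logsearch' (both names are assumptions):

# Applies to every newly created index whose name matches the pattern.
curl -s -XPUT 'http://localhost:9200/_template/logsearch' -d '
{
  "template" : "logstash-*",
  "settings" : {
    "index.number_of_shards" : 5,
    "index.number_of_replicas" : 1,
    "index.refresh_interval" : "1s",
    "index.translog.flush_threshold_ops" : 50000
  }
}'
# ES_HEAP_SIZE and indices.memory.index_buffer_size are node-level settings and stay
# in the environment / elasticsearch.yml; stored-field compression is on by default
# from ~0.90.4 (as noted above), so index.store.compress.stored need not be set.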
>>>
>>> Regards,
>>> Renzo Toma
>>> Bol.com
>>>
>>> p.s. we are hiring! :-)