Mike - The above sounds like it happened because the machines were sending 
too many indexing requests and merging couldn't keep up. The usual suspects 
would be insufficient CPU or disk bandwidth. 
This doesn't sound related to the memory constraints posted in the original 
issue of this thread. Do you see GC traces in the logs? 
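
If it is memory pressure, the long collections show up as [gc][old] 
warnings from the monitor.jvm logger, so something like this should 
surface them (the log path is an assumption; adjust for your install):

    grep '\[gc\]\[old\]' /var/log/elasticsearch/*.log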

On Friday, June 20, 2014 9:40:48 AM UTC-4, Michael Hart wrote:
>
> We're seeing the same thing. ES 1.1.0, JDK 7u55 on Ubuntu 12.04, 5 data 
> nodes, 3 separate masters, all are 15GB hosts with 7.5GB Heaps, storage is 
> SSD. Data set is ~1.6TB according to Marvel.
>
> Our daily indices are roughly 33GB in size, with 5 shards and 2 replicas. 
> I'm still investigating what happened yesterday, but I do see in Marvel a 
> large spike in the "Indices Current Merges" graph just before the node 
> dies, and a corresponding increase in JVM Heap. When Heap hits 99% 
> everything grinds to a halt. Restarting the node "fixes" the issue, but 
> this is the third or fourth time it's happened.
>
> I'm still researching how to deal with this, but a couple of things I am 
> looking at are:
>
>    - increase the number of shards so that the segment merges stay 
>    smaller (is that even a legitimate sentence?). I'm still reading through 
>    the Index Modules Merge page 
>    <http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html> 
>    for more details.
>    - look at store level throttling 
>    <http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling> 
>    (a rough sketch follows this list).
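>
> For the throttling idea, this is roughly the settings update I have in 
> mind, via the cluster settings API (untested on our cluster; the 20mb 
> value is just a placeholder to start from):
>
> curl -XPUT 'localhost:9200/_cluster/settings' -d '{
>   "transient": {
>     "indices.store.throttle.type": "merge",
>     "indices.store.throttle.max_bytes_per_sec": "20mb"
>   }
> }'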
>
> I would love to get some feedback on my ramblings. If I find anything more 
> I'll update this thread.
>
> cheers
> mike
>
> On Thursday, June 19, 2014 4:05:54 PM UTC-4, Bruce Ritchie wrote:
>
> Java 8 with G1GC perhaps? It'll have more overhead, but it may be more 
> consistent wrt pauses.
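>
> If you want to experiment, the change is roughly this in 
> elasticsearch.in.sh (you'd replace the default CMS flags there; note G1 
> wasn't officially recommended for ES 1.x, so test before trusting it):
>
> # comment out -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC, then add:
> JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"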
>
> On Wednesday, June 18, 2014 2:02:24 PM UTC-4, Eric Brandes wrote:
>
> I'd just like to chime in with a "me too".  Is the answer just more 
> nodes?  In my case this is happening every week or so.
>
> On Monday, April 21, 2014 9:04:33 PM UTC-5, Brian Flad wrote:
>
> My dataset currently is 100GB across a few "daily" indices (~5-6GB and 15 
> shards each). Data nodes are 12 CPU, 12GB RAM (6GB heap).
>
>
> On Mon, Apr 21, 2014 at 6:33 PM, Mark Walkom <[email protected]> 
> wrote:
>
> How big are your data sets? How big are your nodes?
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: [email protected]
> web: www.campaignmonitor.com
>
>
> On 22 April 2014 00:32, Brian Flad <[email protected]> wrote:
>
> We're seeing the same behavior with 1.1.1, JDK 7u55, 3 master nodes (2 min 
> master), and 5 data nodes. Interestingly, we see the repeated young GCs 
> only on a node or two at a time. Cluster operations (such as recovering 
> unassigned shards) grind to a halt. After restarting a GCing node, 
> everything returns to normal operation in the cluster.
>
> Brian F
>
>
> On Wed, Apr 16, 2014 at 8:00 PM, Mark Walkom <[email protected]> 
> wrote:
>
> In both your instances, if you can, have 3 master-eligible nodes, as that 
> will reduce the likelihood of a split cluster since you will always have a 
> majority quorum. Also look at discovery.zen.minimum_master_nodes to go 
> with that.
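>
> A minimal sketch of that setting, in elasticsearch.yml on each node 
> (with 3 master-eligible nodes, a majority quorum is 2):
>
> discovery.zen.minimum_master_nodes: 2
>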
> However you may just be reaching the limit of your nodes, which means the 
> best option is to add another node (which also neatly solves your split 
> brain!).
>
> Ankush, it would help if you could update Java; most people recommend u25, 
> but we run u51 with no problems.
>
>
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: [email protected]
> web: www.campaignmonitor.com
>
>
> On 17 April 2014 07:31, Dominiek ter Heide <[email protected]> wrote:
>
> We are seeing the same issue here. 
>
> Our environment:
>
> - 2 nodes
> - 30GB Heap allocated to ES
> - ~140GB of data
> - 639 indices, 10 shards per index
> - ~48M documents
>
> After starting ES everything is good, but after a couple of hours we see 
> the Heap build up towards 96% on one node and 80% on the other. We then see 
> very long GC pauses on the 96% node:
>
> TOuKgmlzaVaFVA][elasticsearch1.trend1.bottlenose.com][inet[/192.99.45.125:9300]]])
>
> [2014-04-16 12:04:27,845][INFO ][discovery                ] [elasticsearch2.trend1] trend1/I3EHG_XjSayz2OsHyZpeZA
>
> [2014-04-16 12:04:27,850][INFO ][http                     ] [elasticsearch2.trend1] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.99.45.126:9200]}
>
> [2014-04-16 12:04:27,851][INFO ][node                     ] [elasticsearch2.trend1] started
>
> [2014-04-16 12:04:32,669][INFO ][indices.store            ] [elasticsearch2.trend1] updating indices.store.throttle.max_bytes_per_sec from [20mb] to [1gb], note, type is [MERGE]
>
> [2014-04-16 12:04:32,669][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 12:04:32,670][INFO ][indices.recovery         ] [elasticsearch2.trend1] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [2gb]
>
> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 15:25:21,409][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb], all_pools {[young] [67.9mb]->[268.9mb]/[665.6mb]}{[survivor] [60.5mb]->[0b]/[83.1mb]}{[old] [28.6gb]->[21.8gb]/[29.1gb]}
>
> [2014-04-16 16:02:32,523][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][13996][144] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[3m], memory [28.8gb]->[23.5gb]/[29.9gb], all_pools {[young] [21.8mb]->[238.2mb]/[665.6mb]}{[survivor] [82.4mb]->[0b]/[83.1mb]}{[old] [28.7gb]->[23.3gb]/[29.1gb]}
>
> [2014-04-16 16:14:12,386][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][14603][155] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[4.4m], memory [29.2gb]->[23.9gb]/[29.9gb], all_pools {[young] [289mb]->[161.3mb]/[665.6mb]}{[survivor] [58.3mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[23.8gb]/[29.1gb]}
>
> [2014-04-16 16:17:55,480][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][14745][158] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[5.7m], memory [29.7gb]->[24.1gb]/[29.9gb], all_pools {[young] [633.8mb]->[149.7mb]/[665.6mb]}{[survivor] [68.6mb]->[0b]/[83.1mb]}{[old] [29gb]->[24gb]/[29.1gb]}
>
> [2014-04-16 16:21:17,950][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][14857][161] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[7.2m], memory [28.6gb]->[24.5gb]/[29.9gb], all_pools {[young] [27.7mb]->[154.8mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [28.5gb]->[24.3gb]/[29.1gb]}
>
> [2014-04-16 16:24:48,776][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][14978][164] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[8.6m], memory [29.4gb]->[24.7gb]/[29.9gb], all_pools {[young] [475.5mb]->[125.1mb]/[665.6mb]}{[survivor] [68.9mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
>
> [2014-04-16 16:26:54,801][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][15021][165] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[9.9m], memory [29.3gb]->[24.8gb]/[29.9gb], all_pools {[young] [391.8mb]->[151.1mb]/[665.6mb]}{[survivor] [62.4mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
>
> [2014-04-16 16:30:45,393][WARN ][monitor.jvm              ] [elasticsearch2.trend1] [gc][old][15170][168] duration [1.3m], collections [
>
> ...
