I need to sanitize the log file before posting it because it contains 
confidential information, but here is the basic rundown of what is happening.
Our problem is that the circuit breakers do not seem to prevent an 
OOM/"stop-the-world" GC event, which makes the affected node(s) 
unresponsive and very quickly brings down our cluster. We have seen this 
happen once a day for the last week. The little background I can give 
without posting the log file is that a large query seems to come in, one 
node gets an OOM, and the other nodes trip their circuit breakers. It would 
be great if the OOM node came back up without bringing down the cluster, 
but that is not the case.

We have 3 master nodes, 26 data-only nodes, and 1 client node in production.


1. Can someone who has experimented with the circuit breakers give me some 
feedback as to why we still get OOMs related to a specific API request 
even with all 3 circuit breakers set to 1%?
2. Circuit breakers seem to work only against single queries (not a single 
API request), which does not help much for an enterprise solution like 
ours. Is this assumption correct?
3. Is there anything I can do on each node to ensure that we avoid OOMs?

a. Change the max heap size?
b. Change to G1GC?
c. Change the setting index.cache.field.type to soft to allow for more 
aggressive GC?
d. Change the JVM options CMSInitiatingOccupancyFraction and 
UseCMSInitiatingOccupancyOnly?
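
For reference, here is a minimal sketch of how all three breaker limits 
could be set to 1% through the cluster update settings API. The setting 
names are an assumption based on the 1.4-era documentation (total, 
fielddata, request) and may differ on other versions. The helper 
breaker_settings_body is hypothetical, and it only builds the request body 
rather than sending it.

```python
import json

# Assumed setting names (Elasticsearch 1.4-era); verify against your version.
BREAKER_SETTINGS = (
    "indices.breaker.total.limit",
    "indices.breaker.fielddata.limit",
    "indices.breaker.request.limit",
)


def breaker_settings_body(limit="1%", scope="transient"):
    """Build a body for the cluster update settings API (hypothetical helper).

    scope is "transient" (lost on full cluster restart) or "persistent".
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    # Apply the same limit to every breaker listed above.
    return {scope: {name: limit for name in BREAKER_SETTINGS}}


if __name__ == "__main__":
    print(json.dumps(breaker_settings_body(), indent=2))
```

The printed JSON would be the body of a PUT /_cluster/settings request.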

 
Thanks,
Will
