I will have to sanitize the log file before posting it, as it contains confidential information, but here is the basic rundown of what is happening. The circuit breakers do not seem to keep us from hitting an OOM or "stop-the-world" GC event, which makes the affected node(s) unresponsive and very quickly brings down our cluster. This has happened once a day for the last week. The little background I can give without posting the log file is that a large query seems to come in, one node gets an OOM, and the other nodes trip their circuit breakers. It would be great if the OOM node came back up and did not bring down our cluster, but that is not the case.
We have 3 master nodes, 26 data-only nodes, and 1 client node in production.

1. Can someone who has experimented with the circuit breakers give me some feedback as to why we still get OOMs related to a specific API request even when we set all 3 circuit breakers to 1%?
2. Circuit breakers seem to work only against single queries (not a single API request), which does not help much for an enterprise solution like ours. Is this a correct assumption?
3. Is there anything I can do on each node to ensure that we avoid OOMs?
   a. Change the max heap size?
   b. Change to G1GC?
   c. Change the setting index.cache.field.type to soft to allow for more aggressive GC?
   d. Change the JVM options CMSInitiatingOccupancyFraction and UseCMSInitiatingOccupancyOnly?

Thanks,
Will
