tl;dr: While stress testing InfluxEnterprise, we saw unexpected memory behaviour that we would like to understand.
Yesterday, I spent some time stressing our self-hosted 2-node InfluxEnterprise cluster with a high volume of writes to a test database.

This cluster's data nodes are two AWS m4.large instances, each with 2 vCPUs and 8GB of memory. The servers run Debian 8 (Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24) x86_64 GNU/Linux) and are used for nothing else.

The database was created specifically for this test. The measurements and series stored in it are defined in the test script: https://gist.github.com/goakley/2bc6503f3bcfbb25270429fa2b3c9e4b

Note the low series cardinality for the database. It lives in the same cluster as a `telegraf` database, which has a cardinality of 13k.

To stress the system, we ran the test script at 10k writes/s on each of three machines, for a combined write throughput of approximately 30k/s. Here is a graph showing the memory and CPU utilization during testing: http://i.imgur.com/F0GeDT8.png

The times below refer to that graph. The test started at 16:00. As expected, CPU utilization was maxed out as the nodes tried to keep up with the write throughput. At around 16:30, our scripts started to time out while writing to InfluxDB (with a 4 second wait time), although data was still flowing into the system. The scripts remained in this state for the rest of the test. At 16:45, memory usage spiked unexpectedly until the InfluxDB process on one of the machines was killed, automatically restarted, and recovered all unwritten data. Shortly afterwards we stopped the scripts, and the cluster recovered.

Here is the log from the server that maxed out its memory, covering the time of that event: https://drive.google.com/file/d/0B5o3UEMmVkdIQVgxM0VuTlR0Z3M/view?usp=sharing

The question is why InfluxDB suddenly started consuming all of the system memory. This happened only after the test had already been running for about 45 minutes.
I see some possible reasons suggested in this InfluxDB feature request: https://github.com/influxdata/influxdb/issues/7142

I am wondering if anyone else has seen a similar pattern and can explain the behaviour.
