tl;dr: While stress testing InfluxEnterprise, we saw unexpected memory 
behaviour that we would like help explaining.

Yesterday, I spent some time stressing our self-hosted 2-node InfluxEnterprise 
cluster with a high volume of writes to a test database.  This cluster's data 
nodes are two AWS m4.large instances, which have 2 vCPUs and 8GB of memory 
each.  The servers run Debian 8 (Linux 3.16.0-4-amd64 #1 SMP Debian 
3.16.7-ckt11-1 (2015-05-24) x86_64 GNU/Linux) and are used for nothing else.

The database was specifically created for running the test.  The measurements 
and series stored in that database are defined here in the test script: 
https://gist.github.com/goakley/2bc6503f3bcfbb25270429fa2b3c9e4b  Note the low 
series cardinality for the database.  This database lives in the same cluster 
as a `telegraf` database which has a cardinality of 13k.

To stress the system, we ran the test script on three different machines at 
10k writes/s each, for roughly 30k writes/s in total. Here is a graph showing 
the memory and CPU utilization during testing: http://i.imgur.com/F0GeDT8.png
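
For anyone who does not want to open the gist, a paced writer of the kind 
described above might look roughly like this. It is a minimal sketch, not the 
actual test script: the measurement name `stress`, the `host` tag, the batch 
size, and the `run_paced` / `http_send` helpers are all made up for 
illustration. The only parts taken from this post are the target rate and the 
4-second client timeout; the `/write?db=...` endpoint is the standard InfluxDB 
1.x HTTP write API.

```python
import time
import urllib.request
from typing import Callable, Dict, List

WRITE_TIMEOUT_S = 4  # the 4-second client timeout mentioned in this post


def make_point(measurement: str, tags: Dict[str, str], value: float, ts_ns: int) -> str:
    """Format one point as InfluxDB line protocol with a single 'value' field."""
    key = ",".join([measurement] + [f"{k}={v}" for k, v in sorted(tags.items())])
    return f"{key} value={value} {ts_ns}"


def http_send(lines: List[str], url: str) -> None:
    """POST one batch to a /write endpoint; raises on timeout or HTTP error."""
    body = "\n".join(lines).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=WRITE_TIMEOUT_S):
        pass


def run_paced(
    send: Callable[[List[str]], None],
    rate_per_s: float,
    batch_size: int,
    n_batches: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Send n_batches batches, pacing them so points go out at ~rate_per_s."""
    interval = batch_size / rate_per_s  # seconds between batch start times
    start = clock()
    stats = {"ok": 0, "failed": 0}
    for i in range(n_batches):
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        ts = time.time_ns()
        # Hypothetical low-cardinality series: one measurement, one tag value.
        lines = [make_point("stress", {"host": "h1"}, float(j), ts + j)
                 for j in range(batch_size)]
        try:
            send(lines)
            stats["ok"] += batch_size
        except OSError:  # covers socket timeouts and urllib's URLError/HTTPError
            stats["failed"] += batch_size
    return stats
```

With a placeholder URL, one client at 10k points/s for about a minute would be 
`run_paced(lambda b: http_send(b, "http://localhost:8086/write?db=stress_test"), 
10_000, 1000, 60)`. Note that a failed or timed-out batch is only counted, not 
retried, so a timeout on the client side does not tell you whether the server 
eventually accepted the data.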

Refer to the times on the above image.  The test started at 16:00, with an 
aggregate write throughput of approximately 30k/s. As expected, CPU utilization 
was maxed out in an attempt to keep up with the writes. At around 16:30, our 
scripts started to time out while writing to InfluxDB (with a 4-second 
timeout), although data was still flowing into the system.  The scripts 
remained in this state for the rest of the test.  At 16:45, memory usage 
spiked unexpectedly until the InfluxDB process on one of the machines was 
killed; it was automatically restarted and recovered all unwritten data.  
Shortly afterwards we stopped the scripts, and the cluster recovered.

Here is the log from the server that maxed out its memory during the time at 
which that event occurred: 
https://drive.google.com/file/d/0B5o3UEMmVkdIQVgxM0VuTlR0Z3M/view?usp=sharing

The question here is why InfluxDB suddenly started consuming all of the 
system's memory.  The spike came about 45 minutes into the test, not at the 
start, and about 15 minutes after the write timeouts began.

I see some possible reasons suggested in this InfluxDB feature request: 
https://github.com/influxdata/influxdb/issues/7142  I am wondering if anyone 
else has seen a similar pattern and can explain the behaviour.
