Hello everyone,

We are currently testing a single InfluxDB (1.1) instance with some
specific use cases. One of these involves testing with multiple databases
and graphite input listeners
<https://github.com/influxdata/influxdb/blob/master/services/graphite/README.md>.

It's a 2xlarge EC2 instance with 32 GB of RAM, 8 CPUs and 8 GB of disk (more
about this later).
We have 50 graphite inputs, each listening on a different TCP port, and 50
databases.
Our test suite sends almost 5 requests per second; each database has around
4,100 series, and each request carries slightly more than 220 KB of data,
meaning about 1.3 MB is processed every second.
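For a rough sense of aggregate scale, this is my own back-of-envelope arithmetic using the rounded figures above (the real rates are slightly higher, hence the ~1.3 MB/s):

```python
# Back-of-envelope load figures, using the rounded numbers from above:
# exactly 5 req/s and exactly 220 KB per request.
listeners = 50
series_per_db = 4100
req_per_sec = 5
kb_per_req = 220

total_series = listeners * series_per_db           # series across all databases
raw_kb_per_sec = req_per_sec * kb_per_req          # raw graphite input per second
raw_gb_per_day = raw_kb_per_sec * 86400 / 1024**2  # raw input per day, pre-compression

print(total_series, raw_kb_per_sec, round(raw_gb_per_day, 1))
# 205,000 series overall; ~1,100 KB/s, i.e. ~90 GB/day of raw input
```

Even with heavy compression, that input rate makes the 8 GB disk look very tight.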

This is a sample of the graphite plugin configuration used:

[[graphite]]
  enabled = false
  database = "graphitedb"
  bind-address = ":2003"
  protocol = "tcp"
  # consistency-level = "one"

  # These next lines control how batching works. You should have this enabled
  # otherwise you could get dropped metrics or poor performance. Batching
  # will buffer points in memory if you have many coming in.

  batch-size = 5000 # will flush if this many points get buffered
  batch-pending = 100 # number of batches that may be pending in memory
  # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit
  # udp-read-buffer = 0 # UDP Read buffer size, 0 means OS default. UDP listener will fail if set above OS max.

  ### This string joins multiple matching 'measurement' values providing more
  ### control over the final measurement name.
  # separator = "."

  ### Default tags that will be added to all metrics. These can be overridden
  ### at the template level or by tags extracted from metric
  # tags = ["region=us-east", "zone=1c"]

  ### Each template line requires a template pattern. It can have an optional
  ### filter before the template and separated by spaces. It can also have
  ### optional extra tags following the template. Multiple tags should be
  ### separated by commas and no spaces similar to the line protocol format.
  ### There can be only one default template.
  templates = [
     # filter + template
     #"*.app env.service.resource.measurement",
     # filter + template + extra tag
     #"stats.* .host.measurement* region=us-west,agent=sensu",
     # default template
     "instance.profile.measurement*"
  ]
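For context, each listener receives the graphite plaintext protocol, one `path value timestamp` line per metric; with the template above, a path like `web01.cpu.load` maps to measurement `load` with tags `instance=web01, profile=cpu`. A minimal sender sketch (the path, host and port below are made-up placeholders, not our real metric names):

```python
import socket

def format_lines(metrics):
    """Build a graphite plaintext payload: one 'path value timestamp\\n' per metric."""
    return "".join(f"{path} {value} {int(ts)}\n" for path, value, ts in metrics)

def send_metrics(metrics, host="localhost", port=2003):
    """Push metrics over TCP to one of the listeners (hypothetical endpoint)."""
    with socket.create_connection((host, port)) as s:
        s.sendall(format_lines(metrics).encode())

# With template "instance.profile.measurement*", this path becomes
# measurement "load" with tags instance=web01, profile=cpu.
payload = format_lines([("web01.cpu.load", 0.42, 1480000000)])
print(payload)  # web01.cpu.load 0.42 1480000000
```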

Now, this is what we get from SAR:

01:30:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
01:40:01 PM   30571056    2380184      7.22      72312    875772   1779084     5.40
02:40:01 PM   28261688    4689552     14.23     123948   1889348   2619860     7.95
03:40:01 PM   24316120    8635120     26.21     141296   2969536   6085648    18.47
04:40:01 PM   22115388   10835852     32.88     145644   4384172   6168988    18.72
05:40:01 PM   27836992    5114248     15.52     145692   3926444   6452356    19.58  <------- testing suite stopped sending data here
06:40:01 PM   27695964    5255276     15.95     145728   3919372   6468408    19.63
07:40:01 PM   21548288   11402952     34.61     146956   5394916   6467208    19.63  <------- testing suite restarted by this point
08:40:01 PM   22852844   10098396     30.65     147356   6037624   6467208    19.63
09:40:01 PM   21208420   11742820     35.64     147480   6697076   6467180    19.63
10:40:01 PM   18228964   14722276     44.68     148908   7299444   7085052    21.50
11:40:01 PM   13169316   19781924     60.03     148912   7299464  12149848    36.87
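To put a number on the growth rate after the restart (my own arithmetic on the kbmemused column above, 07:40 PM to 11:40 PM):

```python
KB_PER_GB = 1024**2

used_1940_kb = 11402952   # kbmemused at 07:40 PM (after restart)
used_2340_kb = 19781924   # kbmemused at 11:40 PM
hours = 4

growth_gb_per_hour = (used_2340_kb - used_1940_kb) / KB_PER_GB / hours
print(round(growth_gb_per_hour, 2))  # ~2 GB/h of growth
# Caveat: kbmemused includes page cache; kbcached grew by ~1.8 GB over the
# same window, so not all of this is the influxd process itself.
```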

From here on memory keeps climbing until, at around 5 AM the next day, it
reaches 99% and the instance eventually crashes. The CPU usage is also
reported:


01:30:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:40:01 PM     all      5.06      0.00      0.11      0.13      0.08     94.62
02:40:01 PM     all      5.11      0.00      0.13      0.08      0.09     94.59
03:40:01 PM     all      6.66      0.00      0.13      0.14      0.10     92.96
04:40:01 PM     all      6.10      0.00      0.12      0.14      0.09     93.54
05:40:01 PM     all      0.07      0.00      0.00      0.00      0.04     99.89  <------- testing suite stopped sending data here
06:40:01 PM     all      0.11      0.00      0.00      0.00      0.03     99.85
07:40:01 PM     all      5.07      0.00      0.10      0.08      0.07     94.68  <------- testing suite restarted by this point
08:40:01 PM     all      4.91      0.00      0.10      0.09      0.09     94.82
09:40:01 PM     all      5.29      0.00      0.14      0.22      0.09     94.25
10:40:01 PM     all     99.33      0.00      0.25      0.00      0.00      0.41
11:40:01 PM     all     99.47      0.00      0.23      0.00      0.00      0.30

The CPU peak can be explained by the fact that, at a certain point, writes
start failing because the disk is full. While I certainly expected something
like that, I don't understand why memory usage grows almost exponentially,
even without failures (at least none logged), in less than 10 hours (from
2 GB to 12-14 GB). Query times were still fast, so I assume everything was
running smoothly, but the memory usage just doesn't match my calculations:

Each graphite plugin holds up to 10 pending batches, meaning roughly
220 KB * 10 * 50 ~ 110 MB. The cache snapshot size is the default one
(~ 28 MB), while the maximum cache memory (cache-max-memory-size) is the
default 1 GB. Moreover, the amount of data stored is less than 7 GB, so even
if everything were cached it should still fit in 32 GB, right?
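The arithmetic behind that estimate, for reference (note that I use 10 pending batches here, while the sample config above shows batch-pending = 100, which would multiply the first term by 10):

```python
KB, MB = 1024, 1024**2

listeners = 50
batch_pending = 10        # pending batches per listener (the sample config says 100)
batch_bytes = 220 * KB    # roughly one request's worth of data per batch

pending_mb = listeners * batch_pending * batch_bytes / MB
print(round(pending_mb))  # ~107 MB, i.e. the ~110 MB quoted above
# Even adding the cache limits and the full < 7 GB dataset on top of this,
# the total should sit comfortably inside 32 GB of RAM.
```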

Is there anything else we should know about? Are there config parameters we
should tune? Are 50 databases too many? Is TCP inappropriate for this use
case? What exactly are we doing wrong? Was this caused by the disk filling
up? And even then, why does memory grow so fast?

Thanks a lot in advance!

Kind regards,
Marco 
