On Monday, December 5, 2016 at 2:56:23 AM UTC-8, Marco B. wrote:
> Hello everyone,
> 
> We are currently testing a single InfluxDB (1.1) instance with some specific 
> use cases. One of these involves testing with multiple databases and graphite 
> input listeners.
> 
> It's a 2xlarge EC2 instance with 32GB of RAM, 8 CPUs and 8GB of disk (more 
> on this later).
> We have 50 Graphite Input, each of them listening on a different (TCP) port, 
> and 50 databases. 
> Our testing suite sends almost 5 requests per second, the number of series 
> per database is around 4100, and each request carries slightly more than 
> 220KB, which means about 1.3MB processed every second. 
> 
> This is a sample of the graphite plugin configuration we use:
> 
> [[graphite]]
>   enabled = true
>   database = "graphitedb"
>   bind-address = ":2003"
>   protocol = "tcp"
>   # consistency-level = "one"
> 
>   # These next lines control how batching works. You should have this enabled
>   # otherwise you could get dropped metrics or poor performance. Batching
>   # will buffer points in memory if you have many coming in.
> 
>   batch-size = 5000 # will flush if this many points get buffered
>   batch-pending = 100 # number of batches that may be pending in memory
>   # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit
>   # udp-read-buffer = 0 # UDP Read buffer size, 0 means OS default. UDP listener will fail if set above OS max.
> 
>   ### This string joins multiple matching 'measurement' values providing more control over the final measurement name.
>   # separator = "."
> 
>   ### Default tags that will be added to all metrics. These can be overridden at the template level
>   ### or by tags extracted from metric
>   # tags = ["region=us-east", "zone=1c"]
> 
>   ### Each template line requires a template pattern. It can have an optional
>   ### filter before the template, separated by spaces. It can also have optional
>   ### extra tags following the template. Multiple tags should be separated by
>   ### commas and no spaces, similar to the line protocol format. There can be
>   ### only one default template.
>   templates = [
>      # filter + template
>      #"*.app env.service.resource.measurement",
>      # filter + template + extra tag
>      #"stats.* .host.measurement* region=us-west,agent=sensu",
>      # default template
>      "instance.profile.measurement*"
>  ]
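> 
> The other 49 [[graphite]] sections follow the same pattern, each one bound to 
> its own TCP port and writing to its own database. As a rough sketch (the 
> database name and port below are just placeholders, not the real values):
> 
> [[graphite]]
>   enabled = true              # a listener only accepts data when enabled
>   database = "graphitedb_02"  # one database per listener
>   bind-address = ":2004"      # one TCP port per listener
>   protocol = "tcp"
>   batch-size = 5000
>   batch-pending = 100
>   templates = [
>      "instance.profile.measurement*"
>  ]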
> 
> Now, this is the memory usage we get from sar:
> 
> 01:30:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
> 01:40:01 PM  30571056   2380184      7.22     72312    875772   1779084      5.40
> 02:40:01 PM  28261688   4689552     14.23    123948   1889348   2619860      7.95
> 03:40:01 PM  24316120   8635120     26.21    141296   2969536   6085648     18.47
> 04:40:01 PM  22115388  10835852     32.88    145644   4384172   6168988     18.72
> 05:40:01 PM  27836992   5114248     15.52    145692   3926444   6452356     19.58 <------- testing suite stopped sending data here
> 06:40:01 PM  27695964   5255276     15.95    145728   3919372   6468408     19.63
> 07:40:01 PM  21548288  11402952     34.61    146956   5394916   6467208     19.63 <------- testing suite restarted by this point
> 08:40:01 PM  22852844  10098396     30.65    147356   6037624   6467208     19.63
> 09:40:01 PM  21208420  11742820     35.64    147480   6697076   6467180     19.63
> 10:40:01 PM  18228964  14722276     44.68    148908   7299444   7085052     21.50
> 11:40:01 PM  13169316  19781924     60.03    148912   7299464  12149848     36.87
> 
> From here on it keeps climbing, reaches 99% around 5 AM the next day, and the 
> process eventually crashes. The CPU usage is also reported:
> 
> 
> 01:30:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
> 01:40:01 PM     all      5.06      0.00      0.11      0.13      0.08     94.62
> 02:40:01 PM     all      5.11      0.00      0.13      0.08      0.09     94.59
> 03:40:01 PM     all      6.66      0.00      0.13      0.14      0.10     92.96
> 04:40:01 PM     all      6.10      0.00      0.12      0.14      0.09     93.54
> 05:40:01 PM     all      0.07      0.00      0.00      0.00      0.04     99.89 <------- testing suite stopped sending data here
> 06:40:01 PM     all      0.11      0.00      0.00      0.00      0.03     99.85
> 07:40:01 PM     all      5.07      0.00      0.10      0.08      0.07     94.68 <------- testing suite restarted by this point
> 08:40:01 PM     all      4.91      0.00      0.10      0.09      0.09     94.82
> 09:40:01 PM     all      5.29      0.00      0.14      0.22      0.09     94.25
> 10:40:01 PM     all     99.33      0.00      0.25      0.00      0.00      0.41
> 11:40:01 PM     all     99.47      0.00      0.23      0.00      0.00      0.30
> 
> The CPU peak can be explained by the fact that at a certain point we get 
> failures due to a full disk. While I would expect that kind of behavior, I 
> don't understand why memory usage grows almost exponentially, even without 
> failures (at least none logged), in less than 10 hours (from 2GB to 12-14GB). 
> Query times were quite fast, so I assume everything was running smoothly, but 
> the memory usage just doesn't match my calculations:
> 
> Each graphite plugin holds up to 10 batches pending, meaning 220K * 10 * 50 ~ 
> 110MB. The cache snapshot size is the default one, ~28MB, while the maximum 
> cache memory size is 1GB. Moreover, the amount of data stored is less than 
> 7GB, so even if everything were cached, it should still fit in 32GB, right?
> 
> Is there anything else we should know? Config params? Too many databases? Is 
> TCP not appropriate for this? What exactly are we doing wrong? Was this caused 
> by the disk filling up? Even then, why does memory grow so fast?
> 
> Thanks a lot in advance!
> 
> Kind regards,
> Marco

With InfluxDB, as with any Go program, the memory usage reported by the operating 
system can be grossly misleading. Essentially, the program may allocate a lot 
of memory during normal use, and then the Go runtime tells the operating 
system, "I don't need this chunk of memory any more, you can have it back so 
other programs can use it", and the operating system replies, "Nobody else is 
using much memory right now, so keep it in case you need it again later."

In the future, if you want a more accurate insight into how InfluxDB is using 
memory, check the runtime measurement in the _internal database, which contains 
memory stats[1] as reported by the Go runtime.

[1] https://beta.golang.org/pkg/runtime/#MemStats
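
For example, a query along these lines will show how much memory the Go heap 
actually holds versus how much has been handed back to the operating system 
(the field names mirror Go's MemStats; if they differ in your version, check 
them with SHOW FIELD KEYS against the runtime measurement):

    SELECT last("Sys"), last("HeapAlloc"), last("HeapIdle"), last("HeapReleased")
    FROM "_internal".."runtime"

If Sys stays high while HeapAlloc shrinks and HeapReleased grows, the process 
is not leaking memory; the operating system is simply keeping the freed pages 
assigned to it, exactly as described above.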
