Hello everyone,
We are currently testing a single InfluxDB (1.1) instance against some
specific use cases. One of them involves multiple databases and Graphite input listeners
<https://github.com/influxdata/influxdb/blob/master/services/graphite/README.md>.
The machine is a 2xlarge EC2 instance with 32GB of RAM, 8 CPUs and 8GB of disk (more
about this later).
We have 50 Graphite inputs, each listening on a different TCP port, and 50 databases.
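In other words, the configuration contains one [[graphite]] stanza per
listener/database pair, roughly like the sketch below (assuming one database per
listener; the port numbers and database names here are illustrative, not our real ones):

  [[graphite]]
    enabled = true
    database = "graphitedb01"
    bind-address = ":2003"
    protocol = "tcp"

  [[graphite]]
    enabled = true
    database = "graphitedb02"
    bind-address = ":2004"
    protocol = "tcp"

  # ... and so on, up to the 50th listener/database.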
Our testing suite sends almost 5 requests per second; the number of series per
database is around 4100, and each request carries slightly more than 220KB of data,
which means about 1.3MB is processed every second.
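For reference, each request is just a batch of Graphite plaintext lines
("<metric.path> <value> <timestamp>") pushed over TCP to one of the listeners;
conceptually the suite does something like the sketch below (host, port, metric
names and values are made up for illustration, not taken from our actual suite):

  import socket
  import time

  # Minimal sketch of one "request": a batch of Graphite plaintext lines
  # written to a single TCP listener. Names and sizes are illustrative.
  def send_batch(host="localhost", port=2003, n_series=4100):
      now = int(time.time())
      lines = [
          "instance%04d.profileA.cpu.load %f %d\n" % (i, 0.5, now)
          for i in range(n_series)
      ]
      payload = "".join(lines).encode()  # roughly the ~220KB per request mentioned above
      with socket.create_connection((host, port)) as sock:
          sock.sendall(payload)

  # The suite fires ~5 such requests per second, spread across the 50 listeners.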
This is a sample of the graphite plugin configuration used (one listener shown in full):

[[graphite]]
  enabled = true
  database = "graphitedb"
  bind-address = ":2003"
  protocol = "tcp"
  # consistency-level = "one"

  # These next lines control how batching works. You should have this enabled
  # otherwise you could get dropped metrics or poor performance. Batching
  # will buffer points in memory if you have many coming in.
  batch-size = 5000 # will flush if this many points get buffered
  batch-pending = 100 # number of batches that may be pending in memory
  # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit
  # udp-read-buffer = 0 # UDP Read buffer size, 0 means OS default. UDP listener will fail if set above OS max.

  ### This string joins multiple matching 'measurement' values providing more control over the final measurement name.
  # separator = "."

  ### Default tags that will be added to all metrics. These can be overridden at the template level
  ### or by tags extracted from metric.
  # tags = ["region=us-east", "zone=1c"]

  ### Each template line requires a template pattern. It can have an optional
  ### filter before the template and separated by spaces. It can also have optional extra
  ### tags following the template. Multiple tags should be separated by commas and no spaces
  ### similar to the line protocol format. There can be only one default template.
  templates = [
    # filter + template
    #"*.app env.service.resource.measurement",
    # filter + template + extra tag
    #"stats.* .host.measurement* region=us-west,agent=sensu",
    # default template
    "instance.profile.measurement*"
  ]
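To make the default template concrete: with "instance.profile.measurement*", the
first two components of an incoming Graphite path become the tags "instance" and
"profile", and the remaining components are joined into the measurement name. For
example (names made up), a line like

  server01.myprofile.cpu.load 0.42 1480000000

should end up stored roughly as the point

  cpu.load,instance=server01,profile=myprofile value=0.42

in the database configured for that listener.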
Now, this is what we get from SAR:
01:30:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
01:40:01 PM   30571056    2380184      7.22      72312    875772   1779084     5.40
02:40:01 PM   28261688    4689552     14.23     123948   1889348   2619860     7.95
03:40:01 PM   24316120    8635120     26.21     141296   2969536   6085648    18.47
04:40:01 PM   22115388   10835852     32.88     145644   4384172   6168988    18.72
05:40:01 PM   27836992    5114248     15.52     145692   3926444   6452356    19.58  <------- testing suite stopped sending data here
06:40:01 PM   27695964    5255276     15.95     145728   3919372   6468408    19.63
07:40:01 PM   21548288   11402952     34.61     146956   5394916   6467208    19.63  <------- testing suite restarted by this point
08:40:01 PM   22852844   10098396     30.65     147356   6037624   6467208    19.63
09:40:01 PM   21208420   11742820     35.64     147480   6697076   6467180    19.63
10:40:01 PM   18228964   14722276     44.68     148908   7299444   7085052    21.50
11:40:01 PM   13169316   19781924     60.03     148912   7299464  12149848    36.87
From here on, memory usage keeps growing until, around 5 AM the next day, it reaches
99% and the instance eventually crashes. The CPU usage over the same window:
01:30:01 PM  CPU   %user  %nice  %system  %iowait  %steal  %idle
01:40:01 PM  all    5.06   0.00     0.11     0.13    0.08  94.62
02:40:01 PM  all    5.11   0.00     0.13     0.08    0.09  94.59
03:40:01 PM  all    6.66   0.00     0.13     0.14    0.10  92.96
04:40:01 PM  all    6.10   0.00     0.12     0.14    0.09  93.54
05:40:01 PM  all    0.07   0.00     0.00     0.00    0.04  99.89  <------- testing suite stopped sending data here
06:40:01 PM  all    0.11   0.00     0.00     0.00    0.03  99.85
07:40:01 PM  all    5.07   0.00     0.10     0.08    0.07  94.68  <------- testing suite restarted by this point
08:40:01 PM  all    4.91   0.00     0.10     0.09    0.09  94.82
09:40:01 PM  all    5.29   0.00     0.14     0.22    0.09  94.25
10:40:01 PM  all   99.33   0.00     0.25     0.00    0.00   0.41
11:40:01 PM  all   99.47   0.00     0.23     0.00    0.00   0.30
The CPU peak can be explained by the fact that at some point we start getting
failures because the disk is full. While I certainly expect something like that, I
don't understand why memory usage grows almost exponentially, even without any
failures (at least none logged), in less than 10 hours (from 2GB to 12-14GB). Query
times were quite fast, so I assume everything was running smoothly, but the memory
usage just doesn't match my calculations: each graphite plugin holds up to 10 pending
batches, meaning 220KB * 10 * 50 ≈ 110MB. The cache is at its default settings (~28MB
snapshot size, 1GB maximum cache memory). Moreover, the total amount of data stored is
less than 7GB, so even if everything were cached, it should still fit in 32GB, right?
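Spelled out, the buffering estimate I'm using is the one below (note that it uses the
10 pending batches mentioned above; with the batch-pending = 100 from the pasted
config the bound would be roughly 10x larger):

  # Rough upper bound on memory held by the graphite batching layer,
  # using the figures from this post (all values approximate).
  request_size_bytes = 220 * 1024   # ~220KB of Graphite data per request
  pending_batches    = 10           # pending batches per plugin (config shows batch-pending = 100)
  listeners          = 50           # one [[graphite]] stanza per database

  buffered_bytes = request_size_bytes * pending_batches * listeners
  print(buffered_bytes / 1024.0**2) # ~107MB, i.e. the ~110MB figure above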
Is there anything else we should know? Config params? Too many databases? Is TCP not
appropriate for this? What exactly are we doing wrong? Was this caused by the disk
filling up? Even then, why does memory grow so fast?
Thanks a lot in advance!
Kind regards,
Marco