On Monday, December 5, 2016 at 2:56:23 AM UTC-8, Marco B. wrote:
> Hello everyone,
>
> We are currently testing a single InfluxDB (1.1) instance with some specific
> use cases. One of these involves testing with multiple databases and graphite
> input listeners.
>
> It's a 2xlarge EC2 instance with 32GB of RAM, 8 CPUs and 8GB of disk (more
> about this later).
> We have 50 graphite inputs, each of them listening on a different (TCP) port,
> and 50 databases.
> Our testing suite is sending almost 5 requests per second, and the number of
> series (per database) is around 4100, while the total amount of data sent per
> request is slightly more than 220KB, meaning about 1.3MB processed every
> second.
>
> This is a sample of the graphite plugin config used:
>
>   [[graphite]]
>     enabled = false
>     database = "graphitedb"
>     bind-address = ":2003"
>     protocol = "tcp"
>     # consistency-level = "one"
>
>     # These next lines control how batching works. You should have this enabled
>     # otherwise you could get dropped metrics or poor performance. Batching
>     # will buffer points in memory if you have many coming in.
>
>     batch-size = 5000   # will flush if this many points get buffered
>     batch-pending = 100 # number of batches that may be pending in memory
>     # batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit
>     # udp-read-buffer = 0  # UDP read buffer size, 0 means OS default. UDP listener will fail if set above OS max.
>
>     ### This string joins multiple matching 'measurement' values providing more
>     ### control over the final measurement name.
>     # separator = "."
>
>     ### Default tags that will be added to all metrics. These can be
>     ### overridden at the template level or by tags extracted from metric.
>     # tags = ["region=us-east", "zone=1c"]
>
>     ### Each template line requires a template pattern. It can have an optional
>     ### filter before the template, separated by spaces. It can also have
>     ### optional extra tags following the template.
>     ### Multiple tags should be separated by commas and no spaces,
>     ### similar to the line protocol format. There can be only one default
>     ### template.
>     templates = [
>       # filter + template
>       # "*.app env.service.resource.measurement",
>       # filter + template + extra tag
>       # "stats.* .host.measurement* region=us-west,agent=sensu",
>       # default template. Ignore the first graphite component "servers"
>       "instance.profile.measurement*"
>     ]
>
> Now, this is what we get from SAR:
>
> 01:30:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit
> 01:40:01 PM   30571056    2380184      7.22      72312    875772   1779084     5.40
> 02:40:01 PM   28261688    4689552     14.23     123948   1889348   2619860     7.95
> 03:40:01 PM   24316120    8635120     26.21     141296   2969536   6085648    18.47
> 04:40:01 PM   22115388   10835852     32.88     145644   4384172   6168988    18.72
> 05:40:01 PM   27836992    5114248     15.52     145692   3926444   6452356    19.58  <-- testing suite stopped sending data here
> 06:40:01 PM   27695964    5255276     15.95     145728   3919372   6468408    19.63
> 07:40:01 PM   21548288   11402952     34.61     146956   5394916   6467208    19.63  <-- testing suite restarted by this point
> 08:40:01 PM   22852844   10098396     30.65     147356   6037624   6467208    19.63
> 09:40:01 PM   21208420   11742820     35.64     147480   6697076   6467180    19.63
> 10:40:01 PM   18228964   14722276     44.68     148908   7299444   7085052    21.50
> 11:40:01 PM   13169316   19781924     60.03     148912   7299464  12149848    36.87
>
> From here on, memory usage keeps climbing until around 5 AM (the next day),
> when it reaches 99% and InfluxDB eventually crashes.
> The CPU usage is also reported:
>
> 01:30:01 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
> 01:40:01 PM  all   5.06   0.00     0.11     0.13    0.08  94.62
> 02:40:01 PM  all   5.11   0.00     0.13     0.08    0.09  94.59
> 03:40:01 PM  all   6.66   0.00     0.13     0.14    0.10  92.96
> 04:40:01 PM  all   6.10   0.00     0.12     0.14    0.09  93.54
> 05:40:01 PM  all   0.07   0.00     0.00     0.00    0.04  99.89  <-- testing suite stopped sending data here
> 06:40:01 PM  all   0.11   0.00     0.00     0.00    0.03  99.85
> 07:40:01 PM  all   5.07   0.00     0.10     0.08    0.07  94.68  <-- testing suite restarted by this point
> 08:40:01 PM  all   4.91   0.00     0.10     0.09    0.09  94.82
> 09:40:01 PM  all   5.29   0.00     0.14     0.22    0.09  94.25
> 10:40:01 PM  all  99.33   0.00     0.25     0.00    0.00   0.41
> 11:40:01 PM  all  99.47   0.00     0.23     0.00    0.00   0.30
>
> The CPU peak can be explained by the fact that at a certain point we get
> failures due to the disk filling up. While I certainly expect that behavior, I
> don't understand why memory usage grows almost exponentially, even
> without failures (at least none logged), in less than 10 hours (from 2GB to
> 12-14GB). Query times were quite fast, so I assume everything was running
> smoothly, but the memory usage just doesn't match my calculations:
>
> Each graphite plugin holds up to 10 batches pending, meaning 220KB * 10 * 50 ~
> 110MB. The cache size is the default one, ~28MB, while the total cache memory
> limit is 1GB. Moreover, the amount of data stored is less than 7GB, so even
> if everything were cached, it should still fit in 32GB, right?
>
> Is there anything else we should know? Config params? Too many databases? Is
> TCP not appropriate for this? What exactly are we doing wrong? Was this due
> to the fact that the disk was full? Even then, why is memory growing so fast?
>
> Thanks a lot in advance!
>
> Kind regards,
> Marco
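Marco's back-of-envelope bound can be reproduced with a quick sketch. It assumes, as his estimate does, that each pending batch holds roughly one 220KB request's worth of data; note that his arithmetic uses 10 pending batches per listener, while the config shown earlier sets batch-pending = 100, which raises the same worst-case bound tenfold. The helper name here is illustrative, not InfluxDB code:

```go
package main

import "fmt"

// bufferBoundMB estimates the worst-case memory held by graphite input
// buffers: per-request payload (KB) times pending batches per listener
// times the number of listeners, converted to MB.
func bufferBoundMB(requestKB, pending, listeners int) float64 {
	return float64(requestKB*pending*listeners) / 1024.0
}

func main() {
	// 220KB per request, 50 listeners; compare Marco's assumed 10 pending
	// batches with the batch-pending = 100 actually set in the config.
	for _, pending := range []int{10, 100} {
		fmt.Printf("batch-pending=%d -> ~%.0f MB buffered worst-case\n",
			pending, bufferBoundMB(220, pending, 50))
	}
}
```

Either way, the bound is far below the 12-14GB observed, which is what makes the growth surprising.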
With InfluxDB, or any Go program, the memory usage reported by the operating system can be grossly misleading. Essentially, the program may allocate a lot of memory during normal use, and then the Go runtime tells the operating system, "I don't need this chunk of memory any more; you can have it back so other programs can use it," and the operating system says, "Nobody else is using much memory right now, so keep it in case you need it again later."

In the future, if you want more accurate insight into how InfluxDB is using memory, check the "runtime" measurement in the _internal database, which contains the memory stats [1] as reported by the Go runtime.

[1] https://beta.golang.org/pkg/runtime/#MemStats

--
Remember to include the version number!
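The gap between what the OS reports and what the process actually needs can be seen with a minimal Go sketch (illustrative, not InfluxDB code): after the runtime hands idle heap back to the OS, HeapReleased grows, yet tools like sar or top may still count those pages against the process.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// allocateAndRelease grows the heap by roughly 64MB, drops the references,
// and asks the runtime to return as much idle memory to the OS as it can.
func allocateAndRelease() runtime.MemStats {
	buf := make([][]byte, 64)
	for i := range buf {
		buf[i] = make([]byte, 1<<20) // 1MB each
	}
	buf = nil
	debug.FreeOSMemory() // forces a GC, then returns idle pages to the OS

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m
}

func main() {
	m := allocateAndRelease()
	// Sys is what the runtime obtained from the OS; HeapReleased is the part
	// already returned. OS-level tools often still report the full amount.
	fmt.Printf("Sys=%dMB HeapInuse=%dMB HeapReleased=%dMB\n",
		m.Sys>>20, m.HeapInuse>>20, m.HeapReleased>>20)
}
```

These are the same counters exposed in the _internal database's runtime measurement, so comparing HeapInuse there against the SAR numbers above should show whether the memory is genuinely in use or merely not yet reclaimed.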
