Is there any kind of guide on how I should be sizing these numbers?
I tried doubling cache-max-memory-size and quadrupling
cache-snapshot-memory-size. I also tried writing fewer values per request, but
that didn't really seem to help.
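For reference, the settings I'm referring to live in the [data] section of
influxdb.conf. I believe the v1.1 defaults are roughly:

[data]
  # default is ~1GB; this is the value I doubled
  cache-max-memory-size = 1048576000
  # default is ~25MB; this is the value I quadrupled
  cache-snapshot-memory-size = 26214400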
I tried switching to batch sizes of 1k, 5k, 8k, 10k, 20k, and 80k values per
request. Since I generate data once per second, I had my process log a warning
any time a request took longer than one second to complete. Depending on batch
size, this worked out to anywhere from 1 to 50 posts per second (I had
overestimated the number of values per second I'm writing - it's actually
somewhere around 50k). Somewhere around 30k values per request seemed to work
best. When sending 50 posts of 1k values per second, I'd very frequently see
multiple requests take longer than 1s. With 30k values per post, individual
posts took about 1-3 seconds to complete but seemed to "catch up" every now
and then: I could go up to 10 seconds without seeing a request exceed 1s,
whereas with 50 posts per second I'd always see at least a handful of posts
take longer than 1s in every group.
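For a rough sense of scale (assuming ~50 field values per point, as described
in my earlier messages below):

  50,000 values/sec  / 50 values per point  ~ 1,000 points/sec
  30,000 values/post / 50 values per point  ~ 600 points per post
  1,000 points/sec   / 600 points per post  ~ 1.7 posts/sec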
I did still see timeout errors with this configuration:
Dec 24 10:12:37 process [27383]: 2016-12-24T10:12:37.662Z - error:
{"error":"timeout"}
Dec 24 10:12:37 process [27383]: 2016-12-24T10:12:37.671Z - warn: db long write
duration: 10037
... (there were about 7 of these in a row, all taking longer than 10s to
complete and timing out)
And this is the compaction log from influx around the same time:
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacted full
group (0) into /var/lib/influxdb/data/host/1w/1224/000000689-000000005.tsm.tmp
(#0)
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacted full
4 files into 1 files in 2m38.182651921s
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 beginning full
compaction of group 0, 2 TSM files
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacting
full group (0) /var/lib/influxdb/data/host/1w/1224/000000496-000000006.tsm (#0)
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacting
full group (0) /var/lib/influxdb/data/host/1w/1224/000000689-000000005.tsm (#1)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 beginning
level 1 compaction of group 0, 6 TSM files
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm
(#0)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm
(#1)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm
(#2)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm
(#3)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm
(#4)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm
(#5)
Dec 24 10:12:16 host influxdb[26479]: [tsm1] 2016/12/24 10:12:16 Snapshot for
path /var/lib/influxdb/data/host/1w/1224 written in 4.217841051s
Dec 24 10:12:26 host influxdb[26479]: [tsm1] 2016/12/24 10:12:26 compacted
level 1 group (0) into
/var/lib/influxdb/data/host/1w/1224/000000691-000000002.tsm.tmp (#0)
Dec 24 10:12:26 host influxdb[26479]: [tsm1] 2016/12/24 10:12:26 compacted
level 1 6 files into 1 files in 10.458949133s
Dec 24 10:14:22 host influxdb[26479]: [tsm1] 2016/12/24 10:14:22 compacted full
group (0) into /var/lib/influxdb/data/host/1w/1224/000000689-000000006.tsm.tmp
(#0)
Dec 24 10:14:22 host influxdb[26479]: [tsm1] 2016/12/24 10:14:22 compacted full
2 files into 1 files in 4m11.403244468s
Compaction times have definitely gone up; that last full compaction took 4
minutes and 11 seconds.
Do you have any further suggestions for how I can "tune" InfluxDB to handle
this large and fast volume of writes? I could send posts less frequently, but
it's still the same amount of data - if I posted every 3 seconds instead, I'd
have to send 3x the number of requests every 3 seconds.
Most queries that run against the DB are for realtime charts (similar to
Grafana) displaying a 5 or 10 minute window of 1s data for a small number of
values and tags. These queries seem to be pretty performant (only about 70ms
for a batch of 5 queries).
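For context, the chart queries look roughly like this (the field and tag
values here are placeholders, but the shape is accurate):

SELECT val1, val2 FROM MyDB."1w".devices WHERE device = 'dev042' AND time > now() - 10m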
I'm still not seeing any bottleneck in terms of memory or CPU (as in, I never
see either of them really spike or max out). The hard drive is a modern SSD
and we recently increased the RAM to 16GB. I'm not sure what's causing the
long write times, or whether it's just the combination of queries, the
continuous query, and compaction that's giving it a hard time.
Thanks again for the help so far!
On Thursday, December 22, 2016 at 9:45:48 AM UTC-6, Paul Dix wrote:
> You might try breaking it up further. We generally do performance tests with
> 1k-10k values per request. You can set the WAL snapshotting sizes here:
> https://github.com/influxdata/influxdb/blob/master/etc/config.sample.toml#L62-L68
>
>
>
> On Wed, Dec 21, 2016 at 2:40 PM, <[email protected]> wrote:
> I'm writing approximately 78,000 values per request (about 50 values per
> point with 1560 points every 3 seconds). I saw similar behavior when writing
> 26,000 values per request every 1 second.
>
>
>
> Should I try breaking those up into smaller writes instead of larger ones?
>
>
>
> How can I adjust the max WAL cache size? I don't see that as an available
> configuration option in v1.1:
>
> https://docs.influxdata.com/influxdb/v1.1/administration/config#environment-variables
>
>
>
> Thanks!
>
>
>
> On Wednesday, December 21, 2016 at 11:20:46 AM UTC-8, Paul Dix wrote:
>
> > Compactions shouldn't cause write timeouts. I would suspect that write
> > timeouts are happening because you're posting too many values per request.
> > You can also try increasing the max WAL cache size.
>
> >
>
> >
>
> > How many actual values are you writing per request? That is, field values.
> > For example:
>
> >
>
> >
>
> > cpu,host=serverA usage_user=23,usage_system=5
>
> >
>
> >
>
> > Represents 2 values posted, not one. That might help narrow things down.
>
> >
>
> >
>
>
>
> > On Wed, Dec 21, 2016 at 1:09 PM, Jeff <[email protected]> wrote:
>
> > I'm facing an interesting problem with my current single-instance InfluxDB
> > deployment. I'm running on an 8-core machine with 8GB RAM (physical
> > hardware), with InfluxDB v1.1.1 running in a Docker container.
>
> >
>
> >
>
> >
>
> > I'm writing 520 points per second, batched into writes of 1,560 points
> > every 3 seconds, to a retention policy of "1w" with a "1d" shard group
> > duration. Each point contains about 50 fields of data. The measurement has
> > 115 fields in total, so for any given point most of the fields are empty,
> > but across all series every field is used.
>
> >
>
> >
>
> >
>
> > There's 1 tag in the measurement with about 520 series. I've got 1
> > ContinuousQuery configured to run every 3 minutes. The CQ is *massive*. It
> > looks something like this:
>
> >
>
> > "CREATE CONTINUOUS QUERY "\"3m\"" ON MyDB BEGIN SELECT mean(val1) AS val1,
> > mean(val2) AS val2, .... this continues for ALL 115 fields ... INTO
> > MyDB."16w".devices FROM MyDB."1w".devices GROUP BY time(3m), device END"
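> >
> > For reference, an individual point looks roughly like this in line
> > protocol (the tag value and field names are placeholders; each point
> > carries ~50 of the 115 fields):
> >
> > devices,device=dev001 val1=1.2,val2=3.4,...,val50=9.9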
>
> >
>
> >
>
> >
>
> > Surprisingly, I don't think the CQ is causing much of a performance issue
> > at the moment. Instead, what I'm seeing in the InfluxDB logs is the
> > following:
>
> >
>
> >
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > beginning level 3 compaction of group 0, 4 TSM files
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000773-000000003.tsm (#0)
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000777-000000003.tsm (#1)
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000781-000000003.tsm (#2)
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000003.tsm (#3)
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacted level 3 group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000004.tsm.tm
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacted level 3 4 files into 1 files in 6.339871251s
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > beginning full compaction of group 0, 2 TSM files
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000769-000000005.tsm (#0)
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000004.tsm (#1)
>
> >
>
> > Dec 21 18:09:00 hostname influxdb[3119]: [tsm1] 2016/12/21 18:09:00
> > compacted full group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000005.tsm.tmp (
>
> >
>
> > Dec 21 18:09:00 hostname influxdb[3119]: [tsm1] 2016/12/21 18:09:00
> > compacted full 2 files into 1 files in 23.549201117s
>
> >
>
> >
>
> >
>
> > Not only do those compaction times seem very long (23.5 seconds?), but
> > while that full compaction is running, I'm getting "timeout" errors on
> > writes. That is, it starts taking longer than 10 seconds (the default
> > InfluxDB HTTP write timeout) for a write to be acknowledged. I've seen the
> > full compaction times hover around 30s pretty consistently, and they seem
> > to happen about once every 30 minutes.
>
> >
>
> >
>
> >
>
> > The InfluxDB instance seems to be using all available RAM on the machine.
> > I had to cap the Docker container at 6GB of memory to avoid starving the
> > rest of the system of resources.
>
> >
>
> >
>
> >
>
> > Here's a copy of my logs noting very long write times in conjunction with a
> > full compaction occurring on the database:
>
> >
>
> > Process log (write duration is in ms):
>
> >
>
> > Dec 21 12:28:42 hostname process[11361]: 2016-12-21T12:28:42.615Z - warn:
> > db long write duration: 9824
>
> >
>
> > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.106Z - warn:
> > db long write duration: 8242
>
> >
>
> > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.214Z - warn:
> > db long write duration: 5260
>
> >
>
> > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.314Z - warn:
> > db long write duration: 2273
>
> >
>
> > Dec 21 12:29:23 hostname process[11361]: 2016-12-21T12:29:23.667Z - warn:
> > db long write duration: 5044
>
> >
>
> > Dec 21 12:29:24 hostname process[11361]: 2016-12-21T12:29:24.710Z - warn:
> > db long write duration: 3036
>
> >
>
> > Dec 21 12:29:54 hostname process[11361]: 2016-12-21T12:29:54.533Z - warn:
> > db long write duration: 2393
>
> >
>
> > Dec 21 12:29:56 hostname process[11361]: 2016-12-21T12:29:56.793Z - warn:
> > db long write duration: 1588
>
> >
>
> > Dec 21 12:30:33 hostname process[11361]: 2016-12-21T12:30:33.274Z - warn:
> > db long write duration: 1513
>
> >
>
> >
>
> >
>
> > Influx log:
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacted level 3 group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000004.tsm.tm
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacted level 3 8 files into 1 files in 13.399871009s
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > beginning full compaction of group 0, 2 TSM files
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000513-000000005.tsm (#0)
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000004.tsm (#1)
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacted full group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000005.tsm.tmp (
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacted full 2 files into 1 files in 21.447891815s
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > beginning full compaction of group 0, 2 TSM files
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000337-000000006.tsm (#0)
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000005.tsm (#1)
>
> >
>
> > Dec 21 12:29:26 hostname influxdb[3119]: [tsm1] 2016/12/21 12:29:26
> > Snapshot for path /var/lib/influxdb/data/hostname/1w/1212 written in
> > 788.281773ms
>
> >
>
> > Dec 21 12:30:04 hostname influxdb[3119]: [tsm1] 2016/12/21 12:30:04
> > Snapshot for path /var/lib/influxdb/data/hostname/16w/1213 written in
> > 985.274321ms
>
> >
>
> >
>
> >
>
> > Is there anything I can do to make these compaction times shorter? Would
> > smaller shard groups (maybe 1h instead of 1d) help? Is the sheer number of
> > fields causing a problem? I could potentially break the measurement up
> > into several measurements so that no single measurement has more than
> > about 50 fields.
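> >
> > (If smaller shard groups turn out to help, I believe the change would look
> > something like the statement below; as far as I know, it only applies to
> > shard groups created after the change.)
> >
> > ALTER RETENTION POLICY "1w" ON MyDB SHARD DURATION 1h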
>
> >
>
> >
>
> >
>
> > Thanks for any suggestions!