Is there any kind of guide on how I should be sizing these numbers?
I tried doubling cache-max-memory-size and quadrupling
cache-snapshot-memory-size. I also tried writing fewer values per request, but
that didn't really seem to help.
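For reference, the settings I'm referring to live in the [data] section of
influxdb.conf. I believe the v1.1 defaults are roughly:

[data]
  # default is ~1GB; this is the value I doubled
  cache-max-memory-size = 1048576000
  # default is ~25MB; this is the value I quadrupled
  cache-snapshot-memory-size = 26214400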
I tried switching to batch sizes of 1k, 5k, 8k, 10k, 20k, and 80k values per
request. Since I generate data once per second, I had my process log a warning
any time a request took longer than one second to complete. Depending on batch
size, this worked out to anywhere from 1 to 50 posts per second (I had
overestimated the number of values per second I'm writing - it's actually
somewhere around 50k). Somewhere around 30k values per request seemed to work
best. When sending 50 posts of 1k values per second, I'd very frequently see
multiple requests take longer than 1s. With 30k values per post, individual
posts took about 1-3 seconds to complete but seemed to "catch up" every now
and then: I could go up to 10 seconds without seeing a request exceed 1s,
whereas with 50 posts per second I'd always see at least a handful of posts
take longer than 1s in every group.
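For a rough sense of scale (assuming ~50 field values per point, as described
in my earlier messages below):

  50,000 values/sec  / 50 values per point  ~ 1,000 points/sec
  30,000 values/post / 50 values per point  ~ 600 points per post
  1,000 points/sec   / 600 points per post  ~ 1.7 posts/sec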
I did still see timeout errors with this configuration:
Dec 24 10:12:37 process [27383]: 2016-12-24T10:12:37.662Z - error:
{"error":"timeout"}
Dec 24 10:12:37 process [27383]: 2016-12-24T10:12:37.671Z - warn: db long write
duration: 10037
... (there were about 7 of these in a row, all taking longer than 10s to
complete and timing out)
And this is the compaction log from influx around the same time:
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacted full
group (0) into /var/lib/influxdb/data/host/1w/1224/000000689-000000005.tsm.tmp
(#0)
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacted full
4 files into 1 files in 2m38.182651921s
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 beginning full
compaction of group 0, 2 TSM files
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacting
full group (0) /var/lib/influxdb/data/host/1w/1224/000000496-000000006.tsm (#0)
Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacting
full group (0) /var/lib/influxdb/data/host/1w/1224/000000689-000000005.tsm (#1)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 beginning
level 1 compaction of group 0, 6 TSM files
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm
(#0)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm
(#1)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm
(#2)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm
(#3)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm
(#4)
Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting
level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm
(#5)
Dec 24 10:12:16 host influxdb[26479]: [tsm1] 2016/12/24 10:12:16 Snapshot for
path /var/lib/influxdb/data/host/1w/1224 written in 4.217841051s
Dec 24 10:12:26 host influxdb[26479]: [tsm1] 2016/12/24 10:12:26 compacted
level 1 group (0) into
/var/lib/influxdb/data/host/1w/1224/000000691-000000002.tsm.tmp (#0)
Dec 24 10:12:26 host influxdb[26479]: [tsm1] 2016/12/24 10:12:26 compacted
level 1 6 files into 1 files in 10.458949133s
Dec 24 10:14:22 host influxdb[26479]: [tsm1] 2016/12/24 10:14:22 compacted full
group (0) into /var/lib/influxdb/data/host/1w/1224/000000689-000000006.tsm.tmp
(#0)
Dec 24 10:14:22 host influxdb[26479]: [tsm1] 2016/12/24 10:14:22 compacted full
2 files into 1 files in 4m11.403244468s
Compaction times have definitely gone up; that last full compaction took 4
minutes and 11 seconds.
Do you have any further suggestions for how I can "tune" InfluxDB to handle
this large and fast volume of writes? I could send posts less frequently, but
it's still the same amount of data - if I posted every 3 seconds instead, I'd
have to send 3x the number of requests every 3 seconds.
Most queries that run against the DB are for realtime charts (similar to
Grafana) displaying a 5 or 10 minute window of 1s data for a small number of
values and tags. These queries seem to be pretty performant (only about 70ms
for a batch of 5 queries).
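For context, the chart queries look roughly like this (the field and tag
values here are placeholders, but the shape is accurate):

SELECT val1, val2 FROM MyDB."1w".devices WHERE device = 'dev042' AND time > now() - 10m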
I'm still not seeing any bottleneck in terms of memory or CPU (as in, I never
see either of them really spike or max out). The hard drive is a modern SSD
and we recently increased the RAM to 16GB. I'm not sure what's causing the
long write times, or whether it's just the combination of queries, the
continuous query, and compaction that's giving it a hard time.
Thanks again for the help so far!
On Thursday, December 22, 2016 at 9:45:48 AM UTC-6, Paul Dix wrote:
> You might try breaking it up further. We generally do performance tests with
> 1k-10k values per request. You can set the WAL snapshotting sizes here:
> https://github.com/influxdata/influxdb/blob/master/etc/config.sample.toml#L62-L68
>
>
>
> On Wed, Dec 21, 2016 at 2:40 PM, <[email protected]> wrote:
> I'm writing approximately 78,000 values per request (about 50 values per
> point with 1560 points every 3 seconds). I saw similar behavior when writing
> 26,000 values per request every 1 second.
>
>
>
> Should I try breaking those up into smaller writes instead of larger ones?
>
>
>
> How can I adjust the max WAL cache size? I don't see that as an available
> configuration option in v1.1:
>
> https://docs.influxdata.com/influxdb/v1.1/administration/config#environment-variables
>
>
>
> Thanks!
>
>
>
> On Wednesday, December 21, 2016 at 11:20:46 AM UTC-8, Paul Dix wrote:
>
> > Compactions shouldn't cause write timeouts. I would suspect that write
> > timeouts are happening because you're posting too many values per request.
> > You can also try increasing the max WAL cache size.
>
> >
>
> >
>
> > How many actual values are you writing per request? That is, field values.
> > For example:
>
> >
>
> >
>
> > cpu,host=serverA usage_user=23,usage_system=5
>
> >
>
> >
>
> > Represents 2 values posted, not one. That might help narrow things down.
>
> >
>
> >
>
>
>
> > On Wed, Dec 21, 2016 at 1:09 PM, Jeff <[email protected]> wrote:
>
> > I'm facing an interesting problem with my current single-instance InfluxDB
> > deployment. I'm running on an 8-core machine with 8GB RAM (physical
> > hardware), with InfluxDB v1.1.1 running in a Docker container.
>
> >
>
> >
>
> >
>
> > I'm writing 520 points per second, batched into writes of 1,560 points
> > every 3 seconds, to a retention policy of "1w" with a "1d" shard group
> > duration. Each point contains about 50 fields of data. The measurement has
> > 115 fields in total, so for any given point most of the fields are empty,
> > but across all series every field is used.
>
> >
>
> >
>
> >
>
> > There's 1 tag in the measurement with about 520 series. I've got 1
> > ContinuousQuery configured to run every 3 minutes. The CQ is *massive*. It
> > looks something like this:
>
> >
>
> > "CREATE CONTINUOUS QUERY "\"3m\"" ON MyDB BEGIN SELECT mean(val1) AS val1,
> > mean(val2) AS val2, .... this continues for ALL 115 fields ... INTO
> > MyDB."16w".devices FROM MyDB."1w".devices GROUP BY time(3m), device END"
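> >
> > For reference, an individual point looks roughly like this in line
> > protocol (the tag value and field names are placeholders; each point
> > carries ~50 of the 115 fields):
> >
> > devices,device=dev001 val1=1.2,val2=3.4,...,val50=9.9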
>
> >
>
> >
>
> >
>
> > Surprisingly, I don't think the CQ is causing much of a performance issue
> > at the moment. Instead, what I'm seeing in the InfluxDB logs is the
> > following:
>
> >
>
> >
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > beginning level 3 compaction of group 0, 4 TSM files
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000773-000000003.tsm (#0)
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000777-000000003.tsm (#1)
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000781-000000003.tsm (#2)
>
> >
>
> > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30
> > compacting level 3 group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000003.tsm (#3)
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacted level 3 group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000004.tsm.tm
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacted level 3 4 files into 1 files in 6.339871251s
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > beginning full compaction of group 0, 2 TSM files
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000769-000000005.tsm (#0)
>
> >
>
> > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000004.tsm (#1)
>
> >
>
> > Dec 21 18:09:00 hostname influxdb[3119]: [tsm1] 2016/12/21 18:09:00
> > compacted full group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000785-000000005.tsm.tmp (
>
> >
>
> > Dec 21 18:09:00 hostname influxdb[3119]: [tsm1] 2016/12/21 18:09:00
> > compacted full 2 files into 1 files in 23.549201117s
>
> >
>
> >
>
> >
>
> > Not only do those compaction times seem very long (23.5 seconds?), but
> > while that full compaction is running, I'm getting "timeout" errors on
> > writes. That is, it starts taking longer than 10 seconds (the default
> > InfluxDB HTTP write timeout) for a write to be acknowledged. I've seen the
> > full compaction times hover around 30s pretty consistently, and they seem
> > to happen about once every 30 minutes.
>
> >
>
> >
>
> >
>
> > The InfluxDB instance seems to be using all available RAM on the machine.
> > I had to cap the Docker container at 6GB of memory to avoid starving the
> > rest of the system of resources.
>
> >
>
> >
>
> >
>
> > Here's a copy of my logs noting very long write times in conjunction with a
> > full compaction occurring on the database:
>
> >
>
> > Process log (write duration is in ms):
>
> >
>
> > Dec 21 12:28:42 hostname process[11361]: 2016-12-21T12:28:42.615Z - warn:
> > db long write duration: 9824
>
> >
>
> > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.106Z - warn:
> > db long write duration: 8242
>
> >
>
> > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.214Z - warn:
> > db long write duration: 5260
>
> >
>
> > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.314Z - warn:
> > db long write duration: 2273
>
> >
>
> > Dec 21 12:29:23 hostname process[11361]: 2016-12-21T12:29:23.667Z - warn:
> > db long write duration: 5044
>
> >
>
> > Dec 21 12:29:24 hostname process[11361]: 2016-12-21T12:29:24.710Z - warn:
> > db long write duration: 3036
>
> >
>
> > Dec 21 12:29:54 hostname process[11361]: 2016-12-21T12:29:54.533Z - warn:
> > db long write duration: 2393
>
> >
>
> > Dec 21 12:29:56 hostname process[11361]: 2016-12-21T12:29:56.793Z - warn:
> > db long write duration: 1588
>
> >
>
> > Dec 21 12:30:33 hostname process[11361]: 2016-12-21T12:30:33.274Z - warn:
> > db long write duration: 1513
>
> >
>
> >
>
> >
>
> > Influx log:
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacted level 3 group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000004.tsm.tm
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacted level 3 8 files into 1 files in 13.399871009s
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > beginning full compaction of group 0, 2 TSM files
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000513-000000005.tsm (#0)
>
> >
>
> > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000004.tsm (#1)
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacted full group (0) into
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000005.tsm.tmp (
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacted full 2 files into 1 files in 21.447891815s
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > beginning full compaction of group 0, 2 TSM files
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000337-000000006.tsm (#0)
>
> >
>
> > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44
> > compacting full group (0)
> > /var/lib/influxdb/data/hostname/1w/1212/000000529-000000005.tsm (#1)
>
> >
>
> > Dec 21 12:29:26 hostname influxdb[3119]: [tsm1] 2016/12/21 12:29:26
> > Snapshot for path /var/lib/influxdb/data/hostname/1w/1212 written in
> > 788.281773ms
>
> >
>
> > Dec 21 12:30:04 hostname influxdb[3119]: [tsm1] 2016/12/21 12:30:04
> > Snapshot for path /var/lib/influxdb/data/hostname/16w/1213 written in
> > 985.274321ms
>
> >
>
> >
>
> >
>
> > Is there anything I can do to make these compaction times shorter? Would
> > smaller shard groups (maybe 1h instead of 1d) help? Is the sheer number of
> > fields causing a problem? I could potentially break the measurement up
> > into several measurements so that no single measurement has more than
> > about 50 fields.
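> >
> > (If smaller shard groups turn out to help, I believe the change would look
> > something like the statement below; as far as I know, it only applies to
> > shard groups created after the change.)
> >
> > ALTER RETENTION POLICY "1w" ON MyDB SHARD DURATION 1h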
>
> >
>
> >
>
> >
>
> > Thanks for any suggestions!