80k values per second should be no problem. We regularly test at > 800k values/sec and what's on master now will do ~2M values/sec if you're on a large enough box.

You should be posting 1k-2k values per post, but have multiple threads or processes doing it. Concurrency is the key. The total number of values/sec shouldn't be a problem on your hardware (assuming you're doing < 100k values/sec).
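For illustration, here is a minimal sketch of that batching-plus-concurrency pattern against the 1.x HTTP /write endpoint. The host, database name ("mydb"), helper names, and batch contents are assumptions for the example, not something specified in this thread:

    # Sketch: post pre-built line-protocol batches concurrently.
    # Assumes InfluxDB 1.x listening on localhost:8086 and a database named
    # "mydb"; aim for roughly 1k-2k field values per POST, per the advice above.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    WRITE_URL = "http://localhost:8086/write"

    def post_batch(lines):
        """Send one batch of line-protocol points in a single POST."""
        resp = requests.post(
            WRITE_URL,
            params={"db": "mydb", "precision": "s"},
            data="\n".join(lines).encode("utf-8"),
            timeout=10,
        )
        resp.raise_for_status()
        return len(lines)

    def post_batches(batches, workers=8):
        """Write many small batches in parallel; concurrency carries the total rate."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(post_batch, batches))

Splitting each second's payload into many small batches and handing them to something like post_batches() keeps individual requests small while the writer as a whole still pushes the same values/sec.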
On Sat, Dec 24, 2016 at 10:00 AM, <[email protected]> wrote:

> Is there any kind of guide on how I should be sizing these numbers?
>
> I tried doubling the cache-max-memory-size and quadrupling the cache-snapshot-memory-size. I also tried writing fewer values per request, but that didn't really seem to help.
>
> I tried switching to writing only 1k, 5k, 8k, 10k, 20k, and 80k values per request. I had my process log a warning any time a request took longer than one second to complete, since I'm generating data at a rate of once per second. This resulted in anywhere from 1 to 50 posts per second (I had overestimated the number of values per second I was writing - it's actually somewhere around 50k). It seemed like somewhere around 30k values per request actually worked best. I'd very frequently see multiple requests take longer than 1s when sending 50 posts of 1k values each per second. With 30k values per post, the individual posts would take about 1-3 seconds to complete but seemed to "catch up" every now and then. That is, it could go for up to 10 seconds without a request taking longer than 1s to complete, whereas with 50 posts per second I'd always see at least a handful of posts take longer than 1s in every group.
>
> I did still see timeout errors with this configuration:
>
> Dec 24 10:12:37 process [27383]: 2016-12-24T10:12:37.662Z - error: {"error":"timeout"}
> Dec 24 10:12:37 process [27383]: 2016-12-24T10:12:37.671Z - warn: db long write duration: 10037
> ... (there were about 7 of these in a row, all taking longer than 10s to complete and timing out)
>
> And this is the compaction log from influx around the same time:
>
> Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacted full group (0) into /var/lib/influxdb/data/host/1w/1224/000000689-000000005.tsm.tmp (#0)
> Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacted full 4 files into 1 files in 2m38.182651921s
> Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 beginning full compaction of group 0, 2 TSM files
> Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacting full group (0) /var/lib/influxdb/data/host/1w/1224/000000496-000000006.tsm (#0)
> Dec 24 10:10:10 host influxdb[26479]: [tsm1] 2016/12/24 10:10:10 compacting full group (0) /var/lib/influxdb/data/host/1w/1224/000000689-000000005.tsm (#1)
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 beginning level 1 compaction of group 0, 6 TSM files
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm (#0)
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm (#1)
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000690-000000001.tsm (#2)
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm (#3)
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm (#4)
> Dec 24 10:12:15 host influxdb[26479]: [tsm1] 2016/12/24 10:12:15 compacting level 1 group (0) /var/lib/influxdb/data/host/1w/1224/000000691-000000001.tsm (#5)
> Dec 24 10:12:16 host influxdb[26479]: [tsm1] 2016/12/24 10:12:16 Snapshot for path /var/lib/influxdb/data/host/1w/1224 written in 4.217841051s
> Dec 24 10:12:26 host influxdb[26479]: [tsm1] 2016/12/24 10:12:26 compacted level 1 group (0) into /var/lib/influxdb/data/host/1w/1224/000000691-000000002.tsm.tmp (#0)
> Dec 24 10:12:26 host influxdb[26479]: [tsm1] 2016/12/24 10:12:26 compacted level 1 6 files into 1 files in 10.458949133s
> Dec 24 10:14:22 host influxdb[26479]: [tsm1] 2016/12/24 10:14:22 compacted full group (0) into /var/lib/influxdb/data/host/1w/1224/000000689-000000006.tsm.tmp (#0)
> Dec 24 10:14:22 host influxdb[26479]: [tsm1] 2016/12/24 10:14:22 compacted full 2 files into 1 files in 4m11.403244468s
>
> Compaction times have definitely gone up there, at 4 minutes and 11 seconds.
>
> Do you have any further suggestions for how I can "tune" influx to handle this large and fast volume of writes? I could send posts less frequently, but it's still the same amount of data, so if I did posts every 3 seconds I would have to send 3x the number of requests every 3 seconds.
>
> Most queries that run against the DB are for realtime charts (similar to grafana), displaying a 5 or 10 minute window of 1s data for a small number of values and tags. These queries seem to be pretty performant (only taking about 70ms for a batch of 5 queries).
>
> I'm still not seeing any bottlenecks in terms of memory or CPU (as in, I never see either of them really spike or max out). The hard drive is a modern SSD and we recently increased the RAM to 16GB. I'm not sure what's causing the long write times, or if it's just the combination of queries, the continuous query, and compaction that's giving it a hard time.
>
> Thanks again for the help so far!
>
> On Thursday, December 22, 2016 at 9:45:48 AM UTC-6, Paul Dix wrote:
> > You might try breaking it up further. We generally do performance tests with 1k-10k values per request. You can set the WAL snapshotting sizes here:
> > https://github.com/influxdata/influxdb/blob/master/etc/config.sample.toml#L62-L68
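For reference, those knobs sit in the [data] section of the config file linked above. A sketch with the 1.x key names; the values shown are illustrative, not recommendations:

    # Sketch of the [data] cache/WAL snapshot settings referenced above.
    # Key names follow InfluxDB 1.x's config.sample.toml; values are examples only.
    [data]
      # Max size a shard's in-memory cache can reach before it rejects writes.
      cache-max-memory-size = 1073741824
      # Cache size at which the engine snapshots the cache to a TSM file,
      # releasing the corresponding WAL segments.
      cache-snapshot-memory-size = 26214400
      # Snapshot anyway if the cache has gone this long without receiving writes.
      cache-snapshot-write-cold-duration = "10m"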
> > On Wed, Dec 21, 2016 at 2:40 PM, <[email protected]> wrote:
> > > I'm writing approximately 78,000 values per request (about 50 values per point with 1560 points every 3 seconds). I saw similar behavior when writing 26,000 values per request every 1 second.
> > >
> > > Should I try breaking those up into smaller writes instead of larger ones?
> > >
> > > How can I adjust the max WAL cache size? I don't see that as an available configuration option in v1.1:
> > > https://docs.influxdata.com/influxdb/v1.1/administration/config#environment-variables
> > >
> > > Thanks!
> > >
> > > On Wednesday, December 21, 2016 at 11:20:46 AM UTC-8, Paul Dix wrote:
> > > > Compactions shouldn't cause write timeouts. I would suspect that write timeouts are happening because you're posting too many values per request. You can also try increasing the max WAL cache size.
> > > >
> > > > How many actual values are you writing per request? That is, field values. For example:
> > > >
> > > > cpu,host=serverA usage_user=23,usage_system=5
> > > >
> > > > represents 2 values posted, not one. That might help narrow things down.
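To make that counting rule concrete, here is a small sketch; the parsing is deliberately naive and assumes no escaped spaces or commas inside measurements, tags, or field values:

    # Sketch: count field values in a line-protocol batch, per the rule above
    # (every field on every point counts as one value).
    def count_values(lines):
        total = 0
        for line in lines:
            field_set = line.split(" ")[1]       # "measurement,tags fields [timestamp]"
            total += len(field_set.split(","))   # one value per field
        return total

    print(count_values(["cpu,host=serverA usage_user=23,usage_system=5"]))  # -> 2

By that count, ~1560 points with ~50 fields each is ~78,000 values in a single request, well above the 1k-10k per request used in the performance tests mentioned above.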
> > > > On Wed, Dec 21, 2016 at 1:09 PM, Jeff <[email protected]> wrote:
> > > > > Facing an interesting problem with my current InfluxDB single-instance deployment. I'm running on an 8 core machine with 8GB RAM (physical hardware) with InfluxDB v1.1.1 running in a docker container.
> > > > >
> > > > > I'm writing 520 points per second, in batches of 1560 every 3 seconds, to a Retention Policy of "1w" with a "1d" shard group duration. Each point contains about 50 fields of data. In total the measurement has 115 fields, so for any given point most of the fields are empty, but across all series every field is used.
> > > > >
> > > > > There's 1 tag in the measurement with about 520 series. I've got 1 ContinuousQuery configured to run every 3 minutes. The CQ is *massive*. It looks something like this:
> > > > >
> > > > > "CREATE CONTINUOUS QUERY "\"3m\"" ON MyDB BEGIN SELECT mean(val1) AS val1, mean(val2) AS val2, .... this continues for ALL 115 fields ... INTO MyDB."16w".devices FROM MyDB."1w".devices GROUP BY time(3m), device END"
> > > > >
> > > > > Surprisingly, I don't think the CQ is causing too much of a performance issue at the moment. Instead, what I'm seeing in the influx logs is the following:
> > > > >
> > > > > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30 beginning level 3 compaction of group 0, 4 TSM files
> > > > > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30 compacting level 3 group (0) /var/lib/influxdb/data/hostname/1w/1212/000000773-000000003.tsm (#0)
> > > > > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30 compacting level 3 group (0) /var/lib/influxdb/data/hostname/1w/1212/000000777-000000003.tsm (#1)
> > > > > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30 compacting level 3 group (0) /var/lib/influxdb/data/hostname/1w/1212/000000781-000000003.tsm (#2)
> > > > > Dec 21 18:08:30 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:30 compacting level 3 group (0) /var/lib/influxdb/data/hostname/1w/1212/000000785-000000003.tsm (#3)
> > > > > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37 compacted level 3 group (0) into /var/lib/influxdb/data/hostname/1w/1212/000000785-000000004.tsm.tm
> > > > > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37 compacted level 3 4 files into 1 files in 6.339871251s
> > > > > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37 beginning full compaction of group 0, 2 TSM files
> > > > > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37 compacting full group (0) /var/lib/influxdb/data/hostname/1w/1212/000000769-000000005.tsm (#0)
> > > > > Dec 21 18:08:37 hostname influxdb[3119]: [tsm1] 2016/12/21 18:08:37 compacting full group (0) /var/lib/influxdb/data/hostname/1w/1212/000000785-000000004.tsm (#1)
> > > > > Dec 21 18:09:00 hostname influxdb[3119]: [tsm1] 2016/12/21 18:09:00 compacted full group (0) into /var/lib/influxdb/data/hostname/1w/1212/000000785-000000005.tsm.tmp (
> > > > > Dec 21 18:09:00 hostname influxdb[3119]: [tsm1] 2016/12/21 18:09:00 compacted full 2 files into 1 files in 23.549201117s
> > > > >
> > > > > Not only do those compaction times seem very long (23.5 seconds?), but while that full compaction is being performed I'm getting "timeout" on writes. That is, it starts taking longer than 10 seconds (the default influx http write timeout) for the write to be performed/acknowledged by influx. I've seen the full compaction times hover around 30s consistently, and they seem to happen about once every 30 minutes.
> > > > >
> > > > > The InfluxDB instance seems to be using all available RAM on the machine. I had to cap the docker container at 6GB memory usage in order to not starve the rest of the system of resources.
> > > > >
> > > > > Here's a copy of my logs noting very long write times in conjunction with a full compaction occurring on the database:
> > > > >
> > > > > Process log (write duration is in ms):
> > > > > Dec 21 12:28:42 hostname process[11361]: 2016-12-21T12:28:42.615Z - warn: db long write duration: 9824
> > > > > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.106Z - warn: db long write duration: 8242
> > > > > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.214Z - warn: db long write duration: 5260
> > > > > Dec 21 12:28:44 hostname process[11361]: 2016-12-21T12:28:44.314Z - warn: db long write duration: 2273
> > > > > Dec 21 12:29:23 hostname process[11361]: 2016-12-21T12:29:23.667Z - warn: db long write duration: 5044
> > > > > Dec 21 12:29:24 hostname process[11361]: 2016-12-21T12:29:24.710Z - warn: db long write duration: 3036
> > > > > Dec 21 12:29:54 hostname process[11361]: 2016-12-21T12:29:54.533Z - warn: db long write duration: 2393
> > > > > Dec 21 12:29:56 hostname process[11361]: 2016-12-21T12:29:56.793Z - warn: db long write duration: 1588
> > > > > Dec 21 12:30:33 hostname process[11361]: 2016-12-21T12:30:33.274Z - warn: db long write duration: 1513
> > > > >
> > > > > Influx log:
> > > > > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22 compacted level 3 group (0) into /var/lib/influxdb/data/hostname/1w/1212/000000529-000000004.tsm.tm
> > > > > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22 compacted level 3 8 files into 1 files in 13.399871009s
> > > > > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22 beginning full compaction of group 0, 2 TSM files
> > > > > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22 compacting full group (0) /var/lib/influxdb/data/hostname/1w/1212/000000513-000000005.tsm (#0)
> > > > > Dec 21 12:28:22 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:22 compacting full group (0) /var/lib/influxdb/data/hostname/1w/1212/000000529-000000004.tsm (#1)
> > > > > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44 compacted full group (0) into /var/lib/influxdb/data/hostname/1w/1212/000000529-000000005.tsm.tmp (
> > > > > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44 compacted full 2 files into 1 files in 21.447891815s
> > > > > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44 beginning full compaction of group 0, 2 TSM files
> > > > > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44 compacting full group (0) /var/lib/influxdb/data/hostname/1w/1212/000000337-000000006.tsm (#0)
> > > > > Dec 21 12:28:44 hostname influxdb[3119]: [tsm1] 2016/12/21 12:28:44 compacting full group (0) /var/lib/influxdb/data/hostname/1w/1212/000000529-000000005.tsm (#1)
> > > > > Dec 21 12:29:26 hostname influxdb[3119]: [tsm1] 2016/12/21 12:29:26 Snapshot for path /var/lib/influxdb/data/hostname/1w/1212 written in 788.281773ms
> > > > > Dec 21 12:30:04 hostname influxdb[3119]: [tsm1] 2016/12/21 12:30:04 Snapshot for path /var/lib/influxdb/data/hostname/16w/1213 written in 985.274321ms
> > > > >
> > > > > Is there anything I can do to help these compaction times be shorter? Would having smaller shard groups (maybe 1h instead of 1d) help? Is the sheer number of fields causing a problem? I could potentially break up the measurement into multiple measurements such that no one measurement has more than about 50 fields.
> > > > >
> > > > > Thanks for any suggestions!
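On the shard-group question at the end: a sketch of how the shard group duration could be changed for an experiment, using InfluxQL 1.x syntax and the database/policy names from this thread. Nothing in the thread confirms that a smaller shard group duration would actually shorten compactions:

    -- Sketch: shrink the shard group duration on the existing "1w" retention
    -- policy. Only newly created shard groups pick up the new duration.
    ALTER RETENTION POLICY "1w" ON "MyDB" SHARD DURATION 1h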
