It's unlikely that any individual write triggered an OOM - it's almost always a query without bounds on a high-cardinality series.
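For illustration (the measurement, field, and tag names below are hypothetical), the first query is the kind of unbounded scan that pulls every series and every point into memory, while the second asks a similar question with a time bound and a single tag:

    -- unbounded: every measurement, every series, no time restriction
    SELECT * FROM /.*/ GROUP BY *

    -- bounded: one measurement, the last hour only, grouped by one tag
    SELECT MEAN(value) FROM cpu WHERE time > now() - 1h GROUP BY host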
Query logging should be on by default. If you've narrowed down what time the OOM killer killed influxd, you should be able to check the InfluxDB logs and look at the queries that ran shortly before that time. Keep an eye out for GROUP BY * against a high-cardinality series, queries without a time bound, or use of aggregates that require the entire dataset to be loaded, e.g. MEDIAN.

On Thu, Dec 29, 2016 at 1:12 PM, Jeffery K <[email protected]> wrote:

> Confirmed in /var/log it was the OOM killer.
> We don't have any particularly large or long-running queries, but we do have a lot of writes. Any tips on narrowing down what query/write may have caused it?
>
> /var/log/messages:Dec 29 10:44:12 influxdb1 kernel: influxd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
> /var/log/messages:Dec 29 10:44:12 influxdb1 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
> /var/log/messages:Dec 29 10:44:12 influxdb1 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> /var/log/messages:Dec 29 10:56:51 influxdb1 kernel: influxd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
> /var/log/messages:Dec 29 10:56:51 influxdb1 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
> /var/log/messages:Dec 29 10:56:51 influxdb1 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> /var/log/messages:Dec 29 11:05:55 influxdb1 kernel: influxd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
> /var/log/messages:Dec 29 11:05:55 influxdb1 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
> /var/log/messages:Dec 29 11:05:55 influxdb1 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> /var/log/messages:Dec 29 11:18:03 influxdb1 kernel: influxd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
> /var/log/messages:Dec 29 11:18:03 influxdb1 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
>
> On Thursday, December 29, 2016 at 2:37:25 PM UTC-5, Mark Rushakoff wrote:
>>
>> It was most likely the OOM killer kicking in during an out-of-control query. Confirming whether it was the OOM killer varies by distribution [1]. It's normal for systemd to restart a service that dies.
>>
>> Besides simply avoiding problematic unbounded queries (such as `SELECT * FROM /.*/ GROUP BY *`), there are some configuration options within the coordinator section that you can set [2] to prevent queries from overconsuming resources.
>>
>> [1] https://unix.stackexchange.com/questions/128642/debug-out-of-memory-with-var-log-messages?rq=1
>> [2] https://docs.influxdata.com/influxdb/v1.1/administration/config/#coordinator
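For reference, the coordinator options mentioned in [2] live in the [coordinator] section of the InfluxDB 1.1 config file. A minimal sketch (the values shown mirror the permissive "no limit" defaults; the exact numbers are something to tune for your own workload):

    [coordinator]
      # abort any query that runs longer than this (0s = no timeout)
      query-timeout = "0s"
      # log queries that run longer than this, to help spot offenders
      log-queries-after = "0s"
      # maximum number of queries allowed to run at once (0 = unlimited)
      max-concurrent-queries = 0
      # caps on the points, series, and GROUP BY time() buckets a single SELECT may process (0 = unlimited)
      max-select-point = 0
      max-select-series = 0
      max-select-buckets = 0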
>>
>> On Thu, Dec 29, 2016 at 11:05 AM, Jeffery K <jeffer...@sightlinesystems.com> wrote:
>>
>>> I had an instance this morning, while Influx was under high load, where after about 20-25 minutes the influxd process restarted and was launched again automatically. Is this a feature?
>>>
>>> Looking at journalctl, all I saw in the log at the time it happened was the following. We are using version 1.1.
>>>
>>> Dec 29 11:18:04 influxdb1 systemd[1]: influxdb.service: main process exited, code=killed, status=9/KILL
>>> Dec 29 11:18:04 influxdb1 systemd[1]: Unit influxdb.service entered failed state.
>>> Dec 29 11:18:04 influxdb1 systemd[1]: influxdb.service failed.
>>>
>>> I've confirmed with others that no one was on the Linux system, and no one manually restarted or killed the process. Does this mean it crashed? If so, how could I confirm that?
>>> We had been getting sporadic timeouts on the write API endpoint leading up to this restart.
