Re: [influxdb] Internal server error, timeout and unusable server after large imports

Sean Beckett Tue, 11 Oct 2016 12:12:30 -0700

On Tue, Oct 11, 2016 at 12:11 AM, <[email protected]> wrote:


> Hi,
>
> It seems that the old issue might have surfaced again (#3349) in v1.0.
>
> I tried to insert a large number of records (3913595) via a script,
> inserting 10000 rows at a time.
>
> After a while I received
>
> HTTP/1.1 500 Internal Server Error
> Content-Type: application/json
> Request-Id: ac8ebbbe-8f70-11e6-8ce7-000000000000
> X-Influxdb-Version: 1.0.0
> Date: Tue, 11 Oct 2016 05:12:02 GMT
> Content-Length: 20
>
> {"error":"timeout"}
> HTTP/1.1 100 Continue
>
> I killed the process, after which the whole box became pretty much
> unresponsive.
>

Killed the InfluxDB process, or the batch writing script process?


>
> There is nothing in the logs (i.e. sudo ls /var/log/influxdb/ gives me
> nothing) although the setting for http logging is true:
>

systemd OSes put the logs in a new place (yay!?). See
http://docs.influxdata.com/influxdb/v1.0/administration/logs/#systemd for
how to read the logs.


>
> [http]
>   enabled = true
>   bind-address = ":8086"
>   auth-enabled = true
>   log-enabled = true
>
> I tried to restart influx, but got the following error:
>
> Failed to connect to http://localhost:8086
> Please check your connection settings and ensure 'influxd' is running.
>

The `influx` console is just a fancy wrapper on the API. That error doesn't
mean much except that the HTTP listener in InfluxDB is not yet up and
running.


>
> Although I can see that influxd is up and running:
>
> > systemctl | grep influx
> influxdb.service  loaded active running  InfluxDB is an open-source, distributed, time series database
>
> What do I do now?
>

Check the logs as referenced above.

The non-responsiveness on startup isn't surprising. It sounds like the
system was overwhelmed with writes, which means that the WAL would have
many points cached, waiting to be flushed to disk. On restart, InfluxDB
won't accept new writes or queries until the cached ones in the WAL have
persisted. For this reason, the HTTP listener is off until the WAL is
flushed.
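Rather than retrying the `influx` console by hand, a script can poll the HTTP listener until it answers. The following is a minimal sketch, assuming the default `localhost:8086` address from the config quoted above and the `/ping` endpoint of the 1.x HTTP API; the function names, attempt count, and delay are illustrative only.

```python
import time
import urllib.request


def ping(base_url="http://localhost:8086", timeout=5.0):
    """Return True if GET /ping answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(base_url + "/ping", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def wait_until_up(probe=ping, attempts=60, delay=5.0, sleep=time.sleep):
    """Call `probe` until it succeeds.

    Returns the 1-based number of the attempt that succeeded,
    or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    return None
```

Calling `wait_until_up()` before resuming the import avoids sending writes while the listener is still down. How long to wait depends on how much data is sitting in the WAL, so the defaults here are only a starting point.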


>
> I tried the same import over the weekend; the script eventually timed out,
> but the result was the same: an unresponsive, unusable server. We rebuilt
> the box and started again.
>

It sounds like the box is just overwhelmed. Did you get backoff messages
from the writes before the crash? What are the machine specs?



>
> Perhaps it is worthwhile mentioning that the same measurement already
> contained about 9 million records. Some of these records had the same
> timestamp as the ones I tried to import, i.e. they should have been merged.
>

Overwriting points is much, much more expensive than posting new points.
Each overwritten point triggers a tombstone record which must later be
processed. This can trigger frequent compactions of the TSM files. With a
high write load and frequent compactions, the system would encounter
significant CPU pressure.
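One way to avoid needless overwrites is to filter the import before sending it. The sketch below is purely illustrative: it assumes the importer already holds the points it would write as `(series, timestamp, fields)` tuples and has some way of knowing what is already stored (the `stored` mapping is hypothetical, and building it is left to the caller). It drops points that would rewrite identical values and returns the rest in chronological order.

```python
def drop_redundant(new_points, stored):
    """Yield only the points that would add or change data.

    new_points: iterable of (series, timestamp, fields) tuples, where
                `fields` is a dict of field name -> value.
    stored:     dict mapping (series, timestamp) -> dict of stored fields
                (hypothetical; how it is obtained is up to the caller).
    """
    for series, ts, fields in new_points:
        existing = stored.get((series, ts))
        if existing is not None and all(
            existing.get(name) == value for name, value in fields.items()
        ):
            continue  # every field already has this value; skip the overwrite
        yield series, ts, fields


def prepare_import(new_points, stored):
    """Filter out redundant points and sort the remainder by timestamp."""
    return sorted(drop_redundant(new_points, stored), key=lambda p: p[1])
```

Points that share a series and timestamp with stored data but carry different field values are kept, since those are genuine updates.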


>
> Interestingly enough the same amount of data was fine when I forgot to add
> precision in ms, i.e. all records were imported as nanoseconds, but in fact
> they "lacked" 6 zeroes.
>

That would mean all points are going to the same shard, because millisecond
values read as nanoseconds collapse into a narrow window just after the
epoch. It is more resource intensive to load points across a wide range of
time, since more shard files are involved. InfluxDB does best with
sequential, chronologically ordered, unique points from the very recent
past. The more the write operation differs from that, the lower the
throughput.
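The effect of the missing precision can be checked with some quick arithmetic. The sketch below uses a hypothetical import that spans 365 days and assumes a 7-day shard group duration with groups aligned to multiples of that duration since the epoch; the real duration depends on the retention policy, so the group counts are approximate.

```python
from datetime import timedelta

NS_PER_MS = 1_000_000
WEEK_NS = 7 * 24 * 3600 * 1_000_000_000  # assumed shard group duration


def span(first, last, unit_ns):
    """Time covered by integer timestamps first..last, one unit = unit_ns ns."""
    return timedelta(microseconds=(last - first) * unit_ns // 1000)


def shard_groups_touched(first, last, unit_ns, group_ns=WEEK_NS):
    """Approximate number of shard groups the range falls into."""
    return (last * unit_ns) // group_ns - (first * unit_ns) // group_ns + 1


# Hypothetical import: 365 days of data, timestamps in milliseconds.
last_ms = 1_476_162_722_000  # 2016-10-11 05:12:02 UTC
first_ms = last_ms - 365 * 24 * 3600 * 1000

print(span(first_ms, last_ms, NS_PER_MS))                  # 365 days, 0:00:00
print(span(first_ms, last_ms, 1))                          # 0:00:31.536000
print(shard_groups_touched(first_ms, last_ms, NS_PER_MS))  # 53
print(shard_groups_touched(first_ms, last_ms, 1))          # 1
```

Read as nanoseconds, the whole year of data fits into about 31.5 seconds in January 1970, so every point lands in one shard. With the correct millisecond precision the same data is spread over dozens of shard groups.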


>
> Please advise what kind of action I can take.
>

Look in the logs for errors. Throttle the writes. Don't overwrite more
points than you have to.
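Throttling could look something like the sketch below. It is not a tuned implementation: the batch size, pause, and retry count are guesses to adjust against the machine's capacity, and the function names are made up for illustration. It assumes the import is available as line-protocol strings and that the target is the 1.x `/write` endpoint. Credentials, which the quoted config requires (`auth-enabled = true`), are left out.

```python
import time
import urllib.error
import urllib.request


def batches(lines, size):
    """Yield successive lists of at most `size` items from `lines`."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff in seconds: base, 2*base, 4*base, ... up to cap."""
    return min(cap, base * 2 ** attempt)


def post_batch(batch, url, timeout=30.0):
    """POST one batch of line-protocol strings; return the HTTP status code."""
    body = "\n".join(batch).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. the 500 "timeout" response from the report


def write_throttled(lines, url, size=5000, pause=0.5, retries=5,
                    send=post_batch, sleep=time.sleep):
    """Write `lines` in batches, pausing between them and backing off on 5xx."""
    for batch in batches(lines, size):
        for attempt in range(retries + 1):
            status = send(batch, url)
            if 200 <= status < 300:
                break
            if status < 500:
                raise RuntimeError(f"write rejected with HTTP {status}")
            sleep(backoff_delay(attempt))  # server is struggling; wait longer
        else:
            raise RuntimeError("giving up after repeated server errors")
        sleep(pause)  # fixed pause between batches to cap the write rate
```

A call might look like `write_throttled(lines, "http://localhost:8086/write?db=mydb&precision=ms")`, where `mydb` is a placeholder database name. Connection failures are not caught here and will propagate to the caller, which is a reasonable point to stop and check the logs.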


>
> Thanks a lot!
> Tanya
>
> --
> Remember to include the InfluxDB version number with all issue reports
> ---
> You received this message because you are subscribed to the Google Groups
> "InfluxDB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/influxdb.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/influxdb/f4ebdb56-32f9-4fb6-88de-f7ef603c4262%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
Sean Beckett
Director of Support and Professional Services
InfluxDB

