Hi,
Hoping for some insights on whether we're seeing issues due to scaling, or 
if something is odd.
We've started seeing very long WAL checkpoint durations on our largest 
Prometheus instance. A few days ago it was 20+ minutes, but tonight I got 
paged due to runaway disk usage, and when I checked the logs the last 
checkpoints took well over an hour. The wal/ directory is 70% of total disk 
usage (154G/215G). 

Some instance configuration information:
It is running inside a GKE cluster on a dedicated node, with 8 cores/48G 
memory. We're setting tsdb retention limits to 7d and 200G. We're running 
2.19.1 since a few days, before that 2.18.0.

As mentioned, this is a large instance, scraping ~18k targets and 
prometheus_tsdb_head_series is ~11.6M. Still, this has been working 
relatively stable for a few months, until something caused it to 
destabilize. Maybe we just reached some tipping point?

I've included logs below showing the increased duration over time since the 
last restart ~24h ago.

Let me know if there's any other info I can supply.

Best,
Calle

Logs:
{"caller":"head.go:645","component":"tsdb","level":"info","msg":"Replaying 
WAL and on-disk memory mappable chunks if any, this may take a 
while","ts":"2020-06-24T04:57:27.204Z"} 
{"caller":"head.go:682","component":"tsdb","level":"info","msg":"WAL 
checkpoint loaded","ts":"2020-06-24T05:16:15.391Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93060,"ts":"2020-06-24T05:16:18.986Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93061,"ts":"2020-06-24T05:16:22.684Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93062,"ts":"2020-06-24T05:16:26.556Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93063,"ts":"2020-06-24T05:16:31.028Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93064,"ts":"2020-06-24T05:16:38.358Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93065,"ts":"2020-06-24T05:16:42.451Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93066,"ts":"2020-06-24T05:16:43.950Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93067,"ts":"2020-06-24T05:16:48.091Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93068,"ts":"2020-06-24T05:16:52.752Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93069,"ts":"2020-06-24T05:16:56.866Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93070,"ts":"2020-06-24T05:17:00.974Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93071,"ts":"2020-06-24T05:17:05.708Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93072,"ts":"2020-06-24T05:17:09.800Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93073,"ts":"2020-06-24T05:17:26.553Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93074,"ts":"2020-06-24T05:17:29.828Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93075,"ts":"2020-06-24T05:17:33.221Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93076,"ts":"2020-06-24T05:17:36.731Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93077,"ts":"2020-06-24T05:17:40.160Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93078,"ts":"2020-06-24T05:17:41.211Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93079,"ts":"2020-06-24T05:17:44.711Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93080,"ts":"2020-06-24T05:17:48.304Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93081,"ts":"2020-06-24T05:17:51.804Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93082,"ts":"2020-06-24T05:17:55.257Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93083,"ts":"2020-06-24T05:17:58.633Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93084,"ts":"2020-06-24T05:17:59.922Z"} 
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL
 
segment loaded","segment":93085,"ts":"2020-06-24T05:17:59.922Z"} 
{"caller":"head.go:709","component":"tsdb","duration":"20m32.718268711s","level":"info","msg":"WAL
 
replay completed","ts":"2020-06-24T05:17:59.922Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"10.702313139s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T05:20:02.911Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"27m11.480969261s","first":93060,"last":93076,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T05:47:14.438Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"10.491417535s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T05:49:03.732Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"28m5.634599251s","first":93077,"last":93089,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T06:17:09.386Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"10.926510097s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T06:19:18.329Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"28m55.000491785s","first":93090,"last":93102,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T06:48:13.356Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"10.849142474s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T06:50:06.635Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"29m55.379508706s","first":93103,"last":93115,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T07:20:02.033Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.353482447s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T07:22:01.081Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"31m47.883179757s","first":93116,"last":93128,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T07:53:48.997Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.379979457s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T07:55:45.663Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"33m43.864027371s","first":93129,"last":93141,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T08:29:29.573Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.842346357s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T08:31:55.605Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"34m44.030004224s","first":93142,"last":93156,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T09:06:39.650Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.816391001s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T09:08:43.310Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"36m18.979807122s","first":93157,"last":93171,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T09:45:02.312Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.679089866s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T09:47:30.274Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"37m28.519721249s","first":93172,"last":93188,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T10:24:58.848Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.642761491s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T10:27:02.968Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"38m30.98126742s","first":93189,"last":93205,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T11:05:33.989Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.953396425s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T11:07:42.690Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"40m41.750837742s","first":93206,"last":93222,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T11:48:24.524Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.924025363s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T11:50:34.565Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"42m36.719910209s","first":93223,"last":93239,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T12:33:11.342Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.844043457s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T12:35:28.797Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"43m41.251834633s","first":93240,"last":93258,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T13:19:10.143Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.81029093s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T13:21:22.759Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"45m55.549433015s","first":93259,"last":93277,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T14:07:18.370Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.705922433s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T14:09:29.681Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"46m36.224783951s","first":93278,"last":93298,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T14:56:05.977Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.997460717s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T14:58:21.086Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"48m39.050349407s","first":93299,"last":93319,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T15:47:00.187Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.678145052s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T15:49:13.480Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"50m54.193535944s","first":93320,"last":93342,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T16:40:07.722Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.087928789s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T16:42:26.278Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"52m35.511813543s","first":93343,"last":93365,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T17:35:01.922Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.320304825s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T17:37:26.850Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"55m13.524592427s","first":93366,"last":93389,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T18:32:40.419Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.558102685s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T18:35:08.768Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"57m18.998055501s","first":93390,"last":93414,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T19:32:27.826Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.331336051s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T19:34:56.045Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"59m26.168988542s","first":93415,"last":93439,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T20:34:22.252Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"11.957175121s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T20:36:47.776Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"1h2m20.433188445s","first":93440,"last":93465,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T21:39:08.314Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.899107698s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T21:41:36.530Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"1h5m3.288078386s","first":93466,"last":93492,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T22:46:39.904Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.25223606s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T22:49:04.150Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"1h7m12.771595983s","first":93493,"last":93519,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-24T23:56:16.970Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.094735046s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-24T23:59:05.108Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"1h9m42.283215414s","first":93520,"last":93548,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-25T01:08:47.404Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.847163161s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-25T01:11:21.337Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"1h13m25.658134849s","first":93549,"last":93577,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-25T02:24:47.072Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.438982163s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-25T02:27:37.593Z"} 
{"caller":"head.go:869","component":"tsdb","duration":"1h15m12.294907388s","first":93578,"last":93606,"level":"info","msg":"WAL
 
checkpoint complete","ts":"2020-06-25T03:42:50.066Z"} 
{"caller":"head.go:792","component":"tsdb","duration":"12.408087258s","level":"info","msg":"Head
 
GC completed","ts":"2020-06-25T03:45:25.863Z"}

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/74dfe8b6-6e20-41ef-95a9-675fa6e810b1n%40googlegroups.com.

Reply via email to