[prometheus-users] Large portions of prometheus TSDB data getting corrupted and deleted

Brandon Duffany Wed, 22 Sep 2021 19:54:05 -0700

After our Prometheus server restarted today (Sep 22), we noticed that 11 
days' worth of data had somehow been deleted from the TSDB, from Sep 10 
through Sep 21. (We are running Prometheus on Kubernetes using a persistent 
volume for the TSDB data directory.)


I think the data was actually deleted by Prometheus itself, because disk 
usage was at 50GB just after the Prometheus server started, but then 
dropped to around 8GB shortly after.

Furthermore, we saw the following in Prometheus server logs:

level=warn ts=2021-09-22T16:24:09.931Z caller=db.go:662 component=tsdb msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /data/wal/00007611 at 25123899: unexpected checksum a70d7089, expected 30cb982b"

level=warn ts=2021-09-22T16:24:09.931Z caller=wal.go:354 component=tsdb msg="Starting corruption repair" segment=7611 offset=25123899

level=warn ts=2021-09-22T16:24:09.933Z caller=wal.go:362 component=tsdb msg="Deleting all segments newer than corrupted segment" segment=7611
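
For anyone triaging similar logs, here is a minimal sketch of pulling the 
segment and offset out of these lines. It assumes the lines are logfmt-style 
key=value pairs with double-quoted values, and it leans on Python's shlex 
for the quoting, which may not handle every message (an unbalanced quote 
inside a value would break it). parse_logfmt is a hypothetical helper name, 
not part of any Prometheus tooling.

```python
import shlex

# One of the log lines from above, joined back onto a single line.
LINE = (
    'level=warn ts=2021-09-22T16:24:09.931Z caller=wal.go:354 component=tsdb '
    'msg="Starting corruption repair" segment=7611 offset=25123899'
)


def parse_logfmt(line):
    """Split a logfmt-style line into a dict of string fields.

    Assumption: values are either bare words or double-quoted strings,
    which shlex can tokenize the same way a POSIX shell would.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


if __name__ == "__main__":
    fields = parse_logfmt(LINE)
    print(fields["msg"], "segment", fields["segment"], "offset", fields["offset"])
```

Running this over the whole log and filtering on component=tsdb should make 
it easier to see when the repair started and which segment it pointed at.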


And if I look at the TSDB data volume (/data), there is a gap in the block 
directories between Sep 10 and Sep 22, which looks like the ones in between 
got deleted:

/data $ stat -c '%y %n' * | sort
2020-11-24 18:17:06.000000000 lost+found
2020-11-24 18:17:14.000000000 lock
2021-09-08 09:02:11.000000000 01FF2A7WB036B9VEN19TXFN6K6
2021-09-09 03:03:33.000000000 01FF47ZTKY1SJBPKT11VED354K
2021-09-09 21:06:33.000000000 01FF65WM7VS00ZQ98C1W0GAJ99
2021-09-10 20:52:19.000000000 01FF8QK0AF514208XDW7W576QD
2021-09-22 17:01:29.000000000 01FG776WNKM6W4ZPADSG20GK6H
2021-09-22 21:00:36.000000000 01FG7MXRH65JXVR46PD0SWE951
2021-09-22 21:01:30.000000000 01FG7MZF9EC1HMW57K9GR866WY
2021-09-22 23:00:30.000000000 01FG7VSG0FBFFXWS1QJ68208HW
2021-09-23 01:00:01.000000000 01FG82N78GAX6C31MMWR14K8BH
2021-09-23 01:00:01.000000000 chunks_head
2021-09-23 01:00:01.000000000 wal
2021-09-23 02:42:48.000000000 queries.active
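
To double-check the gap without relying on mtimes, here is a rough sketch 
that decodes the creation time embedded in each block directory name. It 
assumes the names are ULIDs, whose first 10 characters are a Crockford 
base32 millisecond timestamp. Note this is when the block was written, not 
the time range of the samples inside it (that should be in each block's 
meta.json). The function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Entries copied from the listing of /data above.
ENTRIES = [
    "lost+found", "lock",
    "01FF2A7WB036B9VEN19TXFN6K6", "01FF47ZTKY1SJBPKT11VED354K",
    "01FF65WM7VS00ZQ98C1W0GAJ99", "01FF8QK0AF514208XDW7W576QD",
    "01FG776WNKM6W4ZPADSG20GK6H", "01FG7MXRH65JXVR46PD0SWE951",
    "01FG7MZF9EC1HMW57K9GR866WY", "01FG7VSG0FBFFXWS1QJ68208HW",
    "01FG82N78GAX6C31MMWR14K8BH",
    "chunks_head", "wal", "queries.active",
]


def is_ulid(name):
    """True if the name has the shape of a ULID (26 Crockford base32 chars)."""
    return len(name) == 26 and all(c in CROCKFORD for c in name)


def ulid_time(name):
    """Decode the 48-bit millisecond timestamp from the first 10 characters."""
    ms = 0
    for c in name[:10]:
        ms = ms * 32 + CROCKFORD.index(c)
    return EPOCH + timedelta(milliseconds=ms)


def largest_gap(names):
    """Return (gap, earlier_block, later_block) for the widest creation gap."""
    blocks = sorted(n for n in names if is_ulid(n))  # ULIDs sort by time
    pairs = zip(blocks, blocks[1:])
    return max(((ulid_time(b) - ulid_time(a), a, b) for a, b in pairs),
               default=None)


if __name__ == "__main__":
    gap, earlier, later = largest_gap(ENTRIES)
    print(f"largest gap: {gap} between {earlier} and {later}")
```

On the listing above, the widest gap falls between 
01FF8QK0AF514208XDW7W576QD and 01FG776WNKM6W4ZPADSG20GK6H, which lines up 
with the missing Sep 10 through Sep 21 range.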


Has anyone run into similar issues before or know why this data corruption 
might be happening?

Or, is there anywhere we can look for hints as to why TSDB thought the data 
was corrupted and removed so many of the chunks (over 40GB of data in our 
case)?

Any help would be greatly appreciated. Thanks!
