The WAL repair logic is here: https://github.com/prometheus/prometheus/blob/v2.22.1/tsdb/wal/wal.go#L339.

The function's comment says it discards all data after the corruption point.

According to your log from Sep 10, the WAL most likely became corrupted at that time, so Prometheus will
delete all data written after that point when it restarts.

But I have one question regarding this issue:
1. It seems Prometheus will not compact the WAL into persistent blocks once it becomes corrupted.


On Thu, Sep 23, 2021 at 11:23:53 PM +0800, Brandon Duffany wrote:

Yep --

Prometheus: 2.22.1 -- revision 00f16d1ac3a4c94561e5133b821d8e4d9ef78ec2

Filesystem: ext4


On Thursday, September 23, 2021 at 12:18:21 AM UTC-4 Julien Pivotto wrote:

Can we know the filesystem you use and your Prometheus version?

Le jeu. 23 sept. 2021 à 06:06, Brandon Duffany <[email protected]> a
écrit :


One detail I forgot to mention which might be relevant: on Friday (Sep 10) we had an issue where the Prometheus server was failing to write new metrics to disk because it was hitting its disk capacity limit. We were seeing this in
the logs:

level=error ts=2021-09-10T19:16:01.392Z caller=scrape.go:1085
component="scrape manager" scrape_pool=kubernetes-service-endpoints target= http://10.138.0.102:19100/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00007614: no space left on device"

We resolved this by increasing the size of the PersistentVolumeClaim from 32GB to 64GB. We ran into the issue again on Mon, Sep 20 and resolved it once again by doubling the storage capacity from 64GB to 128GB.

Is Prometheus designed to be resilient to running out of disk space? Or
could that be part of what caused the corruption?

On Wednesday, September 22, 2021 at 10:53:59 PM UTC-4 Brandon Duffany
wrote:

After our Prometheus server restarted today (Sep 22), we noticed that 11 days' worth of data had somehow been deleted from TSDB, from Sep 10 through Sep 21. (We are running Prometheus on Kubernetes using a persistent volume for
the TSDB data directory.)

I think the data was actually deleted by Prometheus itself, because we saw disk usage at 50GB just after the Prometheus server started, but it then
dropped to around 8GB shortly afterwards.

Furthermore, we saw the following in Prometheus server logs:

level=warn ts=2021-09-22T16:24:09.931Z caller=db.go:662 component=tsdb msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /data/wal/00007611 at 25123899: unexpected checksum
a70d7089, expected 30cb982b"

level=warn ts=2021-09-22T16:24:09.931Z caller=wal.go:354 component=tsdb
msg="Starting corruption repair" segment=7611 offset=25123899

level=warn ts=2021-09-22T16:24:09.933Z caller=wal.go:362 component=tsdb msg="Deleting all segments newer than corrupted segment" segment=7611


And if I look at the TSDB data volume (/data) I see that there are a
bunch of data directories which look like they got deleted:

/data $ stat -c '%y %n' * | sort
2020-11-24 18:17:06.000000000 lost+found
2020-11-24 18:17:14.000000000 lock
2021-09-08 09:02:11.000000000 01FF2A7WB036B9VEN19TXFN6K6
2021-09-09 03:03:33.000000000 01FF47ZTKY1SJBPKT11VED354K
2021-09-09 21:06:33.000000000 01FF65WM7VS00ZQ98C1W0GAJ99
2021-09-10 20:52:19.000000000 01FF8QK0AF514208XDW7W576QD
2021-09-22 17:01:29.000000000 01FG776WNKM6W4ZPADSG20GK6H
2021-09-22 21:00:36.000000000 01FG7MXRH65JXVR46PD0SWE951
2021-09-22 21:01:30.000000000 01FG7MZF9EC1HMW57K9GR866WY
2021-09-22 23:00:30.000000000 01FG7VSG0FBFFXWS1QJ68208HW
2021-09-23 01:00:01.000000000 01FG82N78GAX6C31MMWR14K8BH
2021-09-23 01:00:01.000000000 chunks_head
2021-09-23 01:00:01.000000000 wal
2021-09-23 02:42:48.000000000 queries.active


Has anyone run into similar issues before or know why this data
corruption might be happening?

Or, is there anywhere we can look for hints as to why TSDB thought the data was corrupted and removed so many of the chunks (over 40GB of data in
our case)?

Any help would be greatly appreciated. Thanks!

--
You received this message because you are subscribed to the Google Groups
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/19423814-230d-42a0-ade4-c90abc272cc0n%40googlegroups.com
.


