One detail I forgot to mention which might be relevant: on Fri (Sep 10) we
had an issue where the Prometheus server was failing to write new metrics
to disk because the volume had run out of space. We were seeing this in
the logs:

level=error ts=2021-09-10T19:16:01.392Z caller=scrape.go:1085 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.138.0.102:19100/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00007614: no space left on device"
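
For anyone debugging something similar, disk pressure like this is quick to
confirm from inside the cluster, e.g. (the "monitoring" namespace and
"prometheus-0" pod name below are placeholders for whatever your deployment
uses):

$ kubectl exec -n monitoring prometheus-0 -- df -h /data
$ kubectl exec -n monitoring prometheus-0 -- du -sh /data/wal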

We resolved this by increasing the size of the PersistentVolumeClaim from 
32GB to 64GB. We ran into the issue again on Mon, Sep 20 and resolved it 
once again by doubling the storage capacity from 64GB to 128GB.
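
For reference, the fix each time was just bumping the PVC's storage
request, roughly like this (resource names here are illustrative, and
expansion only works if the StorageClass has allowVolumeExpansion
enabled):

$ kubectl patch pvc prometheus-data -n monitoring --type merge \
    -p '{"spec":{"resources":{"requests":{"storage":"128Gi"}}}}'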

Is Prometheus designed to be resilient to running out of disk space? Or 
could that be part of what caused the corruption?
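
For what it's worth, one mitigation we're considering is capping on-disk
TSDB size with Prometheus's --storage.tsdb.retention.size flag so old
blocks are pruned before the volume fills, roughly:

$ prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/data \
    --storage.tsdb.retention.size=100GB

(The paths and the 100GB value are just examples for our setup.) As we
understand it, only persisted blocks are deleted to enforce the cap, so
the WAL itself can still grow past it between checkpoints.
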
On Wednesday, September 22, 2021 at 10:53:59 PM UTC-4 Brandon Duffany wrote:

> After our Prometheus server restarted today (Sep 22), we noticed that 11 
> days' worth of data somehow got deleted from TSDB, from Sep 10 through Sep 
> 21. (We are running Prometheus on Kubernetes using a persistent volume for 
> the TSDB data directory.)
>
> I think the data was actually deleted by Prometheus itself: disk usage 
> was at 50GB just after the Prometheus server started, but dropped to 
> around 8GB shortly after.
>
> Furthermore, we saw the following in Prometheus server logs:
>
> level=warn ts=2021-09-22T16:24:09.931Z caller=db.go:662 component=tsdb msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /data/wal/00007611 at 25123899: unexpected checksum a70d7089, expected 30cb982b"
>
> level=warn ts=2021-09-22T16:24:09.931Z caller=wal.go:354 component=tsdb msg="Starting corruption repair" segment=7611 offset=25123899
>
> level=warn ts=2021-09-22T16:24:09.933Z caller=wal.go:362 component=tsdb msg="Deleting all segments newer than corrupted segment" segment=7611
>
> And if I look at the TSDB data volume (/data), the block directories 
> between Sep 10 and Sep 22 appear to have been deleted; here is what 
> remains:
>
> /data $ stat -c '%y %n' * | sort
> 2020-11-24 18:17:06.000000000 lost+found
> 2020-11-24 18:17:14.000000000 lock
> 2021-09-08 09:02:11.000000000 01FF2A7WB036B9VEN19TXFN6K6
> 2021-09-09 03:03:33.000000000 01FF47ZTKY1SJBPKT11VED354K
> 2021-09-09 21:06:33.000000000 01FF65WM7VS00ZQ98C1W0GAJ99
> 2021-09-10 20:52:19.000000000 01FF8QK0AF514208XDW7W576QD
> 2021-09-22 17:01:29.000000000 01FG776WNKM6W4ZPADSG20GK6H
> 2021-09-22 21:00:36.000000000 01FG7MXRH65JXVR46PD0SWE951
> 2021-09-22 21:01:30.000000000 01FG7MZF9EC1HMW57K9GR866WY
> 2021-09-22 23:00:30.000000000 01FG7VSG0FBFFXWS1QJ68208HW
> 2021-09-23 01:00:01.000000000 01FG82N78GAX6C31MMWR14K8BH
> 2021-09-23 01:00:01.000000000 chunks_head
> 2021-09-23 01:00:01.000000000 wal
> 2021-09-23 02:42:48.000000000 queries.active
>
>
> Has anyone run into similar issues before, or does anyone know why this 
> data corruption might be happening?
>
> Or, is there anywhere we can look for hints as to why TSDB thought the 
> data was corrupted and removed so many of the chunks (over 40GB of data 
> in our case)?
>
> Any help would be greatly appreciated. Thanks!
>
