Re: [prometheus-users] Re: Large portions of prometheus TSDB data getting corrupted and deleted

Brandon Duffany Fri, 24 Sep 2021 10:05:15 -0700


On Thursday, September 23, 2021 at 11:45:37 AM UTC-4 [email protected] wrote:


> There have been several WAL corruption fixes since that version.
>
>
OK, that's good to know. If this happens again, we'll try upgrading our 
Prometheus version and seeing if that fixes it. Thanks!
 

> On Thu, Sep 23, 2021 at 5:23 PM Brandon Duffany <[email protected]> 
> wrote:
>
>> Yep --
>>
>> Prometheus: 2.22.1 -- revision 00f16d1ac3a4c94561e5133b821d8e4d9ef78ec2
>>
>> Filesystem: ext4
>>
>>
>> On Thursday, September 23, 2021 at 12:18:21 AM UTC-4 Julien Pivotto wrote:
>>
>>> Can we know the filesystem you use and your Prometheus version?
>>>
>>> On Thu, Sep 23, 2021 at 06:06, Brandon Duffany <[email protected]> wrote:
>>>
>>>>
>>>> One detail I forgot to mention which might be relevant: on Fri (Sep 10)
>>>> we had an issue where the Prometheus server was failing to write new
>>>> metrics to disk because it was hitting disk capacity limits. We were
>>>> seeing this in the logs:
>>>>
>>>> level=error ts=2021-09-10T19:16:01.392Z caller=scrape.go:1085 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.138.0.102:19100/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /data/wal/00007614: no space left on device"
>>>>
>>>> We resolved this by increasing the size of the PersistentVolumeClaim 
>>>> from 32GB to 64GB. We ran into the issue again on Mon, Sep 20 and resolved 
>>>> it once again by doubling the storage capacity from 64GB to 128GB.
>>>>
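Since both incidents began with the volume filling up, a small free-space
check run alongside the server could give earlier warning. What follows is
only a minimal sketch, not something taken from Prometheus itself: the
function names and the 15% threshold are assumptions chosen for illustration,
and the path to check is passed in by the caller.

```python
import shutil
import sys


def free_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as full.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path: str, min_free: float = 0.15) -> bool:
    """True when less than `min_free` of the volume is free.

    The 0.15 default is an arbitrary illustrative threshold, not a
    Prometheus recommendation.
    """
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # Pass the TSDB data directory as the first argument (e.g. /data).
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{data_dir}: {free_fraction(data_dir):.1%} free, "
          f"low={is_low_on_space(data_dir)}")
```

Running it from a cron job or sidecar against the data directory would be one
way to use it; an alert on filesystem metrics would serve the same purpose.
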
>>>> Is Prometheus designed to be resilient to running out of disk space? Or 
>>>> could that be part of what caused the corruption?
>>>>
>>>> On Wednesday, September 22, 2021 at 10:53:59 PM UTC-4 Brandon Duffany wrote:
>>>>
>>>>> After our Prometheus server restarted today (Sep 22), we noticed that
>>>>> 11 days' worth of data had somehow been deleted from the TSDB, from Sep 10
>>>>> through Sep 21. (We are running Prometheus on Kubernetes using a
>>>>> persistent volume for the TSDB data directory.)
>>>>>
>>>>> I think the data was actually deleted by Prometheus itself, because
>>>>> disk usage was at 50GB just after the Prometheus server started and then
>>>>> dropped to around 8GB shortly after.
>>>>>
>>>>> Furthermore, we saw the following in Prometheus server logs:
>>>>>
>>>>> level=warn ts=2021-09-22T16:24:09.931Z caller=db.go:662 component=tsdb msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /data/wal/00007611 at 25123899: unexpected checksum a70d7089, expected 30cb982b"
>>>>>
>>>>> level=warn ts=2021-09-22T16:24:09.931Z caller=wal.go:354 component=tsdb msg="Starting corruption repair" segment=7611 offset=25123899
>>>>>
>>>>> level=warn ts=2021-09-22T16:24:09.933Z caller=wal.go:362 component=tsdb msg="Deleting all segments newer than corrupted segment" segment=7611
>>>>>
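To track which segment and offset the repair started from across restarts,
the key=value log lines above can be parsed mechanically. This is a rough
sketch, assuming the logfmt-style layout seen in these lines; the helper names
are made up for illustration and the parser does not handle every logfmt
edge case.

```python
import re

# key=value pairs where the value is a double-quoted string or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

# Matches the error text seen in the "Encountered WAL read error" line.
_CORRUPTION = re.compile(r"corruption in segment (\S+) at (\d+)")


def parse_logfmt(line: str) -> dict:
    """Split one log line into a dict of fields, stripping surrounding quotes."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def corruption_location(line: str):
    """Return (segment_path, offset) if the line reports WAL corruption, else None."""
    err = parse_logfmt(line).get("err", "")
    match = _CORRUPTION.search(err)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Feeding the server log through `corruption_location` line by line would list
every corrupted segment it reported, which may help when comparing against the
timestamps of the disk-full errors.
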
>>>>>
>>>>> And if I look at the TSDB data volume (/data), I see a gap between
>>>>> Sep 10 and Sep 22 where a bunch of data directories look like they got
>>>>> deleted:
>>>>>
>>>>> /data $ stat -c '%y %n' * | sort
>>>>> 2020-11-24 18:17:06.000000000 lost+found
>>>>> 2020-11-24 18:17:14.000000000 lock
>>>>> 2021-09-08 09:02:11.000000000 01FF2A7WB036B9VEN19TXFN6K6
>>>>> 2021-09-09 03:03:33.000000000 01FF47ZTKY1SJBPKT11VED354K
>>>>> 2021-09-09 21:06:33.000000000 01FF65WM7VS00ZQ98C1W0GAJ99
>>>>> 2021-09-10 20:52:19.000000000 01FF8QK0AF514208XDW7W576QD
>>>>> 2021-09-22 17:01:29.000000000 01FG776WNKM6W4ZPADSG20GK6H
>>>>> 2021-09-22 21:00:36.000000000 01FG7MXRH65JXVR46PD0SWE951
>>>>> 2021-09-22 21:01:30.000000000 01FG7MZF9EC1HMW57K9GR866WY
>>>>> 2021-09-22 23:00:30.000000000 01FG7VSG0FBFFXWS1QJ68208HW
>>>>> 2021-09-23 01:00:01.000000000 01FG82N78GAX6C31MMWR14K8BH
>>>>> 2021-09-23 01:00:01.000000000 chunks_head
>>>>> 2021-09-23 01:00:01.000000000 wal
>>>>> 2021-09-23 02:42:48.000000000 queries.active
>>>>>
>>>>>
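The block directory names in the listing are ULIDs, whose first 10 characters
encode a millisecond timestamp in Crockford base32, so the gap can also be
found from the names alone without relying on mtimes. Here is a minimal
sketch; the function names are invented for illustration. Note that the ULID
time is when the block was created, not the sample range it covers (that is in
each block's meta.json as minTime/maxTime), so this only approximates the
missing window.

```python
from datetime import datetime, timedelta, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ulid_time(name: str) -> datetime:
    """Decode the creation timestamp from the first 10 characters of a ULID."""
    if len(name) != 26:
        raise ValueError(f"not a 26-character ULID: {name!r}")
    millis = 0
    for ch in name[:10]:
        # str.index raises ValueError on characters outside the alphabet.
        millis = millis * 32 + CROCKFORD.index(ch)
    return EPOCH + timedelta(milliseconds=millis)


def largest_gap(names):
    """Return the pair of time-adjacent ULIDs with the largest gap between them."""
    ordered = sorted(names, key=ulid_time)
    if len(ordered) < 2:
        raise ValueError("need at least two block names")
    pairs = zip(ordered, ordered[1:])
    return max(pairs, key=lambda p: ulid_time(p[1]) - ulid_time(p[0]))


if __name__ == "__main__":
    blocks = [
        "01FF2A7WB036B9VEN19TXFN6K6",
        "01FF47ZTKY1SJBPKT11VED354K",
        "01FF65WM7VS00ZQ98C1W0GAJ99",
        "01FF8QK0AF514208XDW7W576QD",
        "01FG776WNKM6W4ZPADSG20GK6H",
        "01FG7MXRH65JXVR46PD0SWE951",
    ]
    before, after = largest_gap(blocks)
    print(before, ulid_time(before).isoformat())
    print(after, ulid_time(after).isoformat())
```
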
>>>>> Has anyone run into similar issues before or know why this data 
>>>>> corruption might be happening?
>>>>>
>>>>> Or, is there anywhere we can look for hints as to why TSDB thought the
>>>>> data was corrupted and removed so many of the chunks (over 40GB of data
>>>>> in our case)?
>>>>>
>>>>> Any help would be greatly appreciated. Thanks!
>>>>>
>>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "Prometheus Users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/prometheus-users/19423814-230d-42a0-ade4-c90abc272cc0n%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/prometheus-users/19423814-230d-42a0-ade4-c90abc272cc0n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

