Hi,

this morning we noticed a Prometheus server with 3.3TB of metrics had stopped 
returning metrics older than ~2h30m. The disk still held the full 3.3TB of 
data.

When I restarted the Prometheus server, it started replaying the WAL and 
found a corrupted segment. It then deleted all segments after the corrupted 
one ... in the end, the 3.3TB of data shrank to 48GB ...

I don't understand why a corrupted segment implies deleting all newer 
segments. To me this makes no sense and makes the Prometheus TSDB 
unreliable. I would have expected the TSDB to be rock solid and able to 
recover from segment corruption, or in the worst case to lose just the 
corrupted segment ... not all segments newer than it.

What is the technical reason behind this?

Thank you

Regards
++ Jerome

