Hi, this morning we noticed that a Prometheus server with 3.3TB of metrics had stopped returning metrics older than ~2h30. The disk was still full with 3.3TB of data.
When I restarted the Prometheus server, it started replaying the WAL and found a corrupted segment. It then deleted all segments after the corrupted one ... in the end, the 3.3TB of data were reduced to 48GB.

I don't understand why a corrupted segment implies deleting all newer segments. To me this makes no sense and makes the Prometheus TSDB unreliable. I would have expected the TSDB to be rock solid and able to recover from a corrupted segment, or in the worst case to lose just that segment ... not every segment newer than the corrupted one. What is the technical reason behind this?

Thank you.

Regards
++
Jerome
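PS: my rough understanding of the repair behavior is sketched below. This is hypothetical illustration code, not the actual Prometheus implementation (the real logic lives in the TSDB WAL package), and the function name, paths and offsets are my own inventions. The assumption it encodes is that WAL records can span segment boundaries, so once a segment is corrupt, the repair discards everything after the corruption point rather than trying to resynchronize mid-stream.

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strconv"
    )

    // repairWAL is a hypothetical sketch: on finding corruption in segment
    // corruptSeg at byte offset goodOffset, delete every later segment and
    // truncate the corrupted segment at its last valid record. If records
    // may cross segment boundaries, data after the corruption point cannot
    // be assumed to decode into a consistent stream.
    func repairWAL(dir string, corruptSeg int, goodOffset int64) error {
        entries, err := os.ReadDir(dir)
        if err != nil {
            return err
        }
        for _, e := range entries {
            n, err := strconv.Atoi(e.Name())
            if err != nil {
                continue // not a numbered segment file
            }
            if n > corruptSeg {
                // Everything newer than the corruption is dropped wholesale.
                if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
                    return err
                }
            }
        }
        // Keep the corrupted segment only up to the last known-good offset
        // (segment names assumed zero-padded to 8 digits, e.g. "00000042").
        return os.Truncate(filepath.Join(dir, fmt.Sprintf("%08d", corruptSeg)), goodOffset)
    }

    func main() {
        // Example: repair a WAL dir where segment 42 is corrupt at byte 4096.
        if err := repairWAL("data/wal", 42, 4096); err != nil {
            fmt.Println("repair failed:", err)
        }
    }

If that understanding is wrong, I'd be happy to be corrected.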