[prometheus-users] All samples lost when Prometheus server returns 500 to Prometheus agent

koly li Fri, 17 May 2024 02:02:55 -0700

Hello all,

Recently we found that all of our samples were being lost. After some 
investigation, we found the following:
1. We use the Prometheus agent to send all data to a Prometheus server via 
remote write.
2. The agent's sample-sending code is in storage/remote/queue_manager.go, in 
the function sendWriteRequestWithBackoff().
3. Inside that function, if attempt (the function that makes the request to 
the Prometheus server) returns a recoverable error, the request is retried.
4. When is a recoverable error returned? One scenario is when the Prometheus 
server responds with a 5xx error.
5. I think not every 5xx error is recoverable, and there is no other way to 
exit the for loop in sendWriteRequestWithBackoff(). The agent kept 
retrying, but every time it received a 5xx from the server, so we lost all 
samples for hours until we restarted the agent.


So my question is: why is a 5xx from the Prometheus server considered 
recoverable? I also believe there should be a way to exit the loop, for 
example a maximum number of retries.

It seems that the agent mode is not mature enough to work in production.

