YongGang commented on PR #16442:
URL: https://github.com/apache/druid/pull/16442#issuecomment-2116659070

   > Why limit this functionality just to stuck supervisors because of parse 
exceptions? If a supervisor is failing continuously (because of OOMs or 
something else) and the offsets to read from are not moving forward, it seems 
like stopping the supervisor and notifying the operator would be a good thing 
to do.
   
   When tasks fail with OOM errors on a MiddleManager, the cause may be other 
processes on the same host that temporarily consume excessive memory. Once 
those processes terminate (say, within 10 minutes), the memory pressure may 
ease and tasks can resume normal operation. Halting task creation in response 
to transient or uncertain failures such as OOM is therefore not always 
appropriate, because the situation may resolve itself without operator 
intervention.
   In contrast, parse exceptions directly affect the Supervisor's ability to 
process data correctly and do not resolve on their own. They require an 
operator to adjust the parsing logic or the expected data format. In such 
cases, stopping the Supervisor and notifying the operator is necessary, since 
it ensures that no further tasks fail on the same unaddressed issue.
   This approach balances responsiveness to critical, actionable issues with 
avoidance of unnecessary interruptions in cases where conditions may soon 
return to normal.
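
   To illustrate the distinction, here is a minimal sketch of the decision 
described above. All names (`SupervisorFailurePolicy`, `FailureKind`, 
`Action`, `decide`) and the threshold parameter are hypothetical and are not 
taken from Druid or from this PR. It only assumes that the most recent task 
failures can be classified by cause.

```java
import java.util.List;

// Hypothetical sketch only: none of these types exist in Druid.
public class SupervisorFailurePolicy {
    enum FailureKind { PARSE_EXCEPTION, OUT_OF_MEMORY, OTHER }
    enum Action { KEEP_RETRYING, SUSPEND_AND_NOTIFY }

    // Suspend only when the last `threshold` failures were all parse
    // exceptions; any other cause (e.g. OOM) is treated as possibly transient.
    static Action decide(List<FailureKind> recentFailures, int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        int n = recentFailures.size();
        if (n < threshold) {
            return Action.KEEP_RETRYING; // not enough evidence yet
        }
        boolean allParseFailures = recentFailures
            .subList(n - threshold, n)
            .stream()
            .allMatch(kind -> kind == FailureKind.PARSE_EXCEPTION);
        return allParseFailures ? Action.SUSPEND_AND_NOTIFY : Action.KEEP_RETRYING;
    }

    public static void main(String[] args) {
        // Three consecutive parse failures: operator action is needed.
        System.out.println(decide(List.of(
            FailureKind.PARSE_EXCEPTION,
            FailureKind.PARSE_EXCEPTION,
            FailureKind.PARSE_EXCEPTION), 3));
        // An OOM in the window: may clear up by itself, so keep retrying.
        System.out.println(decide(List.of(
            FailureKind.PARSE_EXCEPTION,
            FailureKind.OUT_OF_MEMORY,
            FailureKind.PARSE_EXCEPTION), 3));
    }
}
```

   In this sketch a single non-parse failure inside the window resets the 
decision to retrying, which mirrors the argument that uncertain causes should 
not stop the Supervisor.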
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

