ForVic commented on PR #52068: URL: https://github.com/apache/spark/pull/52068#issuecomment-3412954635
> To @ForVic , @mridulm and @sunchao , I have two questions.
>
> 1. Are the failed driver pods supposed to be kept for a while with the diagnosis annotation?
> 2. Who is the final audience able to see the diagnosis? A user, or some other scraper or automated system? Could you give us some examples which you are currently using?

@dongjoon-hyun

1. Driver pod lifecycle is outside of Spark, so the Spark operator or other external systems are free to manage it however they'd like. If they delete the pod instantly, then this implementation isn't useful to them. I've seen setups that keep a watch on the driver pod, capture diagnostics on completion, and then delete the pod, and I've also seen setups that periodically poll completed driver pods, capture this field, and then delete them.
2. Both users and automation. It has improved automated tooling for things like auto-memory-scaling on OOM, where we use this to capture and classify the error. It also makes debugging easier when an exception in the user class fails the job: we can show users the most likely error reason instead of always making them dig through the logs.
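
To make the watch-based pattern from point 1 concrete, here is a minimal sketch using the fabric8 Kubernetes client (which Spark itself ships with). The annotation key `spark.kubernetes.driver/diagnostics`, the `spark-jobs` namespace, and the overall flow are illustrative assumptions only; the actual annotation name introduced by this PR may differ.

```scala
import java.util.concurrent.CountDownLatch

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientBuilder, Watcher, WatcherException}
import io.fabric8.kubernetes.client.Watcher.Action

object DriverDiagnosticsWatcher {
  // Hypothetical annotation key -- the real key added by this PR may be named differently.
  private val DiagnosticsAnnotation = "spark.kubernetes.driver/diagnostics"

  def main(args: Array[String]): Unit = {
    val client = new KubernetesClientBuilder().build()
    val done = new CountDownLatch(1)

    client.pods()
      .inNamespace("spark-jobs")            // example namespace
      .withLabel("spark-role", "driver")    // label Spark places on driver pods
      .watch(new Watcher[Pod] {
        override def eventReceived(action: Action, pod: Pod): Unit = {
          val phase = pod.getStatus.getPhase
          if (action == Action.MODIFIED && (phase == "Succeeded" || phase == "Failed")) {
            // Capture the diagnosis before the pod is cleaned up.
            Option(pod.getMetadata.getAnnotations)
              .flatMap(a => Option(a.get(DiagnosticsAnnotation)))
              .foreach { diagnosis =>
                println(s"Driver ${pod.getMetadata.getName} ($phase): $diagnosis")
                // External tooling could classify the diagnosis here (e.g. OOM
                // detection feeding auto-memory-scaling) and then delete the pod.
              }
          }
        }

        override def onClose(cause: WatcherException): Unit = {
          // Production tooling would re-establish the watch; stop here for brevity.
          done.countDown()
        }
      })

    done.await()
    client.close()
  }
}
```

The polling variant mentioned above would instead list completed driver pods on a schedule, read the same annotation, and then delete them.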
