ForVic commented on PR #52068:
URL: https://github.com/apache/spark/pull/52068#issuecomment-3412954635

   > To @ForVic , @mridulm and @sunchao , I have two questions.
   > 
   > 1. Are the failed driver pods supposed to be kept for a while with the 
diagnosis annotation?
   > 2. Who are the final audience to be able to see the diagnosis? A user or 
some other scraper or automated systems? Could you give us some examples which 
you are currently using?
   @dongjoon-hyun 
   1. The driver pod lifecycle is outside of Spark, so the Spark operator or 
external systems are free to manage it however they'd like. If they delete the 
pod immediately, then this implementation isn't useful to them. I've seen setups 
that watch the driver pod, capture diagnostics on completion, and then delete 
the pod, and I've also seen setups that periodically poll completed driver pods, 
capture this field, and then delete them (see the first sketch after this list).
   2. Both users and automation. It improves automated tooling for things like 
auto-memory-scaling on OOM, where we use this field to capture and classify the 
error (see the second sketch below). It also makes debugging easier when an 
exception in the user class fails the job: we can show the user the most likely 
error reason instead of always making them dig through logs.



