arham0254A commented on PR #56736:
URL: https://github.com/apache/spark/pull/56736#issuecomment-4801968776

   @pan3793 That would definitely be the ideal behavior for downstream debugging.
   
   However, the issue originally reported is that external orchestrators 
(like Airflow or bash scripts) suffer silent pipeline failures because 
`spark-submit` returns `0` (success) even when the remote driver crashes. This 
PR is aimed at providing an immediate fix to stop those false positives.
   
   Currently, the `reportDriverStatus` method relies entirely on the 
`DriverStatusResponse` RPC message, which only contains the `DriverState` enum 
(FINISHED, FAILED, ERROR, KILLED) and an `Option[Exception]`. The actual 
integer exit code of the remote JVM process is not passed back from the 
Master to the Client in that payload.
   
   To forward the real exit code, we would need to significantly expand the 
scope of this PR by modifying the internal RPC protocol across the Worker, 
Master, and Client to capture, store, and transmit that specific integer. 
   
   Given that architectural constraint, would it be acceptable to stick with a 
generic non-zero code (`-1`) for this PR to immediately resolve the critical 
silent-failure bug for orchestrators, and perhaps open a follow-up ticket to 
enhance the RPC protocol later?
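
   As a rough sketch of what the generic mapping could look like (not the 
actual patch): the types below are simplified stand-ins that model only the 
two fields mentioned above, and `ExitCodes.exitCodeFor` is a hypothetical 
helper name, not an existing Spark API.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's real classes.
object DriverState extends Enumeration {
  val FINISHED, FAILED, ERROR, KILLED = Value
}

// Models only the state and the optional exception described in this comment.
final case class DriverStatus(
    state: DriverState.Value,
    exception: Option[Exception])

object ExitCodes {
  // Hypothetical helper: FINISHED maps to 0, every other listed state to a
  // generic -1, because no real exit code is available in the payload.
  def exitCodeFor(status: DriverStatus): Int = status.state match {
    case DriverState.FINISHED => 0
    case _ => -1
  }
}
```

   The client would then end with something like 
`sys.exit(ExitCodes.exitCodeFor(status))`. One caveat: on POSIX systems the 
exit status is truncated to 8 bits, so a shell or orchestrator would observe 
`-1` as `255`, which is still non-zero.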


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


