jiangxb1987 commented on issue #26975: [SPARK-26975][CORE] Stage retry and 
executor crash cause app hung up forever
URL: https://github.com/apache/spark/pull/26975#issuecomment-568607140
 
 
   I think I can confirm this is a bug. It was introduced when we added the 
`sched.markPartitionCompletedInAllTaskSets` logic: when a task attempt 
from one TSM succeeds, the partition is marked as completed in all the 
TSMs targeting the same stage. Unfortunately, we didn't try to kill the 
running task attempts when marking the partitions as completed, so when 
those running task attempts later fail with ExecutorLost, they revert 
the completed partition result (which is not necessary).
   
   To me, the best solution here would be to kill all the running task attempts 
in the TSM inside the method `markPartitionCompleted`; this should resolve the 
issue without any side effects.
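
   To make the sequence concrete, here is a minimal toy model of the problem and of the proposed fix. It is only a sketch: `ToyTaskSetManager`, `launch`, `executorLost`, `isCompleted`, and the `killRunningOnComplete` flag are hypothetical names made up for illustration, not Spark's actual `TaskSetManager` code, and killing only the attempts for the completed partition is an assumption of the sketch.

```scala
import scala.collection.mutable

// Toy model only: a hypothetical stand-in, NOT Spark's TaskSetManager.
final class ToyTaskSetManager(numPartitions: Int, killRunningOnComplete: Boolean) {
  private val successful = Array.fill(numPartitions)(false)
  // partition -> ids of the attempts currently running in this TSM
  private val running = mutable.Map.empty[Int, mutable.Set[Long]]
  // ids of the attempts this TSM asked to kill
  val killed = mutable.Buffer.empty[Long]

  def launch(partition: Int, attemptId: Long): Unit =
    running.getOrElseUpdate(partition, mutable.Set.empty[Long]) += attemptId

  // Called when an attempt from another TSM of the same stage finished this partition.
  def markPartitionCompleted(partition: Int): Unit = {
    successful(partition) = true
    if (killRunningOnComplete) {
      // Proposed fix (assumed shape): stop tracking the now-redundant attempts and kill them.
      killed ++= running.remove(partition).getOrElse(mutable.Set.empty[Long])
    }
  }

  // Called for an attempt that was running on a lost executor.
  def executorLost(partition: Int, attemptId: Long): Unit = {
    val stillTracked = running.get(partition).exists(_.remove(attemptId))
    if (stillTracked) {
      // Models the unnecessary revert: the partition is marked incomplete again.
      successful(partition) = false
    }
  }

  def isCompleted(partition: Int): Boolean = successful(partition)
}

object Demo {
  // Replays the sequence from the comment and reports whether partition 0 stays completed.
  def run(killRunningOnComplete: Boolean): Boolean = {
    val tsm = new ToyTaskSetManager(numPartitions = 1, killRunningOnComplete = killRunningOnComplete)
    tsm.launch(partition = 0, attemptId = 1L)        // attempt still running in this TSM
    tsm.markPartitionCompleted(0)                    // another TSM finished partition 0
    tsm.executorLost(partition = 0, attemptId = 1L)  // the running attempt's executor dies
    tsm.isCompleted(0)
  }

  def main(args: Array[String]): Unit = {
    println(s"without kill: completed = ${run(killRunningOnComplete = false)}") // false
    println(s"with kill:    completed = ${run(killRunningOnComplete = true)}")  // true
  }
}
```

   In this model the partition only stays completed when the running attempts are killed at the moment it is marked completed, which is the behavior the proposal is after.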
   
   Also cc @squito @cloud-fan @Ngone51 

