Mridul Muralidharan created SPARK-24755:
-------------------------------------------
Summary: Executor loss can cause task to be not resubmitted
Key: SPARK-24755
URL: https://issues.apache.org/jira/browse/SPARK-24755
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.3.0
Reporter: Mridul Muralidharan
As part of SPARK-22074, when an executor is lost, TSM.executorLost currently
checks "if (successful(index) && !killedByOtherAttempt(index))" to decide
whether the task for a partition needs to be resubmitted.
Consider the following:
For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 respectively
(one of them being speculative task)
T1 finishes successfully first.
This results in setting "killedByOtherAttempt(P1) = true", since T2 is still running.
We also end up killing task T2.
Now, if/when exec-1 goes MIA:
executorLost will no longer reschedule a task for P1, since
killedByOtherAttempt(P1) == true, even though the output of P1 was produced by
T1 on exec-1 and there is no other copy of P1 around (T2 was killed, not T1,
which was successful).
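
The sequence above can be reproduced with a small toy model. This is only a
sketch of the bookkeeping as described in this report, not Spark's actual
TaskSetManager: the class ToyTaskSet, its methods launch/taskSucceeded, and the
maps outputLocation and running are hypothetical names invented for
illustration. Only the final condition in executorLost is taken from the check
quoted above; everything else (shuffle service, stage type, and so on) is
ignored.

```scala
import scala.collection.mutable

// Toy model only: NOT Spark's TaskSetManager. All names except the two flag
// arrays and the final condition are assumptions made for illustration.
final class ToyTaskSet(numPartitions: Int) {
  val successful: Array[Boolean] = Array.fill(numPartitions)(false)
  val killedByOtherAttempt: Array[Boolean] = Array.fill(numPartitions)(false)
  // partition index -> executor that holds the successful attempt's output
  private val outputLocation = mutable.Map.empty[Int, String]
  // partition index -> executors that currently run an attempt
  private val running = mutable.Map.empty[Int, mutable.Set[String]]

  def launch(index: Int, exec: String): Unit =
    running.getOrElseUpdate(index, mutable.Set.empty[String]) += exec

  def taskSucceeded(index: Int, exec: String): Unit = {
    successful(index) = true
    outputLocation(index) = exec
    val attempts = running.getOrElseUpdate(index, mutable.Set.empty[String])
    attempts -= exec
    // Any other still-running attempt is killed; the flag is kept per
    // partition, so it does not record WHICH attempt was killed.
    if (attempts.nonEmpty) {
      killedByOtherAttempt(index) = true
      attempts.clear()
    }
  }

  // Returns the partitions that would be resubmitted under the quoted check.
  def executorLost(exec: String): Seq[Int] = {
    val lost = outputLocation.toSeq.collect { case (index, e) if e == exec => index }.sorted
    lost.foreach(index => outputLocation.remove(index))
    lost.filter(index => successful(index) && !killedByOtherAttempt(index))
  }
}

object Spark24755Sketch {
  def main(args: Array[String]): Unit = {
    val ts = new ToyTaskSet(1)
    ts.launch(0, "exec-1")        // T1
    ts.launch(0, "exec-2")        // T2 (speculative copy)
    ts.taskSucceeded(0, "exec-1") // T1 wins, T2 is killed
    val resubmitted = ts.executorLost("exec-1")
    // The only copy of P1's output was on exec-1, yet nothing is resubmitted.
    println(s"nothing resubmitted: ${resubmitted.isEmpty}")
  }
}
```

In this model the per-partition flag is what hides the loss: once it is set,
losing the executor of the successful attempt looks the same as losing the
executor of the killed attempt.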
I noticed this bug while reviewing PR #21653 for SPARK-13343.
Essentially, SPARK-22074 causes a regression (which I don't usually observe,
thanks to the shuffle service, sigh), and as such the fix is broken IMO. I
believe the problem was introduced during the review (the original change
looked fine to me, but I did not look at it in detail).
I don't have a PR handy for this, so if anyone wants to pick it up, please
feel free!
+CC [~XuanYuan] who fixed SPARK-22074 initially.