Juho Salmio created SPARK-42923:
-----------------------------------

             Summary: Delayed scheduling doesn’t work in some situations in 
local mode if different localities present in loaded files leading to tasks 
getting stuck
                 Key: SPARK-42923
                 URL: https://issues.apache.org/jira/browse/SPARK-42923
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 3.3.2
            Reporter: Juho Salmio


I stumbled on the following issue when running Spark in local mode where some 
of the loaded files were located on the same host as Spark and others were not.

Symptom: Some task in a larger job would consistently get stuck with no 
immediately clear errors in the logs. My hope/expectation would have been that, 
even if some tasks failed to complete within some expected time, the job would 
have retried the task or failed completely with an exception, and not just 
stayed stuck forever.

Workaround:
Setting spark.locality.wait.node to 0s seemed to stop tasks from getting stuck 
in my environment.
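
For illustration only, one way to pass this setting is as a --conf option 
to spark-submit. The small helper below just assembles such a command line; 
the helper name spark_submit_command and the application name my_job.py are 
hypothetical and not part of this report, and local[*] is only an assumed 
master setting.

```python
import shlex


def spark_submit_command(app, conf, master="local[*]"):
    """Build a spark-submit command line with the given --conf settings.

    Hypothetical helper for illustration; it only formats a string and
    does not start Spark.
    """
    args = ["spark-submit", "--master", master]
    for key, value in conf.items():
        # Each setting is passed as: --conf key=value
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    # shlex.join quotes arguments that contain shell-special characters.
    return shlex.join(args)


cmd = spark_submit_command("my_job.py", {"spark.locality.wait.node": "0s"})
print(cmd)
```

The same key can also be set through any other mechanism that accepts Spark 
configuration properties.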

Potential root cause:
I managed to reproduce the issue in the Spark codebase by adding a test case 
to FileSourceStrategySuite that tries to read two files into a table, where 
one file is located on the same host as the local Spark executor and the other 
on some other host.
https://github.com/apache/spark/commit/c23db78863c7342ae7b7bc3922a200a523e45538

While digging into the issue with the debugger, I finally noticed that 
LocalSchedulerBackend is missing the reviveThread present in 
CoarseGrainedSchedulerBackend, which forces resourceOffers in 
TaskSchedulerImpl to be called periodically and not just on task updates.
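
The suspected mechanism can be sketched with a toy model. This is a heavily 
simplified simulation under stated assumptions, not Spark code: ToyScheduler, 
resource_offer and run are hypothetical names, there is a single local slot, 
time is simulated, and only two locality levels exist. The 3.0 second wait 
mirrors the default of spark.locality.wait. The "remote-file" task can only 
run once the allowed locality level has been relaxed to ANY, and the level is 
only re-evaluated when an offer is made.

```python
from dataclasses import dataclass, field

NODE_LOCAL, ANY = 0, 1  # locality levels, strictest first


@dataclass
class ToyScheduler:
    """Toy delay scheduler for one local executor (hypothetical, not Spark code)."""
    locality_wait: float                          # seconds before relaxing locality
    pending: dict = field(default_factory=dict)   # task -> strictest level it can run at
    launched: list = field(default_factory=list)  # task names in launch order
    level: int = NODE_LOCAL                       # currently allowed locality level
    last_launch: float = 0.0                      # time of the most recent launch

    def resource_offer(self, now: float):
        """Offer the local slot once; launch at most one eligible task."""
        # The level is relaxed only here, i.e. only when somebody makes an offer.
        if self.level == NODE_LOCAL and now - self.last_launch >= self.locality_wait:
            self.level = ANY
        for task, needed in list(self.pending.items()):
            if needed <= self.level:
                del self.pending[task]
                self.launched.append(task)
                self.last_launch = now
                return task
        return None


def run(offer_times, locality_wait=3.0):
    """Make one offer at each simulated time; return (launched, still pending)."""
    sched = ToyScheduler(locality_wait=locality_wait,
                         pending={"local-file": NODE_LOCAL, "remote-file": ANY})
    for t in offer_times:
        sched.resource_offer(t)
    return sched.launched, sorted(sched.pending)


# Offers only on submit (t=0) and when the local task reports back (t=1).
event_driven = run([0.0, 1.0])
# Same events plus a periodic revive once per second.
with_revive = run([0.0, 1.0, 2.0, 3.0, 4.0])
# Event-driven offers only, but with the locality wait set to zero.
zero_wait = run([0.0, 1.0], locality_wait=0.0)
```

In this model, event_driven leaves "remote-file" pending forever because no 
offer arrives after the wait has expired, while with_revive and zero_wait 
both launch it. That matches the observed hang and the workaround, but it 
does not prove that the real scheduler behaves the same way.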

Potential fix:
Add the revive thread to LocalSchedulerBackend as well.
I don’t understand the codebase well enough to know whether simply adding the 
revive thread to LocalSchedulerBackend could have unwanted side effects.

Questions/Observations:
Should delayed scheduling work at all in local mode?
This issue probably also affects the case where, instead of a local file, the 
files being loaded are one that is rack-local to the executor and another that 
is not rack-local.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
