[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237372#comment-14237372 ]
Andrew Or commented on SPARK-4759:
----------------------------------
Found the issue. The task scheduler schedules tasks based on the preferred
locations specified by the partition. In CoalescedRDD's partitions, we use the
empty string as the default preferred location, even though this does not
actually represent a real host:
https://github.com/apache/spark/blob/e895e0cbecbbec1b412ff21321e57826d2d0a982/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L41
As a result, the task scheduler doesn't schedule a subset of the tasks on the
local executor because these tasks are supposed to be scheduled on the host ""
(empty string) that doesn't actually exist. I have not yet dug into
PartitionCoalescer to find out why this happens only in local mode.
I'll submit a fix shortly.
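
To make the failure mode above concrete, here is a minimal, self-contained
sketch. It is a toy model and not Spark's actual scheduler code: the names
EmptyHostSketch, Task, runnableOn, buggyTask and fixedTask are all
hypothetical, and the "fixed" variant is only one plausible remedy, not
necessarily what the eventual patch does. It assumes a strictly
locality-driven scheduler that offers a task to a host only if the task has
no preference or names that host.

```scala
// Toy model only: this is NOT Spark's TaskSetManager, just an illustration
// of why a placeholder host name can leave tasks unscheduled.
object EmptyHostSketch {
  // Hypothetical task: an id plus the hosts it would like to run on.
  final case class Task(id: Int, preferredHosts: Seq[String])

  // Strict locality-only scheduling: a task is offered to `host` if it has
  // no preference at all, or if `host` is one of its preferred hosts.
  def runnableOn(host: String, tasks: Seq[Task]): Seq[Task] =
    tasks.filter(t => t.preferredHosts.isEmpty || t.preferredHosts.contains(host))

  // Buggy default: "" is meant as "no preference" but looks like a host name.
  def buggyTask(id: Int, location: String = ""): Task =
    Task(id, Seq(location))

  // One possible remedy (an assumption): model "no preference" as None so
  // that it can never be mistaken for a real host.
  def fixedTask(id: Int, location: Option[String] = None): Task =
    Task(id, location.toList)

  def main(args: Array[String]): Unit = {
    val host = "localhost"
    val buggy = Seq(buggyTask(0, host), buggyTask(1), buggyTask(2))
    val fixed = Seq(fixedTask(0, Some(host)), fixedTask(1), fixedTask(2))
    println(runnableOn(host, buggy).map(_.id)) // only task 0 is offered
    println(runnableOn(host, fixed).map(_.id)) // all three tasks are offered
  }
}
```

In the buggy variant, tasks 1 and 2 wait for a host named "" that no
executor will ever report, which matches the symptom of only a subset of the
final stage's tasks being scheduled.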
> Deadlock in complex spark job in local mode
> -------------------------------------------
>
> Key: SPARK-4759
> URL: https://issues.apache.org/jira/browse/SPARK-4759
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.1, 1.2.0, 1.3.0
> Environment: Java version "1.7.0_51"
> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
> Mac OSX 10.10.1
> Using local spark context
> Reporter: Davis Shepherd
> Assignee: Andrew Or
> Priority: Critical
> Attachments: SparkBugReplicator.scala
>
>
> The attached test class runs two identical jobs that perform some iterative
> computation on an RDD[(Int, Int)]. This computation involves
> # taking new data and merging it with the previous result
> # caching and checkpointing the new result
> # rinse and repeat
> The first time the job is run, it runs successfully, and the spark context is
> shut down. The second time the job is run with a new spark context in the
> same process, the job hangs indefinitely, only having scheduled a subset of
> the necessary tasks for the final stage.
> I've been able to produce a test case that reproduces the issue, and I've
> added some comments where some knockout experimentation has left some
> breadcrumbs as to where the issue might be.