Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/12527#issuecomment-213034425
The root cause of the deadlock has been found. Essentially, we should
prevent "localhost" from being returned as a `FileScanRDD` preferred location.
Here's a detailed description of the whole process:
1. The test case involves a shuffle, which results in a `ShuffleRowRDD`,
whose preferred locations are determined by the locations of the block manager
that serves the corresponding map output blocks. In short, in local testing,
the only preferred location string is the IP address of the block manager.
1. In the case of local testing, `FileScanRDD.preferredLocations` always
returns "localhost".
1. As a result, task set `ts1` derived from the `ShuffleRowRDD` and task
set `ts2` derived from the `FileScanRDD` have different locality preferences.
1. After job submission, `DAGScheduler` first schedules `ts1`. While
trying to schedule `ts2`, delayed scheduling is triggered because `ts1` and
`ts2` have different preferred locations. By default, `DAGScheduler` waits for
3s before trying `ts2` again.
1. 3s is long enough for all tasks in `ts1` to finish. However,
`LocalBackend` doesn't revive offers periodically like other scheduler
backends. It only revives offers when tasks are submitted, finish, or fail.
Thus `ts2` never gets an opportunity to be scheduled again, and the submitted
job never finishes.
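
The timing argument in the steps above can be sketched with a toy model. This
is not Spark code: `ReviveOffersSketch`, `firstLaunchTime`, and the offer times
are hypothetical, and the model only assumes that a delayed task set can launch
on the first resource offer that arrives once the locality wait has expired.

```scala
object ReviveOffersSketch {
  // Returns the time of the first offer on which a delayed task set could
  // launch, i.e. the earliest offer at or after the locality wait expires.
  // None means no such offer ever arrives and the task set is stuck.
  def firstLaunchTime(offerTimesMs: Seq[Long], localityWaitMs: Long): Option[Long] =
    offerTimesMs.sorted.find(_ >= localityWaitMs)

  def main(args: Array[String]): Unit = {
    val localityWaitMs = 3000L

    // Event-driven backend (assumed timings): offers only at submission and
    // when the tasks of the first task set finish, all before 3s.
    val eventDriven = Seq(0L, 400L, 900L)
    println(firstLaunchTime(eventDriven, localityWaitMs)) // None

    // Backend that also revives offers every second (assumed period).
    val periodic = Seq(0L, 1000L, 2000L, 3000L, 4000L, 5000L)
    println(firstLaunchTime(periodic, localityWaitMs)) // Some(3000)
  }
}
```

With only event-driven offers, nothing triggers a new offer after the last
task of `ts1` finishes, so the wait expires unnoticed.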
The only factor that is not clear for now is how the number of buckets
(which affects the number of submitted tasks) interacts with the above process.
The fix for this issue is simple: just filter out all "localhost" entries in
`FileScanRDD.preferredLocations()`, since "localhost" doesn't make sense as a
preferred executor location. Actually, this is exactly the last step of what
`NewHadoopRDD.preferredLocations()` does.
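
A minimal sketch of that filtering step, not the actual patch: it assumes the
candidate host names are already available as a `Seq[String]`, and the names
`PreferredLocationsSketch`, `withoutLocalhost`, and `host-a` are hypothetical.

```scala
object PreferredLocationsSketch {
  // Drops every "localhost" entry from a list of candidate host names.
  // An empty result means the task has no locality preference.
  def withoutLocalhost(hosts: Seq[String]): Seq[String] =
    hosts.filter(_ != "localhost")

  def main(args: Array[String]): Unit = {
    println(withoutLocalhost(Seq("localhost")))           // List()
    println(withoutLocalhost(Seq("host-a", "localhost"))) // List(host-a)
  }
}
```

In the local test case this leaves the task with no preferred location, so it
no longer has to wait for a locality match.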