GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/2366
[SPARK-3495] Block replication fails continuously when the replication target node is dead

If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, this continues to fail indefinitely. Even if the driver learns about the demise of B, A keeps trying to replicate to B and failing miserably. The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B dies. This affects Spark Streaming, as its receiver uses block replication.

The solution in this patch makes the following changes.
- Changed BlockManagerMaster to return all the peers of a block manager, rather than only the requested number of peers.
- Refactored BlockManager's replication code to handle peer caching correctly.
    + The peer for replication is randomly selected. This differs from past behavior where, for a node A, a node B was deterministically chosen for the lifetime of the application.
    + If replication to one node fails, the peers are refetched.
    + The cached peer list has a TTL of 1 second, so that new peers are discovered and used for replication.
- Added replication unit tests (replication was not tested till now, duh!)

This should not affect the performance of Spark workloads that do not use replication.
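The peer-caching behavior described above (random peer selection, refetch on failure, and a 1-second TTL) can be illustrated with a minimal sketch. This is not the code in the patch: it is written in Python rather than Spark's Scala, and `PeerCache`, `replicate`, `fetch_peers`, `send`, and `max_failures` are hypothetical names introduced only for illustration.

```python
import random
import time


class PeerCache:
    """Caches the peer list obtained from a master, with a time-to-live."""

    def __init__(self, fetch_peers, ttl_seconds=1.0, clock=time.monotonic):
        # fetch_peers is a hypothetical callback that returns ALL current peers.
        self._fetch_peers = fetch_peers
        self._ttl = ttl_seconds
        self._clock = clock
        self._peers = []
        self._fetched_at = None

    def get(self, force_refresh=False):
        """Return the cached peers, refetching if forced or if the TTL expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if force_refresh or expired:
            self._peers = list(self._fetch_peers())
            self._fetched_at = now
        return list(self._peers)


def replicate(block, cache, send, replicas=1, max_failures=3, rng=random):
    """Send `block` to up to `replicas` distinct peers; return the peers that accepted it."""
    done = []          # peers that accepted the block
    failed = set()     # peers that failed during this call
    failures = 0
    force = False
    while len(done) < replicas and failures < max_failures:
        peers = cache.get(force_refresh=force)
        force = False
        candidates = [p for p in peers if p not in done and p not in failed]
        if not candidates:
            break  # nobody left to try
        peer = rng.choice(candidates)  # random, not a fixed target
        try:
            send(peer, block)
            done.append(peer)
        except ConnectionError:
            failed.add(peer)
            failures += 1
            force = True  # refetch the peer list after a failure
    return done
```

The retry bound (`max_failures`) and the choice to skip peers that already failed within one call are assumptions of this sketch; the description above only states that peers are refetched on failure and that the cache expires after 1 second.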
@andrewor14 @JoshRosen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark replication-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2366.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2366

----
commit af0c1daea8a22bca3b7826322205c887370ce247
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2014-09-11T02:54:31Z

    Added replication unit tests to BlockManagerSuite

commit 9f0ac9fb20660ff183490d13f4e3195b9283bc61
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2014-09-11T08:44:18Z

    Modified replication tests to fail on replication bug.

commit d081bf60e87689994a006603f84cb8f22ab19c6a
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2014-09-11T20:58:14Z

    Fixed bug in get peers and unit tests to test get-peers and replication under executor churn.

commit 03de02d532f51b23bc1b79fc76115aacbd64a4b1
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2014-09-12T00:46:02Z

    Change replication logic to correctly refetch peers from master on failure and on new worker addition.

commit 7598f913c52728f25b6bce91dd9ae6879105e261
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2014-09-12T00:52:16Z

    Minor changes.

----