GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15085
[SPARK-17484] Prevent invalid block locations from being reported after
put() exceptions
## What changes were proposed in this pull request?
If a BlockManager `put()` call failed after the BlockManagerMaster was
notified of a block's availability, then incomplete cleanup logic in a `finally`
block would never send a second block status message to inform the master of the
block's unavailability. This, in turn, led to fetch failures and, before #15037
was fixed, could cause complete job failures.
This patch addresses this issue via multiple small changes:
- The `finally` block now calls `removeBlockInternal` when cleaning up from
a failed `put()`; in addition to removing the `BlockInfo` entry (which was
_all_ that the old cleanup logic did), this code (redundantly) tries to remove
the block from the memory and disk stores (as an added layer of defense against
bugs lower down in the stack) and optionally notifies the master of block
removal (which now happens during exception-triggered cleanup).
- When a BlockManager receives a request for a block that it does not have,
it will now notify the master to update its block locations. This ensures that
bad metadata pointing to non-existent blocks will eventually be fixed. Note
that I could have implemented this logic in the block manager client (rather
than in the remote server), but that would introduce the problem of
distinguishing between transient and permanent failures; on the server,
however, we know definitively that the block isn't present.
- Catch `NonFatal` instead of `Exception` to avoid swallowing
`InterruptedException`s thrown from synchronous block replication calls.
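The cleanup pattern behind the first and third bullets can be sketched roughly as follows. This is a simplified stand-in, not Spark's actual `BlockManager` code: `removeBlockInternal` here is a stub that only records what it was asked to do, and `doPut` is a hypothetical wrapper illustrating the `finally`-block cleanup plus the `NonFatal` catch.

```scala
import scala.util.control.NonFatal
import scala.collection.mutable

// Simplified sketch of the cleanup pattern described above: if put() fails,
// the finally block removes the block and tells the master about the removal,
// so no stale block location survives the failure.
object PutCleanupSketch {
  val removedBlocks = mutable.Set[String]()        // blocks dropped from stores
  val masterNotifiedRemoved = mutable.Set[String]() // removals reported to master

  // Stub standing in for BlockManager.removeBlockInternal: drop the block
  // from the (hypothetical) memory/disk stores and optionally tell the master.
  def removeBlockInternal(blockId: String, tellMaster: Boolean): Unit = {
    removedBlocks += blockId
    if (tellMaster) masterNotifiedRemoved += blockId
  }

  // Hypothetical put() wrapper: runs the put body and cleans up on failure.
  def doPut[T](blockId: String)(putBody: => T): Option[T] = {
    var exceptionWasThrown = true
    try {
      val result = putBody
      exceptionWasThrown = false
      Some(result)
    } catch {
      // Catch NonFatal rather than Exception, so an InterruptedException
      // thrown from a synchronous replication call is not swallowed here.
      case NonFatal(_) => None
    } finally {
      if (exceptionWasThrown) {
        // Failure path: remove the block and report the removal to the
        // master, undoing any earlier availability announcement.
        removeBlockInternal(blockId, tellMaster = true)
      }
    }
  }
}
```

In the real patch the exception propagates to the caller rather than being converted to `None`; the sketch catches it only so the example runs to completion.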
This patch depends upon the refactorings in #15036, so that other patch
will also have to be backported when backporting this fix.
For more background on this issue, including example logs from a real
production failure, see
[SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484).
## How was this patch tested?
Two new regression tests in BlockManagerSuite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-17484
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15085.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15085
----
commit ecc81f70a29c6568449c755f5dd6408610d6e56b
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T19:45:59Z
Add regression test for cleanup after put() exception
commit ba3a9bd68ccc6e360170010deb11bb719819d54a
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T19:48:13Z
Don't swallow InterruptedException in block replication code.
commit b39531323aa68510f7146900fe2f3241110d492e
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T19:46:52Z
Add regression test for update of invalid block locations after fetch
failure
commit aa75e2d7cca5820893fd059462e236a5ccdb77d3
Author: Josh Rosen <[email protected]>
Date: 2016-09-13T18:37:07Z
Perform more complete cleanup in put() finally block.
commit 6609c2a9595a880a77e9356abd71a1688ff67615
Author: Josh Rosen <[email protected]>
Date: 2016-09-09T18:20:25Z
Unconditionally update master's block status when handling invalid request.
----