GitHub user dibbhatt opened a pull request:
https://github.com/apache/spark/pull/8817
[SPARK-10694][STREAMING] Prevent Data Loss in Spark Streaming when used with
OFF_HEAP ExternalBlockStore (Tachyon)
The Motivation:
If a Streaming application stores its blocks OFF_HEAP, it may not need a
WAL-like feature to recover from driver failure. As long as the writes of
blocks from the Streaming receiver to Tachyon are durable, the blocks should
be recoverable directly from Tachyon on driver failure.
This avoids the cost of WAL writes and of duplicating the blocks in both
MEMORY and the WAL, while still guaranteeing an end-to-end no-data-loss
channel using the OFF_HEAP store.
The Challenges:
There are a few challenges I faced while implementing this feature.
1. Even if I enable metadata checkpointing (which stores the checkpoint
files in HDFS), it is still not possible to recover the block information
from the OFF_HEAP store after a driver failure, because the blocks are no
longer registered in the BlockManagerMaster.
2. By default, TachyonBlockManager stores the blocks in folders created
with a random ID, and it also uses the executor ID in the path where it
stores a block. Thus a new executor started after a failure cannot read
blocks written by an earlier executor, which sit under a different executor
ID in a different folder (illustrated in the sketch after this list).
3. By default, TachyonBlockManager uses a shutdown hook to delete the files
written during a given context once that context shuts down. Hence a
Streaming program restarted after a failure cannot get the blocks from
Tachyon.
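To make challenge 2 concrete, here is a small illustrative sketch of why
the default path layout breaks recovery. This is not the actual
TachyonBlockManager code; the base directory and exact layout are
approximations:

    import java.util.UUID

    // Approximate default layout: a fresh random folder is chosen for
    // every application run, with the executor id baked into the path.
    def defaultBlockPath(executorId: String, blockId: String): String = {
      val folderName = "spark-" + UUID.randomUUID.toString
      s"/tmp_spark_tachyon/$folderName/$executorId/$blockId"
    }

    // A restarted application gets a new random folder and new executor
    // ids, so blocks written before the failure become unreachable:
    println(defaultBlockPath("0", "input-0-1442571000000"))
    println(defaultBlockPath("1", "input-0-1442571000000")) // different path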
The Solution:
1. For 1, I have implemented a new RDD called ExternalStoreBlockRDD which,
upon failure, tries to fetch the blocks recovered via the metadata
checkpoint location from the ExternalBlockStore. It then updates the
BlockManagerMaster about those recovered blocks (see the first sketch after
this list).
2. For 2, I made the Tachyon folder location where blocks are stored
configurable. This location does not change across Streaming application
failure and restart, as I have omitted the random-ID-based and
executor-based folder creation. The default behavior is kept unchanged.
3. For 3, I also added a configurable property to exclude the Tachyon
folder from the shutdown hook. The default is false (i.e. the default
behavior), but Streaming applications need to set it to true (see the
configuration sketch after this list).
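Here is a conceptual sketch of what an ExternalStoreBlockRDD-style recovery
has to do per partition. This is not the patch's actual code; the class and
helper names are illustrative, and since it touches Spark internals it would
have to live inside Spark's own org.apache.spark source tree, as the patch
does. It follows the shape of Spark's existing BlockRDD: each partition is
backed by a block ID recovered from the metadata checkpoint, and compute()
pulls that block through the BlockManager, which in Spark 1.5 also consults
the ExternalBlockStore:

    import scala.reflect.ClassTag

    import org.apache.spark.{Partition, SparkContext, SparkEnv,
      SparkException, TaskContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.BlockId

    // One partition per recovered block id (mirrors BlockRDDPartition).
    private class RecoveredBlockPartition(val index: Int, val blockId: BlockId)
      extends Partition

    class ExternalStoreBlockRDDSketch[T: ClassTag](
        sc: SparkContext,
        blockIds: Array[BlockId]) extends RDD[T](sc, Nil) {

      override def getPartitions: Array[Partition] =
        blockIds.zipWithIndex.map { case (id, i) =>
          new RecoveredBlockPartition(i, id): Partition
        }

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val blockId = split.asInstanceOf[RecoveredBlockPartition].blockId
        // The BlockManager checks memory, disk and the external block
        // store; a successful read of a recovered OFF_HEAP block is the
        // point at which the patch would re-register it with the
        // BlockManagerMaster.
        SparkEnv.get.blockManager.get(blockId) match {
          case Some(block) => block.data.asInstanceOf[Iterator[T]]
          case None =>
            throw new SparkException(s"Block $blockId not found in external store")
        }
      }
    }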
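And a sketch of how a Streaming application might opt in to the recoverable
OFF_HEAP setup for items 2 and 3. The two property keys below for the fixed
folder name and for skipping the shutdown hook are placeholders, not
necessarily the names the patch defines; spark.externalBlockStore.url is
the existing Spark 1.5 property for pointing at Tachyon:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/recoverable-tachyon-app"

    val conf = new SparkConf()
      .setAppName("RecoverableTachyonApp")
      .set("spark.externalBlockStore.url", "tachyon://tachyon-master:19998")
      // Fixed, non-random folder so a restarted application can find the
      // blocks written before the failure (item 2; key is a placeholder).
      .set("spark.externalBlockStore.folderName", "streaming-app-blocks")
      // Exclude the folder from the shutdown hook so blocks survive a
      // context shutdown (item 3; key is a placeholder, default false).
      .set("spark.externalBlockStore.keepFoldersOnShutdown", "true")

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint(checkpointDir)
      // ... create receiver streams persisted with StorageLevel.OFF_HEAP ...
      ssc
    }

    // On restart, the metadata checkpoint is read back from HDFS and the
    // recovered block ids are resolved against Tachyon (item 1).
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()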
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dibbhatt/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8817.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8817
----
commit bf10f92c6886f9520cfc5ed5d5b6349036d4e3f5
Author: Dibyendu Bhattacharya <[email protected]>
Date: 2015-09-18T10:22:50Z
Prevent Data Loss in Spark Streaming using ExternalBlockStore (Tachyon)
commit df9e91252e28455843b0342ad7b34b224b7a227a
Author: Dibyendu Bhattacharya <[email protected]>
Date: 2015-09-18T10:24:05Z
Prevent Data Loss in Spark Streaming using ExternalBlockStore (Tachyon)
commit 02a1514764f621aff68e44412b6e096d695ff2fe
Author: Dibyendu Bhattacharya <[email protected]>
Date: 2015-09-18T10:26:35Z
Prevent Data Loss in Spark Streaming using ExternalBlockStore (Tachyon)
----