GitHub user dibbhatt opened a pull request:

    https://github.com/apache/spark/pull/8817

    [SPARK-10694][STREAMING] Prevent Data Loss in Spark Streaming when used with OFF_HEAP ExternalBlockStore (Tachyon)

    The Motivation:
    
    If a Streaming application stores its blocks OFF_HEAP, it may not need a WAL-like feature to recover from Driver failure. As long as the writes of blocks from the Streaming receiver to Tachyon are durable, the blocks should be recoverable directly from Tachyon after a Driver failure. This avoids the expensive WAL write and the duplication of blocks in both MEMORY and the WAL, while still guaranteeing an end-to-end no-data-loss channel using the OFF_HEAP store.
    
    The Challenges:
    
    There are a few challenges I faced while implementing this feature.
    
    1. Even if I enable metadata checkpointing (which stores the checkpoint files in HDFS), it is still not possible after a Driver failure to recover the block information from the OFF_HEAP store, as the blocks are no longer registered in the BlockManagerMaster.
    
    2. By default, TachyonBlockManager stores the blocks in folders created with a random ID, and it also uses the Executor ID in the path where it stores the blocks. Thus, if a new Executor is started after a failure, it cannot read the blocks written by the earlier Executor, which used a different Executor ID and a different folder.
    
    3. By default, TachyonBlockManager uses a shutdown hook to delete the files written during a given context once that context is shut down. Hence, if the Streaming program restarts after a failure, it cannot get the blocks from Tachyon.
    
    
    The Solution:
    
    1. For 1, I have implemented a new RDD called ExternalStoreBlockRDD which, upon failure, tries to fetch from the ExternalBlockStore the blocks recovered via the metadata checkpoint location. It then updates the BlockManagerMaster about those recovered blocks.
    
    2. For 2, I made the Tachyon folder location where blocks are stored configurable. This location will not change across a Streaming application failure and restart, as the random-ID-based and Executor-ID-based folder creation is omitted. The default behavior is kept unchanged.
    
    3. For 3, I likewise added a configurable property to exclude the Tachyon folder from the shutdown hook. The default is false (i.e. the default behavior), but for a Streaming application it needs to be set to true.
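    As an illustration, the two knobs in 2 and 3 might be set in `spark-defaults.conf` along the following lines. The first property (`spark.externalBlockStore.url`) is the existing Tachyon endpoint setting; the last two names are hypothetical placeholders, since the actual property names are defined in the patch itself:

```
# Existing setting: where the Tachyon master runs
spark.externalBlockStore.url               tachyon://localhost:19998

# Hypothetical names for the two new properties described above:
# a fixed folder, stable across restarts (no random-ID / Executor-ID path parts)
spark.externalBlockStore.folderName        spark-streaming-blocks
# keep that folder out of the shutdown hook so blocks survive a restart
spark.externalBlockStore.skipShutdownHook  true
```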


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dibbhatt/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8817.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8817
    
----
commit bf10f92c6886f9520cfc5ed5d5b6349036d4e3f5
Author: Dibyendu Bhattacharya <[email protected]>
Date:   2015-09-18T10:22:50Z

    Prevent Data Loss in Spark Streaming using ExternalBlockStore (Tachyon)

commit df9e91252e28455843b0342ad7b34b224b7a227a
Author: Dibyendu Bhattacharya <[email protected]>
Date:   2015-09-18T10:24:05Z

    Prevent Data Loss in Spark Streaming using ExternalBlockStore (Tachyon)

commit 02a1514764f621aff68e44412b6e096d695ff2fe
Author: Dibyendu Bhattacharya <[email protected]>
Date:   2015-09-18T10:26:35Z

    Prevent Data Loss in Spark Streaming using ExternalBlockStore (Tachyon)

----


