[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

Hari Shreedharan (JIRA) Thu, 21 Aug 2014 13:43:22 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105928#comment-14105928
 ]


Hari Shreedharan commented on SPARK-3129:
-----------------------------------------

[~tgraves] - Thanks for the pointers. Yes, using HDFS also allows us to use the 
same file with some protection to store the keys. This is something that might 
need some design and discussion first. 

I will also update the PR with the reflection code.

[~jerryshao]:
1. Today, RDDs already get checkpointed at the end of every job, when the runJob 
method gets called. Nothing is changing here. The entire graph already gets 
checkpointed.
2. No, this is something that will need to be taken care of. When the driver 
dies, blocks can no longer be batched into RDDs - which means generating blocks 
without the driver makes no sense. Also, when the driver comes back online, new 
receivers get created, which would start receiving the data now. The only 
reason the executors are being kept around is to get the data in their memory - 
any processing/receiving should be killed.
3. Since it is an RDD, there is nothing that stops it from being recovered, 
right? It is recovered by the usual method of regenerating it. Only DStream 
data that has not been converted into an RDD is really lost - so getting the 
RDD back should not be a concern at all (of course, the cache is gone, but it 
can get pulled back into cache once the driver comes back up).
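
To illustrate the distinction in point 3, here is a minimal toy sketch in 
Python. It is not Spark code: ToyRDD and ToyReceiverBuffer are hypothetical 
names invented for this example, and it assumes the RDD's input is still 
available after the failure. It only shows that data described by a lineage 
can be regenerated after its cache is lost, while received data held purely in 
memory cannot.

```python
# Toy model only: these classes are hypothetical and are not Spark APIs.


class ToyRDD:
    """A dataset described by its lineage: an input plus a transformation."""

    def __init__(self, source, transform):
        self.source = source        # assumed to survive the failure
        self.transform = transform  # deterministic per-record function
        self.cache = None           # in-memory copy, lost on failure

    def compute(self):
        # Rebuild the data from the lineage whenever the cache is missing.
        if self.cache is None:
            self.cache = [self.transform(x) for x in self.source]
        return self.cache

    def lose_cache(self):
        self.cache = None


class ToyReceiverBuffer:
    """Received records held only in memory, not yet batched into an RDD."""

    def __init__(self):
        self.blocks = []

    def receive(self, record):
        self.blocks.append(record)

    def lose_memory(self):
        # Nothing describes how to regenerate these records.
        self.blocks = []


rdd = ToyRDD(source=[1, 2, 3], transform=lambda x: x * 10)
before = rdd.compute()
rdd.lose_cache()
after = rdd.compute()  # regenerated from the lineage

buf = ToyReceiverBuffer()
buf.receive("event-1")
buf.lose_memory()      # the record is gone unless the sender can re-send it
```

In this sketch, before and after hold the same values, while buf.blocks ends 
up empty. That empty buffer is the case this issue is about.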


> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - and 
> the sending system cannot re-send the data (or the data has already expired on 
> the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
