[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

Hari Shreedharan (JIRA) Thu, 04 Sep 2014 14:36:07 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122006#comment-14122006
 ]


Hari Shreedharan commented on SPARK-3129:
-----------------------------------------

Yes, so my initial goal is to be able to recover all the blocks that have not 
been made into an RDD yet (once a block is part of an RDD, it is safe). There 
is also data that may not have become a block yet (records added using the += 
operator). For now, I am going to call it fair game to say that we will add 
storeReliably(ArrayBuffer/Iterable) methods, and that these are the only ones 
that store data such that it is guaranteed to be recovered.

At a later stage, we could use something like a WAL on HDFS to recover even the 
+= data, though that would affect performance.
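
To make the distinction above concrete, here is a minimal sketch in Python. It 
is not Spark code: ToyReceiver, store, store_reliably and recover are 
hypothetical names, and a local file stands in for a log on HDFS. The 
assumption is that the reliable path persists a whole batch before returning, 
while the += style path only buffers records in memory.

```python
import json
import os
import tempfile


class ToyReceiver:
    """Toy model of the two storage paths; not Spark's actual API."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.pending = []  # single records, held in memory only
        self.blocks = []   # batches that were also written to the log

    def store(self, record):
        # Analogue of the += path: nothing is persisted, so a crash loses it.
        self.pending.append(record)

    def store_reliably(self, records):
        # Analogue of the proposed reliable path: write the whole batch to
        # the log and force it to disk before returning to the caller.
        block = list(records)
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(block) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        self.blocks.append(block)

    @staticmethod
    def recover(wal_path):
        # After a restart, rebuild the logged batches from the file.
        if not os.path.exists(wal_path):
            return []
        with open(wal_path, encoding="utf-8") as wal:
            return [json.loads(line) for line in wal if line.strip()]


def demo():
    wal_path = os.path.join(tempfile.mkdtemp(), "blocks.wal")
    receiver = ToyReceiver(wal_path)
    receiver.store("a")                  # buffered only
    receiver.store_reliably(["b", "c"])  # logged before returning
    del receiver                         # simulate losing in-memory state
    return ToyReceiver.recover(wal_path)
```

Calling demo() returns only the batch passed to store_reliably; the record 
passed to store is gone. The fsync on every batch is also where the 
performance cost mentioned above would come from in a sketch like this.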



> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down and 
> the sending system cannot re-send the data (or the data has already expired 
> on the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
