[ 
https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984950#comment-13984950
 ] 

Hari Shreedharan commented on SPARK-1645:
-----------------------------------------

The first one is not exactly accurate, though it explains the idea. The second 
is what I suggest.

As a first step, we do what is currently done - the receiver stores the data 
locally and then acknowledges, so reliability does not improve yet. Later we 
can improve this for all receivers so that the data is persisted all the way 
to the driver before the ACK goes out (add a new API like storeReliably or 
something). 
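To make that concrete, here is a rough sketch of the receiver side under those 
assumptions. The FlumePollingClient trait, PolledBatch and the poll/ack/nack 
calls are made-up names for the new sink's protocol, not an existing API; only 
the Spark Receiver, StorageLevel and SparkFlumeEvent classes are real.

{code:scala}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.SparkFlumeEvent
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical client for the new polling sink. PolledBatch and the
// poll/ack/nack calls are illustrative, not an existing API.
case class PolledBatch(requestId: String, events: Seq[SparkFlumeEvent])

trait FlumePollingClient {
  def poll(maxEvents: Int): PolledBatch
  def ack(requestId: String): Unit
  def nack(requestId: String): Unit
}

// Step 1: store the polled events locally (same reliability as today),
// then ACK so the sink can commit the Flume transaction. A later
// storeReliably-style API would only ACK once the data is persisted
// all the way to the driver.
class FlumePollingReceiver(client: FlumePollingClient, maxEvents: Int)
    extends Receiver[SparkFlumeEvent](StorageLevel.MEMORY_AND_DISK_SER_2) {

  override def onStart(): Unit = {
    new Thread("Flume Polling Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val batch = client.poll(maxEvents)
          try {
            batch.events.foreach(e => store(e)) // local store, as done today
            client.ack(batch.requestId)         // sink commits the txn
          } catch {
            case _: Exception =>
              client.nack(batch.requestId)      // sink rolls the txn back
          }
        }
      }
    }.start()
  }

  override def onStop(): Unit = { }
}
{code}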

We would have to do a two-step Poll-ACK process. The initial poll creates a new 
request, which is added to the set of requests pending commit in the sink. Once 
the receiver has written the data (for now in the current way, later reliably), 
it sends an ACK for the request id, which causes the request to be committed so 
Flume can remove the events. If the receiver does not send the ACK, a scheduled 
thread in the sink (with a timeout that can be specified in the Flume config) 
rolls the transaction back and makes the data available again (Flume already 
has the ability to make uncommitted txns available again if that agent fails).
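On the sink side, the pending-commit bookkeeping could look roughly like the 
sketch below. All names are made up for illustration; OpenBatch stands in for 
an open Flume channel transaction, and its commit/rollback correspond to 
Flume's Transaction.commit()/rollback() (which a real sink would have to run 
on the thread that began the transaction).

{code:scala}
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

// Stand-in for one polled batch with its still-open Flume channel
// transaction. commit()/rollback() correspond to Flume's
// Transaction.commit()/rollback(); all names here are illustrative.
trait OpenBatch {
  def requestId: String
  def commit(): Unit    // Flume removes the events from the channel
  def rollback(): Unit  // Flume makes the events available again
}

// Sketch of the sink-side Poll-ACK bookkeeping: a polled batch stays
// pending until the receiver ACKs it, and a scheduled task rolls it
// back if no ACK arrives within the configured timeout.
class PendingBatchTracker(ackTimeoutSecs: Long) {

  private val pending = new ConcurrentHashMap[String, OpenBatch]()
  private val reaper  = Executors.newSingleThreadScheduledExecutor()

  // Called on the initial poll: remember the open transaction and arm
  // the timeout that rolls it back if the receiver never ACKs.
  def registerPoll(batch: OpenBatch): Unit = {
    pending.put(batch.requestId, batch)
    reaper.schedule(new Runnable {
      override def run(): Unit = expire(batch.requestId)
    }, ackTimeoutSecs, TimeUnit.SECONDS)
  }

  // ACK from the receiver: commit, so Flume can remove the events.
  def ack(requestId: String): Unit =
    Option(pending.remove(requestId)).foreach(_.commit())

  // Timeout (or explicit NACK): roll back, so the data becomes
  // available again for redelivery.
  def expire(requestId: String): Unit =
    Option(pending.remove(requestId)).foreach(_.rollback())
}
{code}

The scheduled rollback is what lets Flume redeliver the batch (possibly to a 
receiver restarted on a different node) if the original receiver dies before 
ACKing.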

> Improve Spark Streaming compatibility with Flume
> ------------------------------------------------
>
>                 Key: SPARK-1645
>                 URL: https://issues.apache.org/jira/browse/SPARK-1645
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it needs to be restarted on the same node, 
> or else Flume cannot send data to it. We can fix this by adding a Flume 
> receiver that polls Flume, and a Flume sink that supports this.
> * The receiver sends ACKs to Flume before the driver knows about the data. 
> The new receiver should also handle this case.
> * Data loss when the driver goes down - This is true for any streaming 
> ingest, not just Flume. I will file a separate jira for this and we should 
> work on it there. This is a longer-term project and requires considerable 
> development work.
> I intend to start working on these soon. Any input is appreciated. (It'd be 
> great if someone could add me as a contributor on jira, so I can assign the 
> jira to myself).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
