[https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984780#comment-13984780]
Hari Shreedharan commented on SPARK-1645:
-----------------------------------------
No, the sink would run inside the Flume agent that Spark receives data
from. (A sink is a Flume component that pushes data out; it is managed by
Flume.) Basically, this sink pulls data from the Flume agent's buffer when
the Spark receiver polls it. If the receiver dies and restarts, it can keep
getting data as long as it knows which agent to poll. This solves the case
where Flume pushes data to a receiver that may have died and restarted
elsewhere, since Spark now polls Flume.
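
As a rough illustration, the receiver side could look something like the
sketch below once a polling API exists. The FlumeUtils.createPollingStream
entry point, host, and port here are assumptions for the sketch, not a
shipped API at the time of this comment:

    // Hypothetical sketch: a Spark Streaming job that polls a Flume agent's
    // sink instead of having Flume push to it.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object FlumePollingExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FlumePollingExample")
        val ssc = new StreamingContext(conf, Seconds(5))

        // The receiver connects out to the agent and pulls batches; if the
        // receiver restarts elsewhere, it only needs the agent's address.
        val events = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)
        events.map(e => new String(e.event.getBody.array())).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }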
> Improve Spark Streaming compatibility with Flume
> ------------------------------------------------
>
> Key: SPARK-1645
> URL: https://issues.apache.org/jira/browse/SPARK-1645
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Reporter: Hari Shreedharan
>
> Currently the following issues affect Spark Streaming and Flume compatibility:
> * If a Spark worker goes down, it must be restarted on the same node, or
> Flume cannot send data to it. We can fix this by adding a Flume receiver
> that polls Flume, and a Flume sink that supports this.
> * The receiver sends acks to Flume before the driver knows about the data.
> The new receiver should also handle this case (see the sink sketch below).
> * Data loss when the driver goes down - This is true for any streaming
> ingest, not just Flume. I will file a separate jira for this and we should
> work on it there. This is a longer-term project and requires considerable
> development work.
> I intend to start working on these soon. Any input is appreciated. (It'd be
> great if someone could add me as a contributor on jira, so I can assign the
> jira to myself.)
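
For the ack issue in the second bullet, the core idea is that the sink
should commit its channel transaction only after the receiver confirms the
batch, so unacknowledged events stay in Flume's channel and are redelivered.
A minimal, simplified sketch of such a sink, assuming Flume's standard sink
API and a hypothetical sendBatchAndAwaitAck transport helper (the real
design would serve batches over RPC to the polling receiver):

    import org.apache.flume.{Channel, Event, Transaction}
    import org.apache.flume.Sink.Status
    import org.apache.flume.sink.AbstractSink

    import scala.collection.mutable.ArrayBuffer

    // Simplified, hypothetical pull-style sink; the transport is stubbed
    // out so the transaction semantics are the focus.
    class PollableSparkSink extends AbstractSink {
      private val batchSize = 100

      // Hypothetical stand-in for the receiver RPC: returns true only once
      // the Spark receiver confirms it has stored the batch.
      private def sendBatchAndAwaitAck(batch: Seq[Event]): Boolean = ???

      override def process(): Status = {
        val channel: Channel = getChannel
        val tx: Transaction = channel.getTransaction
        tx.begin()
        try {
          // Take up to batchSize events from the channel inside the transaction.
          val batch = new ArrayBuffer[Event](batchSize)
          var drained = false
          while (!drained && batch.size < batchSize) {
            val event = channel.take()
            if (event == null) drained = true else batch += event
          }
          if (batch.isEmpty) {
            tx.commit()
            Status.BACKOFF
          } else if (sendBatchAndAwaitAck(batch)) {
            // Only now do the events actually leave the channel.
            tx.commit()
            Status.READY
          } else {
            // No ack from the receiver: roll back so Flume redelivers the batch.
            tx.rollback()
            Status.READY
          }
        } catch {
          case e: Exception =>
            tx.rollback()
            throw e
        } finally {
          tx.close()
        }
      }
    }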