[
https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hari Shreedharan updated SPARK-4174:
------------------------------------
Issue Type: Improvement (was: Bug)
> Streaming: Optionally provide notifications to Receivers when DStream has
> been generated
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-4174
> URL: https://issues.apache.org/jira/browse/SPARK-4174
> Project: Spark
> Issue Type: Improvement
> Reporter: Hari Shreedharan
> Assignee: Hari Shreedharan
>
> Receivers that read from message queues such as ActiveMQ or Kafka can
> replay messages if required. Using the HDFS WAL mechanism for such systems
> hurts efficiency, because we incur an unnecessary HDFS write when we can
> recover the data from the queue anyway.
> We can fix this by notifying the receiver when the RDD is generated from
> the blocks. We need to consider the case where a receiver fails before the
> RDD is generated and comes back on a different executor when the RDD is
> generated. Either way, this is likely to cause duplicates rather than data
> loss, so we may be OK.
> I am thinking of something along the lines of accepting a callback function
> that gets called when the RDD is generated. We can keep the function locally
> in a map of batch id -> function, and call it when the RDD for that batch is
> generated (we can inform the ReceiverSupervisorImpl via Akka when the driver
> generates the RDD). Of course, this is just an early thought; I will work on
> a design doc for this one.
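
The callback map described above could look roughly like the following. This is a minimal sketch in plain Python, not Spark code: BatchCallbackRegistry, register and on_rdd_generated are hypothetical names invented for illustration, and the Akka message from the driver is replaced by a direct method call.

```python
from typing import Callable, Dict


class BatchCallbackRegistry:
    """Hypothetical receiver-side registry mapping batch id -> callback."""

    def __init__(self) -> None:
        self._callbacks: Dict[int, Callable[[], None]] = {}

    def register(self, batch_id: int, callback: Callable[[], None]) -> None:
        # Store the callback locally until the driver reports the batch.
        self._callbacks[batch_id] = callback

    def on_rdd_generated(self, batch_id: int) -> bool:
        # Stand-in for the notification sent by the driver once the RDD
        # for this batch exists. The entry is removed before the call, so
        # a repeated notification does not run the callback twice.
        callback = self._callbacks.pop(batch_id, None)
        if callback is None:
            return False
        callback()
        return True


# Usage: a receiver acknowledges queue messages only after RDD generation.
acked = []
registry = BatchCallbackRegistry()
registry.register(1, lambda: acked.append("batch 1"))
registry.register(2, lambda: acked.append("batch 2"))

first = registry.on_rdd_generated(1)   # callback for batch 1 runs
repeat = registry.on_rdd_generated(1)  # already handled, nothing runs
```

Because the map lives only in the receiver's memory, a receiver that restarts on another executor loses its pending callbacks, so the queue would redeliver those messages. That matches the duplicates-rather-than-loss behaviour noted in the description.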