[ https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197464#comment-14197464 ]

Hari Shreedharan commented on SPARK-4174:
-----------------------------------------

I will write up a doc soon, but here are some initial thoughts:

We should be able to provide one callback per batch, but unfortunately 
receivers are not really batch-aware. So there are a couple of options:
- Take a function to be called per block, and call all of the functions in 
order. This can generate as many callbacks as there are blocks in the batch - 
the number may not be huge, but it does mean a chain of methods gets called as 
each block completes. This is most likely the simplest option as far as 
semantics are concerned. 
- We could add a new method to the Receiver API - 
`onBlockComplete(List[BlockId])` - which would let the receiver handle the 
notification itself (see the sketch after this list). Unfortunately, none of 
the push* methods currently return the block id, so this method would really 
be useful only if the receiver calls the push* methods with a block id passed 
in. We'd have to document that requirement. 
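
To make the two options concrete, here is a rough sketch. None of these traits 
or names exist in Spark's Receiver API today; `BlockId` is just a stand-in for 
`StreamBlockId`:

```scala
// Sketch only -- none of these traits or names exist in Spark's Receiver API
// today; BlockId stands in for org.apache.spark.storage.StreamBlockId.
case class BlockId(streamId: Int, uniqueId: Long)

// Option 1: a callback handed in per pushed block, invoked in push order once
// that block has been rolled into a generated RDD.
trait PerBlockCallbacks {
  def push(data: Iterator[Any], onComplete: BlockId => Unit): Unit
}

// Option 2: a batch-level hook listing every block id that made it into the
// batch's RDD. Only useful if the receiver itself supplied those ids to the
// push* calls, since push* does not return them today.
trait BatchCallbacks {
  def onBlocksComplete(blockIds: Seq[BlockId]): Unit
}

// Toy illustration of option 2: ack the source (e.g. a message queue) once we
// know the blocks have made it into an RDD.
object Example extends BatchCallbacks with App {
  def onBlocksComplete(blockIds: Seq[BlockId]): Unit =
    println(s"blocks ${blockIds.mkString(", ")} are in an RDD; ack the source")

  onBlocksComplete(Seq(BlockId(streamId = 0, uniqueId = 42L)))
}
```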

I am open to other semantics as well, but the API is a bit limiting since we 
don't report the block info back to the receiver as of now.

The implementation idea I have for either would be pretty similar. Before I go 
into that, any thoughts on the semantics?


> Streaming: Optionally provide notifications to Receivers when DStream has 
> been generated
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-4174
>                 URL: https://issues.apache.org/jira/browse/SPARK-4174
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Receivers that receive data from message queues, like ActiveMQ, Kafka, etc., 
> can replay messages if required. Using the HDFS WAL mechanism for such 
> systems hurts efficiency, as we incur an unnecessary HDFS write when we can 
> recover the data from the queue anyway.
> We can fix this by providing a notification to the receiver when the RDD is 
> generated from the blocks. We need to consider the case where a receiver 
> might fail before the RDD is generated and come back on a different executor 
> when the RDD is generated. Either way, this is likely to cause duplicates and 
> not data loss -- so we may be ok.
> I am thinking of something along the lines of accepting a callback function 
> that gets called when the RDD is generated. We can keep the function locally 
> in a map of batch id -> function, which gets called when the RDD gets 
> generated (we can inform the ReceiverSupervisorImpl via Akka when the driver 
> generates the RDD). Of course, this is just an early thought - I will work 
> on a design doc for this one.
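
For reference, a minimal sketch of the batch id -> callback bookkeeping 
described above. The names here (`PendingCallbacks`, `onRddGenerated`) are 
hypothetical stand-ins, not existing Spark classes; the real version would 
live in ReceiverSupervisorImpl and be triggered by a message from the driver:

```scala
import scala.collection.mutable

// Hypothetical bookkeeping only -- PendingCallbacks and onRddGenerated are
// made-up names, not Spark classes.
object PendingCallbacks {
  // batch id -> callback to fire once the driver reports that the RDD for
  // that batch has been generated.
  private val callbacks = mutable.Map.empty[Long, () => Unit]

  // Receiver side: remember what to do once this batch's RDD exists.
  def register(batchId: Long)(onGenerated: () => Unit): Unit = synchronized {
    callbacks(batchId) = onGenerated
  }

  // Called when the driver's "RDD generated for batch X" message arrives.
  def onRddGenerated(batchId: Long): Unit = {
    val cb = synchronized { callbacks.remove(batchId) }
    cb.foreach(_.apply()) // fire the notification, if one was registered
  }
}

object Demo extends App {
  PendingCallbacks.register(batchId = 1L) { () =>
    println("batch 1's RDD is generated -- safe to ack/commit at the source")
  }
  PendingCallbacks.onRddGenerated(1L)
}
```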



