[ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252256#comment-14252256
 ] 

Hari Shreedharan commented on SPARK-3146:
-----------------------------------------

I understand what you mean by flatMap (the reason I don't want to call it 
flatMap is that we already have a flatMap which behaves differently in terms of 
when and where it is executed). 

Interceptors can simply be functions passed in (the reason it is not a function 
in Flume is because Flume is all Java - functions could not be passed around 
till Java 8), the idea being the same. I think we both agree on the idea, we 
can discuss how it is exposed in more detail once we have more consensus on if 
it makes sense.

For (1), I see that metadata is something that might be helpful. What if we 
pass the metadata to the interceptor method too (see how the BlockGenerator 
gets it now?). It is better than rewriting it as a DStream[MessageAndMetadata] 
- which may not be as easy to get done.

> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3146
>                 URL: https://issues.apache.org/jira/browse/SPARK-3146
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Saisai Shao
>
> Currently Spark Streaming Kafka API stores the key and value of each message 
> into BM for processing, potentially this may lose the flexibility for 
> different requirements:
> 1. currently topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, like SPARK-2388 described.
> 2. People may need to add timestamp for each message when feeding into Spark 
> Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler in interface to give people the flexibility 
> to preprocess the message before storing into BM. In the meantime time this 
> improvement keep compatible with current API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to