[ https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saisai Shao updated SPARK-3146: ------------------------------- Description: Currently Spark Streaming Kafka API stores the key and value of each message into BM for processing, potentially this may lose the flexibility for different requirements: 1. currently topic/partition/offset information for each message is discarded by KafkaInputDStream. In some scenarios people may need this information to better filter the message, like SPARK-2388 described. 2. People may need to add timestamp for each message when feeding into Spark Streaming, which can better measure the system latency. 3. Checkpointing the partition/offsets or others... So here we add a messageHandler in interface to give people the flexibility to preprocess the message before storing into BM. In the meantime time this improvement keep compatible with current API. was: Currently Spark Streaming Kafka API stores the key and value of each message into BM for processing, potentially this may lose the flexibility for different requirements: 1. currently topic/partition/offset information for each message is discarded by KafkaInputDStream. In some scenarios people may need this information to better filter the message, like SPARK-2388 described. 2. People may need to add timestamp for each message when feeding into Spark Streaming, which can better measure the system latency. So here we add a messageHandler in interface to give people the flexibility to preprocess the message before storing into BM. In the meantime time this improvement keep compatible with current API. > Improve the flexibility of Spark Streaming Kafka API to offer user the > ability to process message before storing into BM > ------------------------------------------------------------------------------------------------------------------------ > > Key: SPARK-3146 > URL: https://issues.apache.org/jira/browse/SPARK-3146 > Project: Spark > Issue Type: Improvement > Components: Streaming > Affects Versions: 1.0.2 > Reporter: Saisai Shao > > Currently Spark Streaming Kafka API stores the key and value of each message > into BM for processing, potentially this may lose the flexibility for > different requirements: > 1. currently topic/partition/offset information for each message is discarded > by KafkaInputDStream. In some scenarios people may need this information to > better filter the message, like SPARK-2388 described. > 2. People may need to add timestamp for each message when feeding into Spark > Streaming, which can better measure the system latency. > 3. Checkpointing the partition/offsets or others... > So here we add a messageHandler in interface to give people the flexibility > to preprocess the message before storing into BM. In the meantime time this > improvement keep compatible with current API. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org