[
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254100#comment-14254100
]
Cody Koeninger commented on SPARK-3146:
---------------------------------------
This is a real problem for production use, not an abstract concern about what
users might or might not do if you gave them full access to power that they
already should have had.
The patch is a working implementation that solves the problem; we have been
using it in production.
I think the patch needs to be improved to take a function MessageAndMetadata =>
Iterable[R] instead of the current MessageAndMetadata => R (in other words it
needs to be a flatMap operation instead of the current map).
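A minimal sketch of the difference, using a stub in place of Kafka's real
kafka.message.MessageAndMetadata (the field names and handler bodies here are
illustrative, not the actual patch): a map-style handler must emit exactly one
record per message, while a flatMap-style handler can emit zero or more, which
is what lets it filter or expand messages.

```scala
// Stub standing in for kafka.message.MessageAndMetadata, which carries
// topic, partition, offset, key, and payload. Sketch only.
case class MessageAndMetadata[K, V](topic: String, partition: Int,
                                    offset: Long, key: K, message: V)

object HandlerSketch {
  // Current shape in the patch: map-style, exactly one record per message.
  def mapHandler(mmd: MessageAndMetadata[String, String]): (Long, String) =
    (mmd.offset, mmd.message)

  // Proposed shape: flatMap-style, zero or more records per message,
  // so a handler can drop a message (return Nil) or split it into several.
  def flatMapHandler(mmd: MessageAndMetadata[String, String]): Iterable[(Long, String)] =
    if (mmd.message.isEmpty) Nil else Seq((mmd.offset, mmd.message))

  def main(args: Array[String]): Unit = {
    val msgs = Seq(
      MessageAndMetadata("t", 0, 1L, "k1", "a"),
      MessageAndMetadata("t", 0, 2L, "k2", ""))
    // map keeps one output per input; flatMap can drop the empty message.
    println(msgs.map(mapHandler).size)          // 2
    println(msgs.flatMap(flatMapHandler).size)  // 1
  }
}
```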
If for some reason this needs to be added to all receivers, and it needs to be
done via a setter instead of a constructor argument, I don't understand why,
but it will solve the problem, so I'm all for it.
If it really needs to be a function (Any, Any) => Iterable[R], I think that's
worse from a user interface point of view, but again it will solve the problem,
so I'm all for it.
If someone with the power to merge PRs will say which of those options they
would be willing to accept, I'll happily implement it.
What I definitely don't want to see is for this ticket to languish for another
6 months, during which time I keep having to do manual maintenance of a patched
Spark.
> Improve the flexibility of Spark Streaming Kafka API to offer user the
> ability to process message before storing into BM
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Saisai Shao
>
> Currently the Spark Streaming Kafka API stores only the key and value of each
> message into BM for processing, which can lose flexibility for
> different requirements:
> 1. Topic/partition/offset information for each message is currently discarded
> by KafkaInputDStream. In some scenarios this information is needed to better
> filter messages, as described in SPARK-2388.
> 2. Users may want to attach a timestamp to each message as it is fed into
> Spark Streaming, to better measure system latency.
> 3. Checkpointing the partition/offsets, among other uses.
> So here we add a messageHandler to the interface to give users the flexibility
> to preprocess each message before it is stored into BM. At the same time, this
> improvement stays compatible with the current API.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]