spektom edited a comment on issue #27022: [SPARK-28415][DSTREAMS] Add messageHandler to Kafka 10 direct stream API #25205
URL: https://github.com/apache/spark/pull/27022#issuecomment-570334765

@koeninger Let me explain (my original description was probably not clear enough).

Suppose there are Kafka topics carrying huge JSON documents, and my Spark Streaming job operates on only a few of the JSON fields. I'd like to strip down the original message at an early stage, and that is what a message handler applied up front allows me to do.

I could strip the JSON content down as the first step after getting the stream's RDD, but that would prevent me from retrieving the Kafka offsets from the RDD, because offset retrieval must be the first operation on the RDD.

I've seen environments where Spark Streaming applications simply wouldn't work because of tremendous memory consumption when operating on big JSON documents, and the message handler was the remedy. I therefore think that removing this feature in the new API is a regression for some workloads.

Does this make sense?
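
For illustration, the ordering constraint described above looks roughly like the following in the Kafka 0-10 direct stream API. This is only a sketch under stated assumptions: the broker address, topic name, group id, and the `stripDown` helper are hypothetical placeholders, and the helper stands in for real JSON field extraction.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object StripEarlySketch {
  // Hypothetical helper: a stand-in for extracting the few JSON fields the job needs.
  def stripDown(json: String): String = json.take(256)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("strip-early-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed connection settings; adjust for a real cluster.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "strip-early-sketch",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("big-json-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // The cast only succeeds on the RDD handed over by the direct stream,
      // so the offsets have to be captured before any transformation.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Only now can the full ConsumerRecord be reduced to the needed part.
      val slim = rdd.map(record => stripDown(record.value))
      println(s"slim records in batch: ${slim.count()}")

      // Commit the captured offsets once the batch has been processed.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Calling `rdd.map(...)` first and then casting the result to `HasOffsetRanges` would fail, because the mapped RDD is no longer the Kafka-backed RDD. In this arrangement the stripping step therefore always sees the full record.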
