spektom edited a comment on issue #27022: [SPARK-28415][DSTREAMS] Add 
messageHandler to Kafka 10 direct stream API #25205
URL: https://github.com/apache/spark/pull/27022#issuecomment-570334765
 
 
   @koeninger Let me explain (my original description was probably not clear 
enough).
   
   Say there are Kafka topics carrying huge JSON documents, and my Spark 
streaming job only operates on a few of the JSON fields. I'd like to strip 
down the original message at an early stage, and this is what the message 
handler allows me to do. I could instead strip the JSON content down as the 
first step after getting the stream's RDD, but that would prevent me from 
retrieving the Kafka offsets from the RDD (because offset retrieval must be 
the first operation on the RDD).
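
   To illustrate the constraint, here is a minimal sketch against the Kafka 
0-10 direct stream API. The object name `SlimAfterOffsets` and the `slim` 
function (which would keep only the needed JSON fields) are hypothetical and 
only serve the example; the stream is assumed to have been created with 
`KafkaUtils.createDirectStream`.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

object SlimAfterOffsets {
  // `slim` is a hypothetical user function that reduces a JSON document
  // to the few fields the job actually needs.
  def run(stream: InputDStream[ConsumerRecord[String, String]],
          slim: String => String): Unit = {
    stream.foreachRDD { rdd =>
      // The cast to HasOffsetRanges only succeeds on the RDD handed out by
      // the direct stream itself, so it has to come before any map/filter.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Only after that can the payload be stripped down.
      val slimmed = rdd.map(record => slim(record.value))

      slimmed.foreachPartition { part =>
        part.foreach(_ => ()) // placeholder for the real processing
      }

      // Commit the offsets once the batch has been processed.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}
```

   With this pattern the job always receives the full `ConsumerRecord` first, 
whereas a message handler would let it hand over the reduced value directly.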
   
   I've seen environments where Spark streaming applications simply wouldn't 
work because of tremendous memory consumption when operating on big JSON 
documents, and the message handler was the remedy. Therefore, I think removing 
this feature from the new API is a regression for some workloads.
   
   Does this make sense?

