spektom opened a new pull request #27022: [SPARK-28415][DSTREAMS] Add messageHandler to Kafka 10 direct stream API #25205 URL: https://github.com/apache/spark/pull/27022 ## What changes were proposed in this pull request? This patch introduces messageHandler parameter that can be provided to Kafka DStream, which allows processing events received from Kafka at an early stage. Lack of messageHandler parameter to KafkaUtils.createDirectStrem(...) in the new Kafka 10 API is what prevents us from upgrading our processes to use it, and here's why: 1. messageHandler() allowed parsing / filtering / projecting huge JSON files at an early stage (only a small subset of JSON fields is required for a process), without this current cluster configuration doesn't keep up with the traffic. 2. Transforming Kafka events right after a stream is created prevents from using HasOffsetRanges interface later. This means that whole message must be propagated to the end of a pipeline, which is very ineffective. ## How was this patch tested? Unit tests are provided.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
