siying commented on PR #41525: URL: https://github.com/apache/spark/pull/41525#issuecomment-1587669644
@HeartSaVioR In my understanding, we are essentially timing poll() right now, which is pretty much what InternalKafkaConsumer.fetch() does and what we measure. I guess your concern is that when we report the metric per microbatch, some microbatches might show higher latency and some lower, and it is misleading to look at individual ones? I think that is acceptable, as people understand buffering and know they need to add up multiple microbatches to see the real cost.

The alternative approaches are not less confusing either. If we, for example, trace inside InternalKafkaConsumer and report the metric periodically, outside microbatch boundaries, the reporting period might include tasks from other jobs, making it hard to match the metric with the tasks it serves. Another alternative is to log the timing every time poll() is called, but wouldn't that potentially be spammy?
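For context, here is a minimal sketch of the per-fetch timing approach being discussed: each poll() call is timed and the total is accumulated until it is reported at the microbatch boundary. The class and method names (`TimedConsumer`, `reportAndReset`) are hypothetical illustrations, not the actual code or metric names in this PR.

```scala
import java.time.Duration
import org.apache.kafka.clients.consumer.{ConsumerRecords, KafkaConsumer}

// Hypothetical wrapper illustrating the approach: time each poll() and
// accumulate the total, then report it once per microbatch.
class TimedConsumer[K, V](consumer: KafkaConsumer[K, V]) {
  private var pollTimeNanos: Long = 0L

  def poll(timeout: Duration): ConsumerRecords[K, V] = {
    val start = System.nanoTime()
    try consumer.poll(timeout)
    finally pollTimeNanos += System.nanoTime() - start
  }

  // Called at the microbatch boundary: return the accumulated poll time
  // for this batch and reset the counter for the next one.
  def reportAndReset(): Long = {
    val total = pollTimeNanos
    pollTimeNanos = 0L
    total
  }
}
```

Because buffering can shift work across batch boundaries, a single batch's value may look high or low in isolation; summing the reported values across microbatches gives the total poll cost, which is the point made above.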
