HeartSaVioR commented on PR #41525: URL: https://github.com/apache/spark/pull/41525#issuecomment-1588699211
> Another alternative is to log the timing every time poll() is called. Won't it be potentially to spamming?

That was what I've been thinking through, but I agree that it could come across as spam, depending on how many records Kafka returns per fetch.

What about providing statistics for all poll() calls that happened in a single microbatch (more precisely, one cycle of the consumer, from borrow to close/return)? If multiple polls happened, we can calculate simple (or slightly richer) stats and provide them at the end, like we do in the current change. If no poll happened in a cycle at all, we can simply log that no record has been fetched, hence no stats.

I think that is much better than expecting end users to add up these metrics across microbatches by themselves. Many end users won't know about the implementation details.

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

> Along with consumers, Spark pools the records fetched from Kafka separately, to let Kafka consumers stateless in point of Spark’s view, and maximize the efficiency of pooling. It leverages same cache key with Kafka consumers pool. Note that it doesn’t leverage Apache Commons Pool due to the difference of characteristics.

This is pretty much everything we say about the cache for fetched data. I think many users won't even get to this point until they encounter the problem and have to look at the guide doc.

This is definitely a step forward on debuggability, but we will need a way to provide better visibility from an operational perspective.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
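To make the per-cycle statistics idea concrete, here is a rough sketch of the bookkeeping it implies. This is not the code in the PR and not Spark's API: the class `PollStats` and its methods are hypothetical names, and Python is used only for brevity. It assumes the caller measures each poll() duration itself and reports it together with the number of records returned.

```python
import statistics
from dataclasses import dataclass, field
from typing import List


@dataclass
class PollStats:
    """Hypothetical collector for one consumer cycle (borrow to close/return)."""

    durations_ms: List[float] = field(default_factory=list)
    records: int = 0

    def record_poll(self, duration_ms: float, num_records: int) -> None:
        # Called once per poll(); nothing is logged here, to avoid spamming.
        self.durations_ms.append(duration_ms)
        self.records += num_records

    def summary(self) -> str:
        # Called once at the end of the cycle to produce a single log line.
        if not self.durations_ms:
            return "no poll() in this cycle; no records fetched, hence no stats"
        return (
            f"polls={len(self.durations_ms)} records={self.records} "
            f"total_ms={sum(self.durations_ms):.1f} "
            f"min_ms={min(self.durations_ms):.1f} "
            f"max_ms={max(self.durations_ms):.1f} "
            f"avg_ms={statistics.mean(self.durations_ms):.1f}"
        )
```

The intent is one summary line per cycle, emitted when the consumer is returned, so the log volume stays independent of how many polls a microbatch needs.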
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
