[
https://issues.apache.org/jira/browse/SPARK-54039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nishanth updated SPARK-54039:
-----------------------------
    Affects Version/s: 4.0.0
                       3.5.2
                       3.5.0
                       3.4.1
                           (was: 3.5.6)
> SS | `Add TaskID to KafkaDataConsumer logs for better correlation between
> Spark tasks and Kafka operations`
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-54039
> URL: https://issues.apache.org/jira/browse/SPARK-54039
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.4.1, 3.5.0, 3.5.2, 4.0.0
> Reporter: Nishanth
> Priority: Major
> Fix For: 4.1.0
>
>
> Currently, the Kafka consumer logs generated by {{KafkaDataConsumer}} (in
> {{org.apache.spark.sql.kafka010}}) do not include the Spark {{TaskID}} or
> task context information.
> This makes it difficult to correlate specific Kafka fetch or poll operations
> with the corresponding Spark tasks during debugging and performance
> investigations, especially when multiple tasks consume from the same Kafka
> topic partitions concurrently on different executors.
> Adding the Spark {{TaskID}} to the Kafka consumer log statements would
> significantly improve traceability and observability. It would allow
> engineers to:
> * Identify which Spark task triggered a specific Kafka poll or fetch
> * Diagnose task-level Kafka latency or deserialization bottlenecks
> * Simplify debugging of offset commit or consumer lag anomalies when
> multiple concurrent consumers are running
> h3. *Current Behavior*
> The current Kafka source consumer logging framework does not include
> {{TaskID}} information in the {{KafkaDataConsumer}} class.
> *Example log output:*
> {code:java}
> 14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
> 14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka
> reader... taskId=4188 partitionId=83
> 14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=*******
> groupId=******* read 12922 records through 80 polls (polled out 12979
> records), taking 299911.706198 ms, during time span of 303617.473973 ms  <--
> No Task Context Here
> 14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
> Although the Executor logs include the task context (TID), the
> KafkaDataConsumer logs do not, making it difficult to correlate Kafka read
> durations or latency with the specific Spark tasks that executed them. This
> limits visibility into which part of a job spent the most time consuming from
> Kafka.
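> For illustration, a minimal hypothetical sketch of how a task id could be
> prefixed onto consumer log lines. In Spark itself the id would come from
> {{TaskContext.get().taskAttemptId()}}; here a plain {{ThreadLocal}} stands in
> for the task context, and the class and method names are invented for the
> example, not taken from the Spark codebase:
> {code:java}
> // Illustrative sketch only, not Spark's actual implementation.
> public class TaskAwareLogging {
>     // Stand-in for Spark's TaskContext: holds the current task's id per thread.
>     private static final ThreadLocal<Long> CURRENT_TASK_ID = new ThreadLocal<>();
>
>     static void setTaskId(long taskId) { CURRENT_TASK_ID.set(taskId); }
>
>     // Prefix a consumer log message with the task id, if one is available.
>     static String logLine(String message) {
>         Long tid = CURRENT_TASK_ID.get();
>         String prefix = (tid == null) ? "TID=unknown" : "TID=" + tid;
>         return prefix + " " + message;
>     }
>
>     public static void main(String[] args) {
>         setTaskId(4188L);
>         // prints: TID=4188 From Kafka topicPartition=topic-0 groupId=g read 12922 records
>         System.out.println(logLine(
>             "From Kafka topicPartition=topic-0 groupId=g read 12922 records"));
>     }
> }
> {code}
> With such a prefix, the {{KafkaDataConsumer}} line in the example above could
> be matched to {{TID 4188}} directly, without cross-referencing Executor logs.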
> h3. *Impact*
> The absence of task context in {{KafkaDataConsumer}} logs creates several
> challenges:
> * *Performance Debugging:* Unable to pinpoint which specific Spark tasks
> experience slow Kafka consumption.
> * *Support/Troubleshooting:* Difficult to correlate customer incidents or
> partition-level delays with specific task executions.
> * *Metrics Analysis:* Cannot easily aggregate Kafka consumption metrics per
> task for performance profiling or benchmarking.
> * *Log Analysis:* Requires complex log correlation or pattern matching
> across multiple executors to map Kafka operations to specific tasks.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)