Nishanth created SPARK-54039:
--------------------------------
Summary: Add TaskID to KafkaDataConsumer logs for better
correlation between Spark tasks and Kafka operations
Key: SPARK-54039
URL: https://issues.apache.org/jira/browse/SPARK-54039
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.5.6
Reporter: Nishanth
Currently, the Kafka consumer logs generated by {{KafkaDataConsumer}} (in
{{org.apache.spark.sql.kafka010}}) do not include the Spark {{TaskID}} or
task context information.
This makes it difficult to correlate specific Kafka fetch or poll operations
with the corresponding Spark tasks during debugging and performance
investigations—especially when multiple tasks consume from the same Kafka topic
partitions concurrently on different executors.
Adding the Spark {{TaskID}} to the {{KafkaDataConsumer}} log statements would
significantly improve traceability and observability; a minimal sketch of one
possible approach follows the list below. It would allow engineers to:
* Identify which Spark task triggered a specific Kafka poll or fetch
* Diagnose task-level Kafka latency or deserialization bottlenecks
* Simplify debugging of offset commit or consumer lag anomalies when multiple
concurrent consumers are running
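One possible way to surface this context (a minimal sketch, not a concrete patch;
the helper name and call site below are illustrative) is to read the running task's
IDs from {{org.apache.spark.TaskContext}}, which is available on the executor thread
that drives {{KafkaDataConsumer}}:
{code:scala}
import org.apache.spark.TaskContext

object TaskContextLogging {
  // Prefix a log message with the current task's context, when one exists.
  // TaskContext.get() returns null on threads that are not running a task
  // (e.g. on the driver), so the message is left unchanged in that case.
  def withTaskInfo(msg: String): String =
    Option(TaskContext.get()) match {
      case Some(tc) =>
        s"[stage=${tc.stageId()}, partition=${tc.partitionId()}, TID=${tc.taskAttemptId()}] $msg"
      case None => msg
    }
}

// Hypothetical call site inside KafkaDataConsumer:
// logInfo(TaskContextLogging.withTaskInfo(
//   s"From Kafka topicPartition=$topicPartition groupId=$groupId read $numRecords records"))
{code}
Reading {{TaskContext}} lazily at log time avoids threading the task ID through
constructor signatures, though passing it in explicitly when the consumer is created
would work as well.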
h3. *Current Behavior*
Log messages emitted by the {{KafkaDataConsumer}} class in the Kafka source do not
currently include any {{TaskID}} information.
*Example log output:*
{code:java}
14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka
reader... taskId=4188 partitionId=83
14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=*******
groupId=******* read 12922 records through 80 polls (polled out 12979 records),
taking 299911.706198 ms, during time span of 303617.473973 ms  <-- No Task
Context Here
14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
Although the Executor logs include the task context (TID), the
KafkaDataConsumer logs do not, making it difficult to correlate Kafka read
durations or latency with the specific Spark tasks that executed them. This
limits visibility into which part of a job spent the most time consuming from
Kafka.
h3. *Impact*
The absence of task context in {{KafkaDataConsumer}} logs creates several
challenges:
* *Performance Debugging:* Unable to pinpoint which specific Spark tasks
experience slow Kafka consumption.
* *Support/Troubleshooting:* Difficult to correlate customer incidents or
partition-level delays with specific task executions.
* *Metrics Analysis:* Cannot easily aggregate Kafka consumption metrics per
task for performance profiling or benchmarking.
* *Log Analysis:* Requires complex log correlation or pattern matching across
multiple executors to map Kafka operations to specific tasks.