[ https://issues.apache.org/jira/browse/SPARK-54039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nishanth updated SPARK-54039:
-----------------------------
    Description: 
Kafka consumer logs generated by {{KafkaDataConsumer}} do not include the Spark 
{{TaskID}} or task context information.

This makes it difficult to correlate specific Kafka fetch or poll operations 
with the corresponding Spark tasks during debugging and performance 
investigations, especially when multiple tasks consume from the same Kafka 
topic partitions concurrently on different executors.
h3. *Current Behavior*

The current Kafka source consumer logging framework does not include {{TaskID}} 
information in the {{KafkaDataConsumer}} class.

*Example log output:*
{code:java}
14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* groupId=******* read 12922 records through 80 polls (polled out 12979 records), taking 299911.706198 ms, during time span of 303617.473973 ms  <-- No Task Context Here
14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
Although the Executor logs include the task context (TID), the 
KafkaDataConsumer logs do not, making it difficult to correlate Kafka read 
durations or latency with the specific Spark tasks that executed them. This 
limits visibility into which part of a job spends the most time consuming from 
Kafka.
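The sketch below illustrates what a task-aware consumer log line could look like. It is a minimal standalone illustration, not Spark's actual implementation: on a real executor the TID would come from {{TaskContext.get().taskAttemptId()}}, but here it is passed as a plain parameter so the snippet compiles and runs without Spark on the classpath, and the class name, method name, and the {{events-0}} partition value are hypothetical.
{code:java}
import java.util.Locale;

// Hypothetical sketch: prefixing a KafkaDataConsumer-style log message with
// the Spark task ID. In a real executor the ID would be obtained from
// TaskContext.get().taskAttemptId(); here it is an explicit parameter so the
// example is self-contained.
public class TaskAwareLogging {

    // Builds a consumer log line that carries the TID alongside the metrics,
    // so it can be correlated with the Executor's "Running task ... (TID n)"
    // and "Finished task ... (TID n)" lines.
    static String consumerLogLine(long taskId, String topicPartition,
                                  long records, long polls, double millis) {
        return String.format(Locale.ROOT,
                "TID %d KafkaDataConsumer: From Kafka topicPartition=%s "
                + "read %d records through %d polls, taking %.3f ms",
                taskId, topicPartition, records, polls, millis);
    }

    public static void main(String[] args) {
        // Mirrors the sample log from the description, now carrying the TID.
        System.out.println(
            consumerLogLine(4188, "events-0", 12922, 80, 299911.706198));
    }
}
{code}
With such a prefix, the consumer line in the example above could be matched to its task simply by grepping the executor log for {{TID 4188}}.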

  was:
Kafka consumer logs generated by {{KafkaDataConsumer}} do not include the Spark 
{{TaskID}} or task context information.

This makes it difficult to correlate specific Kafka fetch or poll operations 
with the corresponding Spark tasks during debugging and performance 
investigations—especially when multiple tasks consume from the same Kafka topic 
partitions concurrently on different executors.
h3. *Current Behavior*

The current Kafka source consumer logging framework does not include {{TaskID}} 
information in the {{KafkaDataConsumer}} class.

*Example log output:*
{code:java}
14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka reader... taskId=4188 partitionId=83
14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* groupId=******* read 12922 records through 80 polls (polled out 12979 records), taking 299911.706198 ms, during time span of 303617.473973 ms  <-- No Task Context Here
14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
Although the Executor logs include the task context (TID), the 
KafkaDataConsumer logs do not, making it difficult to correlate Kafka read 
durations or latency with the specific Spark tasks that executed them. This 
limits visibility into which part of a job spends the most time consuming from 
Kafka.


> SS | `Add TaskID to KafkaDataConsumer logs for better correlation between 
> Spark tasks and Kafka operations`
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-54039
>                 URL: https://issues.apache.org/jira/browse/SPARK-54039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.4.1, 3.5.0, 3.5.2, 4.0.0
>            Reporter: Nishanth
>            Priority: Major
>              Labels: pull-request-available
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
