[ 
https://issues.apache.org/jira/browse/SPARK-54039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishanth updated SPARK-54039:
-----------------------------
    Description: 
Currently, the Kafka consumer logs generated by {{KafkaDataConsumer}} do not 
include the Spark {{TaskID}} or task context information.

This makes it difficult to correlate specific Kafka fetch or poll operations 
with the corresponding Spark tasks during debugging and performance 
investigations—especially when multiple tasks consume from the same Kafka topic 
partitions concurrently on different executors.

Adding the Spark {{TaskID}} to the Kafka consumer log statements would 
significantly improve traceability and observability; a sketch of one way to 
obtain it is shown after the list below. It would allow engineers to:
 * Identify which Spark task triggered a specific Kafka poll or fetch

 * Diagnose task-level Kafka latency or deserialization bottlenecks

 * Simplify debugging of offset commit or consumer lag anomalies when multiple 
concurrent consumers are running
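
A minimal sketch of one way to obtain the task context, using 
{{TaskContext.get()}} from the public Spark API. The helper name and the 
prefix format below are illustrative only, not the actual 
{{KafkaDataConsumer}} implementation:
{code:scala}
import org.apache.spark.TaskContext

// Illustrative helper: build a task-context prefix for Kafka consumer log lines.
// TaskContext.get() returns the context of the currently running task, or null
// when called outside of a task (e.g. on the driver), so the prefix is optional.
object KafkaConsumerLogContext {
  def taskIdPrefix(): String = {
    val tc = TaskContext.get()
    if (tc == null) "" else s"TID=${tc.taskAttemptId()} partitionId=${tc.partitionId()} "
  }
}

// Hypothetical call site inside KafkaDataConsumer:
// logInfo(s"${KafkaConsumerLogContext.taskIdPrefix()}From Kafka topicPartition=$tp " +
//   s"groupId=$groupId read $numRecords records through $numPolls polls ...")
{code}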

h3. *Current Behavior*

The current Kafka source consumer logging framework does not include {{TaskID}} 
information in the {{KafkaDataConsumer}} class.

*Example log output:*
{code:java}
14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka 
reader... taskId=4188 partitionId=83
14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* 
groupId=******* read 12922 records through 80 polls (polled out 12979 records), 
taking 299911.706198 ms, during time span of 303617.473973 ms  <-- No Task 
Context Here
14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
Although the Executor logs include the task context (TID), the 
KafkaDataConsumer logs do not, making it difficult to correlate Kafka read 
durations or latency with the specific Spark tasks that executed them. This 
limits visibility into which part of a job spent the most time consuming from 
Kafka.
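For illustration, with the task context included, the same 
{{KafkaDataConsumer}} line could look like the following (hypothetical format, 
not an existing log message; values reused from the example above):
{code:java}
14:35:19 INFO KafkaDataConsumer: TID=4188 partitionId=83 From Kafka topicPartition=*******
groupId=******* read 12922 records through 80 polls (polled out 12979 records),
taking 299911.706198 ms, during time span of 303617.473973 ms
{code}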
h3. *Impact*

The absence of task context in {{KafkaDataConsumer}} logs creates several 
challenges:
 * *Performance Debugging:* Unable to pinpoint which specific Spark tasks 
experience slow Kafka consumption.

 * *Support/Troubleshooting:* Difficult to correlate customer incidents or 
partition-level delays with specific task executions.

 * *Metrics Analysis:* Cannot easily aggregate Kafka consumption metrics per 
task for performance profiling or benchmarking.

 * *Log Analysis:* Requires complex log correlation or pattern matching across 
multiple executors to map Kafka operations to specific tasks.

  was:
Currently, the Kafka consumer logs generated by {{KafkaDataConsumer}} (in 
{{{}org.apache.spark.sql.kafka010{}}}) do not include the Spark {{TaskID}} or 
task context information.

This makes it difficult to correlate specific Kafka fetch or poll operations 
with the corresponding Spark tasks during debugging and performance 
investigations—especially when multiple tasks consume from the same Kafka topic 
partitions concurrently on different executors.

Adding the Spark {{TaskID}} to the Kafka consumer log statements would 
significantly improve traceability and observability. It would allow engineers 
to:
 * Identify which Spark task triggered a specific Kafka poll or fetch

 * Diagnose task-level Kafka latency or deserialization bottlenecks

 * Simplify debugging of offset commit or consumer lag anomalies when multiple 
concurrent consumers are running

h3. *Current Behavior*

The current Kafka source consumer logging framework does not include {{TaskID}} 
information in the {{KafkaDataConsumer}} class.

*Example log output:*
{code:java}
14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka 
reader... taskId=4188 partitionId=83
14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* 
groupId=******* read 12922 records through 80 polls (polled out 12979 records), 
taking 299911.706198 ms, during time span of 303617.473973 ms  <-- No Task 
Context Here
14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
Although the Executor logs include the task context (TID), the 
KafkaDataConsumer logs do not, making it difficult to correlate Kafka read 
durations or latency with the specific Spark tasks that executed them. This 
limits visibility into which part of a job spent the most time consuming from 
Kafka.
h3. *Impact*

The absence of task context in {{KafkaDataConsumer}} logs creates several 
challenges:
 * *Performance Debugging:* Unable to pinpoint which specific Spark tasks 
experience slow Kafka consumption.

 * *Support/Troubleshooting:* Difficult to correlate customer incidents or 
partition-level delays with specific task executions.

 * *Metrics Analysis:* Cannot easily aggregate Kafka consumption metrics per 
task for performance profiling or benchmarking.

 * *Log Analysis:* Requires complex log correlation or pattern matching across 
multiple executors to map Kafka operations to specific tasks.


> SS | `Add TaskID to KafkaDataConsumer logs for better correlation between 
> Spark tasks and Kafka operations`
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-54039
>                 URL: https://issues.apache.org/jira/browse/SPARK-54039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.4.1, 3.5.0, 3.5.2, 4.0.0
>            Reporter: Nishanth
>            Priority: Major
>             Fix For: 4.1.0
>
>
> Currently, the Kafka consumer logs generated by {{KafkaDataConsumer}} do not 
> include the Spark {{TaskID}} or task context information.
> This makes it difficult to correlate specific Kafka fetch or poll operations 
> with the corresponding Spark tasks during debugging and performance 
> investigations—especially when multiple tasks consume from the same Kafka 
> topic partitions concurrently on different executors.
> Adding the Spark {{TaskID}} to the Kafka consumer log statements would 
> significantly improve traceability and observability. It would allow 
> engineers to:
>  * Identify which Spark task triggered a specific Kafka poll or fetch
>  * Diagnose task-level Kafka latency or deserialization bottlenecks
>  * Simplify debugging of offset commit or consumer lag anomalies when 
> multiple concurrent consumers are running
> h3. *Current Behavior*
> The current Kafka source consumer logging framework does not include 
> {{TaskID}} information in the {{KafkaDataConsumer}} class.
> *Example log output:*
> {code:java}
> 14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
> 14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka 
> reader... taskId=4188 partitionId=83
> 14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* 
> groupId=******* read 12922 records through 80 polls (polled out 12979 
> records), taking 299911.706198 ms, during time span of 303617.473973 ms  <-- 
> No Task Context Here
> 14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
> Although the Executor logs include the task context (TID), the 
> KafkaDataConsumer logs do not, making it difficult to correlate Kafka read 
> durations or latency with the specific Spark tasks that executed them. This 
> limits visibility into which part of a job spent the most time consuming from 
> Kafka.
> h3. *Impact*
> The absence of task context in {{KafkaDataConsumer}} logs creates several 
> challenges:
>  * *Performance Debugging:* Unable to pinpoint which specific Spark tasks 
> experience slow Kafka consumption.
>  * *Support/Troubleshooting:* Difficult to correlate customer incidents or 
> partition-level delays with specific task executions.
>  * *Metrics Analysis:* Cannot easily aggregate Kafka consumption metrics per 
> task for performance profiling or benchmarking.
>  * *Log Analysis:* Requires complex log correlation or pattern matching 
> across multiple executors to map Kafka operations to specific tasks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
