Nishanth28 opened a new pull request, #52931:
URL: https://github.com/apache/spark/pull/52931

   ### What changes were proposed in this pull request?
   
   Currently, the log messages in PythonRunner and related Python execution 
classes do not include Spark task context information during Python UDF 
execution.
   
   This makes it harder to correlate Python worker timing metrics and data 
processing statistics with the specific Spark tasks that executed the UDFs, 
especially when debugging performance issues or data skew in production 
environments.
   
   This improvement adds task context details and data processing metrics to 
the log statements in the PythonRunner and PythonUDFRunner classes, improving 
the traceability and debugging of Python UDF execution.
   
   **Current Behaviour**
   
   When examining executor logs, there is a disconnect between task execution 
logs and Python runner logs:
   
   ```
   INFO PythonRunner: Times: total = 2529, boot = 478, init = 31, finish = 2020
   
   ❌ NO TASK CONTEXT - Cannot correlate with task 4188
   ❌ NO DATA METRICS - Cannot see records processed or data size
   ```
   
   **Expected Behaviour**
   
   After this enhancement, logs include task context information and data 
metrics:
   
   ```
   INFO PythonRunner: Starting Python task execution (Stage 32, Attempt 0) - task 83.0 in stage 32 (TID 4188)
   
   PythonRunner: Times: total = 2529, boot = 478, init = 31, finish = 2020 - Records: 10000, Data: 2.45 MB - task 83.0 in stage 32 (TID 4188)
   
   ✅ INCLUDES TASK CONTEXT - Easy to correlate with task 4188
   ✅ INCLUDES DATA METRICS - Shows 10000 records processed, 2.45 MB data
   ```
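
As an illustration of the correlation this enables, here is a minimal Python sketch that extracts the task identifiers and metrics from the proposed timing line. The regular expression is an assumption derived only from the example output above; the helper name `parse_timing_line` is hypothetical, and the final log format may differ.

```python
import re
from typing import Optional

# Pattern for the *proposed* timing line shown in the example above.
# Assumption: the layout is inferred from that single example and is not
# guaranteed to match the format that is finally merged.
TIMING_LINE = re.compile(
    r"PythonRunner: Times: total = (?P<total>\d+), boot = (?P<boot>\d+), "
    r"init = (?P<init>\d+), finish = (?P<finish>\d+)"
    r" - Records: (?P<records>\d+), Data: (?P<data>[\d.]+ \w+)"
    r" - task (?P<task>\d+)\.(?P<attempt>\d+) in stage (?P<stage>\d+) "
    r"\(TID (?P<tid>\d+)\)"
)

# Fields that are plain integers in the example log line.
INT_FIELDS = ("total", "boot", "init", "finish", "records",
              "task", "attempt", "stage", "tid")


def parse_timing_line(line: str) -> Optional[dict]:
    """Return the timing, data, and task fields of a log line, or None.

    Hypothetical helper: lines in the current format (without task
    context) do not match and yield None.
    """
    match = TIMING_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


# Sample taken from the "Expected Behaviour" example.
sample = (
    "PythonRunner: Times: total = 2529, boot = 478, init = 31, "
    "finish = 2020 - Records: 10000, Data: 2.45 MB - "
    "task 83.0 in stage 32 (TID 4188)"
)
parsed = parse_timing_line(sample)
print(parsed["tid"], parsed["total"], parsed["records"], parsed["data"])
```

Grouping the parsed records by `tid` or `stage` would then let timing outliers be matched against the corresponding task entries elsewhere in the executor log.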
   
   ### Why are the changes needed?
   
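   The changes enable correlation between task execution and Python UDF 
operations: with the task identifier on each PythonRunner log line, timing and 
data metrics can be matched to the Spark task that produced them.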
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   **Run existing test suite:**
   
   ```bash
   ./build/mvn -pl core -am test -DwildcardSuites=org.apache.spark.deploy.PythonRunnerSuite
   ```
   
   **Result:**
   ```
   Run completed in 1 second, 626 milliseconds.
   Total number of tests run: 3
   Suites: completed 2, aborted 0
   Tests: succeeded 3, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   [INFO] ------------------------------------------------------------------------
   [INFO] Reactor Summary for Spark Project Parent POM 4.2.0-SNAPSHOT:
   [INFO] 
   [INFO] Spark Project Parent POM ........................... SUCCESS [  1.554 s]
   [INFO] Spark Project Tags ................................. SUCCESS [  1.885 s]
   [INFO] Spark Project Common Java Utils .................... SUCCESS [  2.938 s]
   [INFO] Spark Project Common Utils ......................... SUCCESS [ 10.154 s]
   [INFO] Spark Project Local DB ............................. SUCCESS [  7.751 s]
   [INFO] Spark Project Networking ........................... SUCCESS [01:02 min]
   [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 11.541 s]
   [INFO] Spark Project Variant .............................. SUCCESS [  2.006 s]
   [INFO] Spark Project Unsafe ............................... SUCCESS [  7.967 s]
   [INFO] Spark Project Launcher ............................. SUCCESS [  3.467 s]
   [INFO] Spark Project Core ................................. SUCCESS [02:31 min]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  04:23 min
   [INFO] Finished at: 2025-11-07T10:49:09+05:30
   [INFO] ------------------------------------------------------------------------
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

