Re: [PR] [SPARK-54223][PYTHON] Add task context and data metrics to Python runner logs [spark]

via GitHub Sun, 23 Nov 2025 03:54:30 -0800


Nishanth28 commented on code in PR #52931:
URL: https://github.com/apache/spark/pull/52931#discussion_r2554015288



##########
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala:
##########
@@ -1015,7 +1036,10 @@ private[spark] class PythonRunner(
         try {
           stream.readInt() match {
             case length if length >= 0 =>
-              PythonWorkerUtils.readBytes(length, stream)
+              val data = PythonWorkerUtils.readBytes(length, stream)
+              recordsProcessed += 1
+              totalDataReceived += length

Review Comment:
   ## Log Output Comparison
   
   ### BEFORE the Fix ❌
   ```
   INFO PythonRunner: Times: total = 1250, boot = 45, init = 67, finish = 32
   ```
   **Problems**:
   - No task identification (task, stage, or TID)
   - No batch count or data volume metrics
   - Difficult to correlate with the Spark UI
   
   ### AFTER the Fix ✅
   ```
   INFO PythonRunner: Starting Python task execution - task 83.0 in stage 32 (TID 4188)
   INFO PythonRunner: Times: total = 1250, boot = 45, init = 67, finish = 32 - Batches: 1000, Data: 2.45 MB - task 83.0 in stage 32 (TID 4188)
   ```
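   
   For reference, here is a minimal, self-contained sketch of how a line in the AFTER format could be assembled from the two counters in the diff. It is not the code in this PR: `PythonRunnerLogSketch`, `TaskIds`, `formatBytes`, and `timesLine` are hypothetical names, the binary-unit (1024-based) formatting is an assumption, and reading `83.0` as `partitionId.attemptNumber` is also an assumption.
   
   ```scala
   import java.util.Locale

   // Hypothetical sketch, not the PR's implementation.
   object PythonRunnerLogSketch {

     /** Formats a byte count using 1024-based units (assumed, not confirmed by the PR). */
     def formatBytes(bytes: Long): String = {
       val units = Seq("B", "KB", "MB", "GB", "TB")
       var value = bytes.toDouble
       var idx = 0
       while (value >= 1024 && idx < units.length - 1) {
         value /= 1024
         idx += 1
       }
       // Plain bytes are printed without decimals; larger units get two decimals.
       if (idx == 0) s"$bytes B"
       else "%.2f %s".formatLocal(Locale.ROOT, value, units(idx))
     }

     /** Identifiers assumed to come from the task context; field names are illustrative. */
     final case class TaskIds(
         partitionId: Int,
         attemptNumber: Int,
         stageId: Int,
         taskAttemptId: Long) {
       def describe: String =
         s"task ${partitionId}.${attemptNumber} in stage ${stageId} (TID ${taskAttemptId})"
     }

     /** Builds the message body that follows the "INFO PythonRunner: " prefix. */
     def timesLine(
         total: Long,
         boot: Long,
         init: Long,
         finish: Long,
         batches: Long,
         bytes: Long,
         task: TaskIds): String = {
       s"Times: total = $total, boot = $boot, init = $init, finish = $finish - " +
         s"Batches: $batches, Data: ${formatBytes(bytes)} - ${task.describe}"
     }
   }
   ```
   
   In this sketch, `batches` and `bytes` would be fed from `recordsProcessed` and `totalDataReceived`, and the task suffix is shared by the start and end messages so both can be matched to the same row in the Spark UI.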



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


