Nishanth28 commented on code in PR #52931:
URL: https://github.com/apache/spark/pull/52931#discussion_r2525720637
##########
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala:
##########
@@ -134,6 +133,15 @@ private[spark] object BasePythonRunner extends Logging {
} else None
}
+  /**
+   * Creates a task identifier string for logging, following Spark's standard format.
+   * Format: "task <partition>.<attempt> in stage <stageId> (TID <taskAttemptId>)"
+   */
+  private[spark] def taskIdentifier(context: TaskContext): String = {
+    s"task ${context.partitionId()}.${context.attemptNumber()} in stage ${context.stageId()} " +
Review Comment:
Thank you @HyukjinKwon for your input on this. Yes, TaskContext does provide
all of those methods (partitionId(), stageId(), taskAttemptId(), etc.).
I created the `taskIdentifier()` helper because the same identifier string is
built **9 times** across PythonRunner's log statements, mainly for
observability: when debugging UDF-related issues, a consistent identifier
makes it easy to track which task is involved and how much data it is
processing.
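A minimal, self-contained sketch of the deduplication this helper provides. `FakeTaskContext` below is a hypothetical stand-in with just the four fields the helper reads; the actual method takes Spark's `TaskContext`, whose accessors are methods like `partitionId()`:

```scala
// Stand-in for the handful of TaskContext accessors the helper uses
// (hypothetical; the real helper takes org.apache.spark.TaskContext).
final case class FakeTaskContext(
    partitionId: Int,
    attemptNumber: Int,
    stageId: Int,
    taskAttemptId: Long)

// Mirrors the documented format:
// "task <partition>.<attempt> in stage <stageId> (TID <taskAttemptId>)"
def taskIdentifier(ctx: FakeTaskContext): String =
  s"task ${ctx.partitionId}.${ctx.attemptNumber} in stage ${ctx.stageId} " +
    s"(TID ${ctx.taskAttemptId})"

// Each of the ~9 log call sites then shrinks from four interpolations
// to a single helper call, e.g.:
//   logInfo(s"Starting ${taskIdentifier(ctx)}")
```

Centralizing the format also means a future change to the identifier (say, adding the attempt's executor ID) touches one method instead of every log line.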
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]