[
https://issues.apache.org/jira/browse/SPARK-54223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nishanth updated SPARK-54223:
-----------------------------
Description:
Currently, log messages in {{PythonRunner}} do not include Spark Task Context
information such as the Task ID or Partition ID.
Additionally, the logs lack useful execution metrics (e.g., number of records
processed, data size), which makes it difficult to correlate Python process
behavior with specific Spark tasks.
When debugging UDF performance issues, hangs, or data skew, it’s challenging to
identify which task or dataset portion caused the issue.
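As a rough illustration of the proposed enrichment: inside a task, PySpark exposes the relevant identifiers via {{TaskContext.get()}} ({{partitionId()}}, {{taskAttemptId()}}), so a log prefix could carry them alongside execution metrics. The helper below is a hypothetical sketch of such a log format, not part of Spark's codebase; the field names ({{task}}, {{partition}}, {{records}}, {{bytes}}) are assumptions for illustration.

```python
def format_python_runner_log(msg, task_id=None, partition_id=None,
                             num_records=None, data_bytes=None):
    """Prefix a log message with task context and data metrics when available.

    Hypothetical helper illustrating the proposed log format; field names
    are placeholders, not an actual Spark API.
    """
    ctx = []
    if task_id is not None:
        ctx.append(f"task={task_id}")
    if partition_id is not None:
        ctx.append(f"partition={partition_id}")
    if num_records is not None:
        ctx.append(f"records={num_records}")
    if data_bytes is not None:
        ctx.append(f"bytes={data_bytes}")
    # Fall back to the plain message when no context is known
    # (e.g. driver-side logging, where TaskContext.get() returns None).
    return f"[{' '.join(ctx)}] {msg}" if ctx else msg
```

With such a prefix, a hang or skew investigation can grep executor logs for a specific task or partition instead of matching Python worker PIDs to tasks by hand.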
was:
Currently, log messages in {{PythonRunner}} (used to execute Python UDFs) do
not include Spark Task Context information such as Task ID or Partition ID.
When debugging UDF performance issues or hangs, it’s difficult to trace which
Spark task or partition corresponds to the specific Python process that logged
the message.
> Add task context and data metrics to Python runner logs
> -------------------------------------------------------
>
> Key: SPARK-54223
> URL: https://issues.apache.org/jira/browse/SPARK-54223
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.1, 4.0.1
> Reporter: Nishanth
> Priority: Major
>
> Currently, log messages in PythonRunner do not include Spark Task Context
> information such as the Task ID or Partition ID.
> Additionally, the logs lack useful execution metrics (e.g., number of records
> processed, data size), which makes it difficult to correlate Python process
> behavior with specific Spark tasks.
> When debugging UDF performance issues, hangs, or data skew, it’s challenging
> to identify which task or dataset portion caused the issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)