Eric Yang created SPARK-47017:
---------------------------------
Summary: Show metrics of the physical plan of RDDScanExec's
internal RDD in the history server
Key: SPARK-47017
URL: https://issues.apache.org/jira/browse/SPARK-47017
Project: Spark
Issue Type: New Feature
Components: Web UI
Affects Versions: 3.5.0, 3.4.0
Reporter: Eric Yang
RDDScanExec wraps an internal RDD (the rdd field shown below). In our
environment, this RDD is usually produced by a very large physical plan that
contains many physical nodes. Those nodes carry various metrics that are very
useful for understanding how the execution behaved and where there is room for
optimization.
{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow],  // <-- this field
    name: String,
{code}
However, neither this physical plan nor its metrics are visible in the SQL DAG
in the Spark History Server. Because it is an "existing RDD", the physical plan
can sometimes be found in a previous SQL execution, but the metrics are not
visible there either. This is because the "definitions" of these metrics are
reported with the SparkListenerSQLExecutionStart event of the "previous SQL"
(the one that contains the physical plan of RDDScanExec.rdd), whereas the
metric values are reported by the SparkListenerTaskEnd events of the tasks,
which are attached to the SQL that contains the RDDScanExec.
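As a rough illustration of this mismatch, here is a toy model in plain Scala.
It is not Spark code: MetricDef, ExecutionStart, TaskEnd and visibleMetrics are
hypothetical stand-ins for the listener events, and the rule that the UI keeps
only values whose accumulator IDs were declared by the execution the task
belongs to is an assumption inferred from the behavior described above.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark classes.
final case class MetricDef(accumulatorId: Long, name: String)
final case class ExecutionStart(executionId: Long, metricDefs: Seq[MetricDef])
final case class TaskEnd(executionId: Long, accumUpdates: Map[Long, Long])

object MetricJoinSketch {
  // Sums task-reported values per execution, keeping only the accumulator IDs
  // that the same execution declared in its start event (assumed UI behavior).
  def visibleMetrics(
      starts: Seq[ExecutionStart],
      taskEnds: Seq[TaskEnd]): Map[Long, Map[String, Long]] = {
    starts.map { start =>
      val names = start.metricDefs.map(d => d.accumulatorId -> d.name).toMap
      val updates = for {
        task <- taskEnds if task.executionId == start.executionId
        (id, value) <- task.accumUpdates.toSeq
        name <- names.get(id).toSeq // values with undeclared IDs are dropped
      } yield name -> value
      val totals = updates.groupBy(_._1).map { case (n, vs) => n -> vs.map(_._2).sum }
      start.executionId -> totals
    }.toMap
  }
}

object MetricJoinDemo {
  val starts = Seq(
    // Execution 1 plays the "previous SQL" that declares the upstream metric.
    ExecutionStart(1L, Seq(MetricDef(10L, "upstream: number of output rows"))),
    // Execution 2 plays the SQL that contains the RDDScanExec.
    ExecutionStart(2L, Seq(MetricDef(20L, "scan: number of output rows")))
  )
  // All tasks run under execution 2 but also update the upstream accumulator.
  val taskEnds = Seq(TaskEnd(2L, Map(10L -> 500L, 20L -> 500L)))
  val visible = MetricJoinSketch.visibleMetrics(starts, taskEnds)
}
```

In this model, execution 1 shows no values because no task is attached to it,
and execution 2 shows only the scan metric because accumulator 10 was never
declared there. The upstream value is lost in both views.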
!image-2024-02-09-09-34-33-442.png|width=380,height=345!
Could we consider showing the physical plan and metrics of RDDScanExec.rdd
(the "Scan Existing RDD" node in the DAG above)? For example, it could be shown
as a "leg" (similar to, but not the same as, a child) in the DAG, or in some
other form that exposes the physical plan and metrics.