[
https://issues.apache.org/jira/browse/SPARK-47017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Yang updated SPARK-47017:
------------------------------
Description:
RDDScanExec wraps an internal RDD (the rdd field shown below). In our
environment, this RDD is usually produced by very large physical plans that
contain many physical nodes. Those nodes carry various metrics that are very
useful for understanding how the execution behaved and where there is room
for optimization.
{code:java}
case class RDDScanExec(
    output: Seq[Attribute],
    rdd: RDD[InternalRow], // <-- this field
    name: String,
{code}
However, neither the physical plan nor its metrics are visible in the SQL DAG
in the Spark History Server. Because it is an "existing RDD", the physical
plan may be found under some previous SQL, but the metrics are not visible
from that previous SQL either. This is because the "definitions" of these
metrics are reported with the SparkListenerSQLExecutionStart event of the
"previous SQL" (the one that contains the physical plan of RDDScanExec.rdd),
while the metric values are reported in the SparkListenerTaskEnd events of
the tasks attached to the SQL that contains the RDDScanExec.
!ScanExistingRDD.jpg|width=336,height=296!
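To make the mismatch described above concrete, here is a toy model in plain Scala. It is only a sketch built from this description, not Spark's listener code: the case classes SqlExecutionStart and TaskEnd and the function visibleMetrics are hypothetical, simplified stand-ins. It assumes a UI that shows, for each execution, only the metrics that the same execution defined.

```scala
// Toy model of the definition/value mismatch. None of these types are
// Spark classes; they are hypothetical stand-ins for illustration.
object MetricJoinSketch {
  // Stand-in for SparkListenerSQLExecutionStart: accumulatorId -> metric name.
  final case class SqlExecutionStart(executionId: Long, metricNames: Map[Long, String])
  // Stand-in for SparkListenerTaskEnd: accumulatorId -> reported value,
  // tagged with the execution the task is attached to.
  final case class TaskEnd(executionId: Long, accumUpdates: Map[Long, Long])

  /** Per execution, keep only values whose definition came from that same execution. */
  def visibleMetrics(
      starts: Seq[SqlExecutionStart],
      ends: Seq[TaskEnd]): Map[Long, Map[String, Long]] = {
    starts.map { start =>
      val updates = ends.filter(_.executionId == start.executionId).flatMap(_.accumUpdates)
      val named = updates.collect {
        case (accId, value) if start.metricNames.contains(accId) =>
          start.metricNames(accId) -> value
      }
      val summed = named.groupBy(_._1).map { case (name, vs) => name -> vs.map(_._2).sum }
      start.executionId -> summed
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Execution 1 ("previous SQL") defines the upstream metric (accumulator 100).
    val previousSql = SqlExecutionStart(1L, Map(100L -> "upstream output rows"))
    // Execution 2 contains the RDDScanExec and defines only its own metric (200).
    val scanSql = SqlExecutionStart(2L, Map(200L -> "scan output rows"))
    // The task runs under execution 2 but updates both accumulators.
    val task = TaskEnd(2L, Map(100L -> 500L, 200L -> 500L))
    println(visibleMetrics(Seq(previousSql, scanSql), Seq(task)))
  }
}
```

In this model, execution 1 ends up with no metric values, and execution 2 shows only the scan metric. The value for accumulator 100 is dropped because its definition lives in a different execution, which matches the symptom reported here.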
Could we consider showing the physical plan and metrics of RDDScanExec.rdd
(the "Scan Existing RDD" node in the DAG above)? For example, they could be
shown as a "leg" (similar to, but not the same as, a child) in the DAG, or in
some other form that exposes the physical plan and metrics.
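For anyone who wants to look at such a plan locally, the following is a minimal sketch of one way an RDDScanExec can end up hiding an upstream plan. It assumes Spark SQL is on the classpath and that the DataFrame is rebuilt from another DataFrame's RDD, which may differ from how the RDD is produced in our environment. The object name, the helper existingRddScans, and the example query are made up for illustration. It only shows the plan shape and does not reproduce the "previous SQL" events.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.RDDScanExec

object ScanExistingRddSketch {
  /** Collect the RDDScanExec nodes of the physical plan (before preparation rules). */
  def existingRddScans(df: DataFrame): Seq[RDDScanExec] =
    df.queryExecution.sparkPlan.collect { case scan: RDDScanExec => scan }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("scan-existing-rdd-sketch")
      .getOrCreate()
    import spark.implicits._

    // An upstream query whose operators (filter, aggregate) have SQL metrics.
    val upstream = spark.range(0, 1000)
      .filter($"id" % 2 === 0)
      .groupBy(($"id" % 10).as("bucket"))
      .count()

    // Rebuilding a DataFrame from the RDD turns the upstream plan into an
    // opaque RDD lineage behind a single scan node.
    val rewrapped = spark.createDataFrame(upstream.rdd, upstream.schema)

    // The plan tree contains the scan node but none of the upstream operators.
    println(rewrapped.queryExecution.sparkPlan.treeString)
    existingRddScans(rewrapped).foreach(scan => println(s"${scan.nodeName} wraps ${scan.rdd}"))

    spark.stop()
  }
}
```

The upstream filter and aggregate are still executed by the tasks of the rewrapped query, but they do not appear as nodes in its plan tree, which is why there is nothing in the DAG to attach their metrics to.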
> Show metrics of the physical plan of RDDScanExec's internal RDD in the
> history server
> -------------------------------------------------------------------------------------
>
> Key: SPARK-47017
> URL: https://issues.apache.org/jira/browse/SPARK-47017
> Project: Spark
> Issue Type: New Feature
> Components: Web UI
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Eric Yang
> Priority: Major
> Attachments: ScanExistingRDD.jpg