[
https://issues.apache.org/jira/browse/SPARK-26723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Matveev updated SPARK-26723:
-------------------------------------
Attachment: Screen Shot 2019-01-24 at 4.13.14 PM.png
Screen Shot 2019-01-24 at 4.13.02 PM.png
> Spark web UI only shows parts of SQL query graphs for queries with persist
> operations
> -------------------------------------------------------------------------------------
>
> Key: SPARK-26723
> URL: https://issues.apache.org/jira/browse/SPARK-26723
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.3.2
> Reporter: Vladimir Matveev
> Priority: Major
> Attachments: Screen Shot 2019-01-24 at 4.13.02 PM.png, Screen Shot
> 2019-01-24 at 4.13.14 PM.png
>
>
> Currently it looks like the SQL view in the Spark UI truncates the graph at
> the nodes corresponding to persist operations on the dataframe, showing only
> the part after "LocalTableScan". This is *very* inconvenient, because in the
> common case where you have a heavy computation and want to persist it before
> writing to multiple outputs with some minor preprocessing, you lose almost
> the entire graph, along with the potentially very useful information in it.
> The query plans below the graph, however, show the full query, including all
> computations before the persist. Unfortunately, for complex queries reading
> the textual plan is infeasible, and the graph visualization becomes a very
> helpful tool; with persist, it is apparently broken.
> You can verify this in the Spark shell with a very simple example:
> {code}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> // toDF() relies on the implicits that the Spark shell imports automatically
> val query = Vector(1, 2, 3).toDF()
>   .select(($"value".cast("long") * f.rand()).as("value"))
>   .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
> query.show()
> query.persist().show()
> {code}
> Here the same query is executed first without persist, and then with it. If
> you now navigate to the SQL page of the Spark web UI, you'll see two queries,
> but their graphs will be radically different: the one without persist
> contains the whole transformation with exchange, sort and window steps, while
> the one with persist contains only a LocalTableScan step plus the few
> intermediate transformations needed for `show`.
> After looking into the Spark code, I think the reason is that the
> `org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method
> (which is used to serialize a plan before emitting the
> SparkListenerSQLExecutionStart event) constructs the `SparkPlanInfo` object
> from a `SparkPlan` object incompletely: invoking the `toString` method on the
> `SparkPlan` shows the entire plan, but the resulting `SparkPlanInfo` object
> only contains the nodes corresponding to operations after `persist`.
> However, my knowledge of Spark internals is not deep enough to understand how
> to fix this, or how `SparkPlanInfo.fromSparkPlan` differs from what
> `SparkPlan.toString` does.
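> A minimal self-contained sketch (stand-in classes, not actual Spark code) of
> what I suspect is happening: if the conversion recurses only into a node's
> regular `children`, then a subtree that a cache-scan node exposes through a
> separate field (the way an in-memory table scan holds the cached plan) gets
> dropped, while a `toString`-style traversal that also follows such inner
> children prints the whole plan:
> {code}
> sealed trait Node {
>   def name: String
>   def children: Seq[Node]
>   def innerChildren: Seq[Node] = Nil  // extra subtrees not in `children`
> }
> case class Leaf(name: String) extends Node { def children = Nil }
> case class Unary(name: String, child: Node) extends Node { def children = Seq(child) }
> // analogue of a cache scan node: the cached plan is not a regular child
> case class CacheScan(name: String, cached: Node) extends Node {
>   def children = Nil
>   override def innerChildren = Seq(cached)
> }
>
> // mimics the suspected fromSparkPlan behavior: walks `children` only
> def infoNodes(n: Node): Seq[String] =
>   n.name +: n.children.flatMap(infoNodes)
> // mimics toString: also descends into inner children
> def allNodes(n: Node): Seq[String] =
>   n.name +: (n.children ++ n.innerChildren).flatMap(allNodes)
>
> val plan = Unary("Project",
>   CacheScan("InMemoryTableScan",
>     Unary("Window", Unary("Sort", Leaf("LocalTableScan")))))
> infoNodes(plan) // Seq(Project, InMemoryTableScan) -- truncated at the cache
> allNodes(plan)  // the full plan, down to LocalTableScan
> {code}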
> This was observed on Spark 2.3.2, but given that the SparkPlanInfo code in
> 2.4.0 does not seem to have changed much since 2.3.2, I'd expect it to be
> reproducible there as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]