[
https://issues.apache.org/jira/browse/SPARK-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-14679:
------------------------------
Assignee: Ryan Blue
> UI DAG visualization causes OOM generating data
> -----------------------------------------------
>
> Key: SPARK-14679
> URL: https://issues.apache.org/jira/browse/SPARK-14679
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.6.1
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Fix For: 1.6.2, 2.0.0
>
>
> The UI will hit an OutOfMemoryError when generating the DAG visualization
> data for large Hive table scans. The problem is that the same data is
> duplicated in the output for each RDD, as with cluster10 here:
> {code}
> digraph G {
>   subgraph clusterstage_1 {
>     label="Stage 1";
>     subgraph cluster7 {
>       label="TungstenAggregate";
>       9 [label="MapPartitionsRDD [9]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster8 {
>       label="ConvertToUnsafe";
>       8 [label="MapPartitionsRDD [8]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>     subgraph cluster10 {
>       label="HiveTableScan";
>       7 [label="MapPartitionsRDD [7]\nrun at ThreadPoolExecutor.java:1142"];
>       6 [label="MapPartitionsRDD [6]\nrun at ThreadPoolExecutor.java:1142"];
>       5 [label="HadoopRDD [5]\nrun at ThreadPoolExecutor.java:1142"];
>     }
>   }
>   8->9;
>   6->7;
>   5->6;
>   7->8;
> }
> {code}
> Hive has a large number of RDDs because it creates an RDD for each partition
> in the scan returned by the metastore. Each of those RDDs results in another
> copy of the HiveTableScan cluster in the output. The data is built with a
> StringBuilder and then copied into a String, so the memory required grows
> very quickly.
> The cause is how the RDDOperationGraph is generated. For each RDD, a nested
> chain of RDDOperationCluster instances is produced, and those chains are then
> merged. But there is no implementation of equals for RDDOperationCluster, so
> the clusters are always considered distinct and are accumulated rather than
> [deduped|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala#L135].
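> A minimal sketch of the kind of change that would address this, assuming
> clusters with the same id represent the same operation (the class below is a
> simplified stand-in for org.apache.spark.ui.scope.RDDOperationCluster, not the
> actual patch):
> {code}
> // Simplified stand-in for RDDOperationCluster; the real class carries more
> // state, but the missing equals/hashCode is the relevant part here.
> class RDDOperationCluster(val id: String, val name: String) {
>   // Treat clusters with the same id as equal so that merging the per-RDD
>   // cluster chains collapses duplicates instead of accumulating them.
>   override def equals(other: Any): Boolean = other match {
>     case that: RDDOperationCluster => id == that.id
>     case _ => false
>   }
>
>   override def hashCode(): Int = id.hashCode
> }
> {code}
> With id-based equality, the merge step that builds the dot output would emit
> each cluster (e.g. cluster10 above) once per stage instead of once per RDD.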
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]