[
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312109#comment-16312109
]
Sahil Takiar commented on HIVE-18368:
-------------------------------------
* Spark provides a nice RDD graph via {{RDD#toDebugString}} - I replaced
{{SparkPlan#logSparkPlan}} and {{SparkUtilities#rddGraphToString}} with this
graph. It includes all the info from both of those graphs, plus more. It's
very similar to the info that is shown in the Spark Web UI. An example is
below.
* Added explicit names for each RDD; the name is derived from the name of the
{{BaseWork}} that corresponds to the RDD, along with the {{SparkEdgeProperty}}
(if there is one). The example below shows this in detail.
** The nice thing about adding explicit names is that they show up in Spark Web
UI too, which can be very useful for mapping a Hive Explain Plan to the Spark
RDD DAG
** The name includes the number of partitions for the RDD as well as whether or
not the RDD is cached
* I originally wanted to find a way to display this in the {{EXPLAIN EXTENDED}}
output, but for now that may be a bit difficult, because the {{SparkPlan}} is
only generated in the {{RemoteDriver}} - it's probably possible to generate the
{{SparkPlan}} somewhere in the {{ExplainTask}}, but I'll save that for a later
JIRA
* The Spark RDD Graph is printed at INFO level, which I think should help with
debugging
* I've attached a screenshot of what the Spark Web UI looks like with named
RDDs
Spark RDD Graph:
{code}
(1) Reducer 5 (1) MapPartitionsRDD[25] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 []
+-(166) Reducer 4 (166) MapPartitionsRDD[23] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 4 (PARTITION-LEVEL SORT, 166) ShuffledRDD[22] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(328) UnionRDD (328) UnionRDD[21] at union at SparkPlan.java:70 []
| Reducer 3 (328) MapPartitionsRDD[19] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 3 (PARTITION-LEVEL SORT, 328) ShuffledRDD[18] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(874) UnionRDD (874) UnionRDD[17] at union at SparkPlan.java:70 []
| UnionRDD (874) UnionRDD[16] at union at SparkPlan.java:70 []
| Reducer 2 (437) MapPartitionsRDD[11] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 2 (GROUP, 437) MapPartitionsRDD[10] at groupByKey at GroupByShuffler.java:31 []
| ShuffledRDD[9] at groupByKey at GroupByShuffler.java:31 []
+-(0) Map 1 (0) MapPartitionsRDD[8] at mapPartitionsToPair at MapTran.java:41 []
| Map 1 (store_sales, 0) HadoopRDD[4] at hadoopRDD at SparkPlanGenerator.java:203 []
| Reducer 8 (437) MapPartitionsRDD[14] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 8 (GROUP PARTITION-LEVEL SORT, 437) ShuffledRDD[13] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(0) Map 7 (0) MapPartitionsRDD[12] at mapPartitionsToPair at MapTran.java:41 []
| Map 7 (store_sales, 0) HadoopRDD[5] at hadoopRDD at SparkPlanGenerator.java:203 []
| Map 10 (0) MapPartitionsRDD[15] at mapPartitionsToPair at MapTran.java:41 []
| Map 10 (store, 0) HadoopRDD[6] at hadoopRDD at SparkPlanGenerator.java:203 []
| Map 11 (0) MapPartitionsRDD[20] at mapPartitionsToPair at MapTran.java:41 []
| Map 11 (item, 0) HadoopRDD[7] at hadoopRDD at SparkPlanGenerator.java:203 []
{code}
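For illustration only (this is not the code from the patch), the names in the graph above, such as {{Reducer 4 (PARTITION-LEVEL SORT, 166)}} or {{Map 1 (0)}}, could be assembled roughly like this. The class {{RddNameSketch}}, the method {{rddName}}, and the {{CACHED}} marker are all hypothetical; the graph above does not show how a cached RDD is actually labeled, so that part is an assumption.

```java
import java.util.Optional;

// Hypothetical sketch: builds an RDD display name from the pieces described above.
// None of these names come from the Hive code base.
final class RddNameSketch {

    // workName:      name of the BaseWork, e.g. "Reducer 4"
    // detail:        optional shuffle type or table name, e.g. "PARTITION-LEVEL SORT"
    // numPartitions: number of partitions of the RDD
    // cached:        whether the RDD is cached (the "CACHED" marker is an assumption)
    static String rddName(String workName, Optional<String> detail,
                          int numPartitions, boolean cached) {
        StringBuilder sb = new StringBuilder(workName).append(" (");
        detail.ifPresent(d -> sb.append(d).append(", "));
        sb.append(numPartitions);
        if (cached) {
            sb.append(", CACHED");
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(rddName("Reducer 4", Optional.of("PARTITION-LEVEL SORT"), 166, false));
        System.out.println(rddName("Map 1", Optional.empty(), 0, false));
    }
}
```

In a real Spark job, a string like this would presumably be attached to the RDD with its {{setName}} method, which is what makes it appear in both {{toDebugString}} and the Spark Web UI.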
> Improve Spark Debug RDD Graph
> -----------------------------
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Attachments: Spark UI - Named RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between
> different {{SparkTran}}, what shuffle types are used, and what trans are
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects,
> RDDs, and BaseWorks. Edges should include information about the number of
> partitions, shuffle types, Spark operations used, etc.