[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312109#comment-16312109 ]
Sahil Takiar commented on HIVE-18368:
-------------------------------------

* Spark provides a nice RDD graph via {{RDD#toDebugString}}. I replaced {{SparkPlan#logSparkPlan}} and {{SparkUtilities#rddGraphToString}} with this graph. It includes all the info from both of those graphs, plus more, and is very similar to what is shown in the Spark Web UI. An example is below.
* Added explicit names for each RDD; the name is derived from the name of the {{BaseWork}} that corresponds to the RDD, along with the {{SparkEdgeProperty}} (if there is one). The example below shows this in detail.
** The nice thing about adding explicit names is that they show up in the Spark Web UI too, which can be very useful for mapping a Hive Explain Plan to the Spark RDD DAG
** The name includes the number of partitions for the RDD as well as whether or not the RDD is cached
* I originally wanted to find a way to display this in the {{EXPLAIN EXTENDED}} output, but for now that may be a bit difficult, because the {{SparkPlan}} is only generated in the {{RemoteDriver}}. It's probably possible to generate the {{SparkPlan}} somewhere in the {{ExplainTask}}, but I'll save that for a later JIRA
* The Spark RDD Graph is printed at INFO level, which I think should help with debugging
* I've attached a screenshot of what the Spark Web UI looks like with named RDDs

Spark RDD Graph:

{code}
(1) Reducer 5 (1) MapPartitionsRDD[25] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 []
+-(166) Reducer 4 (166) MapPartitionsRDD[23] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 4 (PARTITION-LEVEL SORT, 166) ShuffledRDD[22] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(328) UnionRDD (328) UnionRDD[21] at union at SparkPlan.java:70 []
| Reducer 3 (328) MapPartitionsRDD[19] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 3 (PARTITION-LEVEL SORT, 328) ShuffledRDD[18] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(874) UnionRDD (874) UnionRDD[17] at union at SparkPlan.java:70 []
| UnionRDD (874) UnionRDD[16] at union at SparkPlan.java:70 []
| Reducer 2 (437) MapPartitionsRDD[11] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 2 (GROUP, 437) MapPartitionsRDD[10] at groupByKey at GroupByShuffler.java:31 []
| ShuffledRDD[9] at groupByKey at GroupByShuffler.java:31 []
+-(0) Map 1 (0) MapPartitionsRDD[8] at mapPartitionsToPair at MapTran.java:41 []
| Map 1 (store_sales, 0) HadoopRDD[4] at hadoopRDD at SparkPlanGenerator.java:203 []
| Reducer 8 (437) MapPartitionsRDD[14] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 8 (GROUP PARTITION-LEVEL SORT, 437) ShuffledRDD[13] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(0) Map 7 (0) MapPartitionsRDD[12] at mapPartitionsToPair at MapTran.java:41 []
| Map 7 (store_sales, 0) HadoopRDD[5] at hadoopRDD at SparkPlanGenerator.java:203 []
| Map 10 (0) MapPartitionsRDD[15] at mapPartitionsToPair at MapTran.java:41 []
| Map 10 (store, 0) HadoopRDD[6] at hadoopRDD at SparkPlanGenerator.java:203 []
| Map 11 (0) MapPartitionsRDD[20] at mapPartitionsToPair at MapTran.java:41 []
| Map 11 (item, 0) HadoopRDD[7] at hadoopRDD at SparkPlanGenerator.java:203 []
{code}

> Improve Spark Debug RDD Graph
> -----------------------------
>
>                 Key: HIVE-18368
>                 URL: https://issues.apache.org/jira/browse/HIVE-18368
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: Spark UI - Named RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between
> different {{SparkTran}}, what shuffle types are used, and what trans are
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects,
> RDDs, and BaseWorks. Edges should include information about number of
> partitions, shuffle types, Spark operations used, etc.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
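
As an illustration of the RDD naming scheme described in the comment above, here is a minimal Java sketch that builds names in the style seen in the graph, such as {{Reducer 4 (PARTITION-LEVEL SORT, 166)}}. The class {{RddNameSketch}}, the method {{rddName}}, and its parameters are hypothetical and are not taken from the patch. The graph above does not show how a cached RDD is marked, so the {{cached}} suffix used here is purely an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT Hive's actual code: builds an RDD display name
// in the style of the graph above, e.g. "Reducer 4 (PARTITION-LEVEL SORT, 166)".
class RddNameSketch {

  // workName:      name of the BaseWork, e.g. "Reducer 4" or "Map 1"
  // label:         optional detail such as the shuffle type or table name; may be null
  // numPartitions: number of partitions of the RDD
  // cached:        whether the RDD is cached
  static String rddName(String workName, String label, int numPartitions, boolean cached) {
    List<String> details = new ArrayList<>();
    if (label != null && !label.isEmpty()) {
      details.add(label);
    }
    details.add(Integer.toString(numPartitions));
    if (cached) {
      // Assumed marker: the example graph does not show the real format.
      details.add("cached");
    }
    return workName + " (" + String.join(", ", details) + ")";
  }

  public static void main(String[] args) {
    System.out.println(rddName("Reducer 5", null, 1, false));
    System.out.println(rddName("Reducer 4", "PARTITION-LEVEL SORT", 166, false));
    System.out.println(rddName("Map 1", "store_sales", 0, true));
  }
}
```

In Spark, a string like this can be attached to an RDD with {{RDD#setName}}, after which it appears in the {{RDD#toDebugString}} output. Whether the patch wires the name in exactly this way is an assumption.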