[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

Sahil Takiar (JIRA) Thu, 04 Jan 2018 14:15:33 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312109#comment-16312109
 ]


Sahil Takiar commented on HIVE-18368:
-------------------------------------

* Spark provides a nice RDD graph via {{RDD#toDebugString}} - I replaced the 
{{SparkPlan#logSparkPlan}} and {{SparkUtilities#rddGraphToString}} graphs with 
this graph. It includes all the info from both of those graphs, plus more. It's 
very similar to the info that is shown in the Spark Web UI. An example is 
below.
* Added explicit names for each RDD; the name is derived from the name of the 
{{BaseWork}} that corresponds to the RDD, along with the {{SparkEdgeProperty}} 
(if there is one). The example below shows this in detail.
** The nice thing about adding explicit names is that they show up in Spark Web 
UI too, which can be very useful for mapping a Hive Explain Plan to the Spark 
RDD DAG
** The name includes the number of partitions for the RDD as well as whether or 
not the RDD is cached
* I originally wanted to find a way to display this in the {{EXPLAIN EXTENDED}} 
output, but for now that may be a bit difficult, because the {{SparkPlan}} is 
only generated in the {{RemoteDriver}} - it's probably possible to generate the 
{{SparkPlan}} somewhere in the {{ExplainTask}}, but I'll save that for a later 
JIRA
* The Spark RDD Graph is printed at INFO level, which I think should help with 
debugging
* I've attached a screenshot of what the Spark Web UI looks like with named 
RDDs

Spark RDD Graph:

{code}
(1) Reducer 5 (1) MapPartitionsRDD[25] at mapPartitionsToPair at ReduceTran.java:41 []
 |  Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 []
 +-(166) Reducer 4 (166) MapPartitionsRDD[23] at mapPartitionsToPair at ReduceTran.java:41 []
     |   Reducer 4 (PARTITION-LEVEL SORT, 166) ShuffledRDD[22] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
     +-(328) UnionRDD (328) UnionRDD[21] at union at SparkPlan.java:70 []
         |   Reducer 3 (328) MapPartitionsRDD[19] at mapPartitionsToPair at ReduceTran.java:41 []
         |   Reducer 3 (PARTITION-LEVEL SORT, 328) ShuffledRDD[18] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
         +-(874) UnionRDD (874) UnionRDD[17] at union at SparkPlan.java:70 []
             |   UnionRDD (874) UnionRDD[16] at union at SparkPlan.java:70 []
             |   Reducer 2 (437) MapPartitionsRDD[11] at mapPartitionsToPair at ReduceTran.java:41 []
             |   Reducer 2 (GROUP, 437) MapPartitionsRDD[10] at groupByKey at GroupByShuffler.java:31 []
             |   ShuffledRDD[9] at groupByKey at GroupByShuffler.java:31 []
             +-(0) Map 1 (0) MapPartitionsRDD[8] at mapPartitionsToPair at MapTran.java:41 []
                |  Map 1 (store_sales, 0) HadoopRDD[4] at hadoopRDD at SparkPlanGenerator.java:203 []
             |   Reducer 8 (437) MapPartitionsRDD[14] at mapPartitionsToPair at ReduceTran.java:41 []
             |   Reducer 8 (GROUP PARTITION-LEVEL SORT, 437) ShuffledRDD[13] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
             +-(0) Map 7 (0) MapPartitionsRDD[12] at mapPartitionsToPair at MapTran.java:41 []
                |  Map 7 (store_sales, 0) HadoopRDD[5] at hadoopRDD at SparkPlanGenerator.java:203 []
             |   Map 10 (0) MapPartitionsRDD[15] at mapPartitionsToPair at MapTran.java:41 []
             |   Map 10 (store, 0) HadoopRDD[6] at hadoopRDD at SparkPlanGenerator.java:203 []
         |   Map 11 (0) MapPartitionsRDD[20] at mapPartitionsToPair at MapTran.java:41 []
         |   Map 11 (item, 0) HadoopRDD[7] at hadoopRDD at SparkPlanGenerator.java:203 []
{code}
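
To make the naming scheme in the graph above concrete, here is a minimal sketch of how such a name could be assembled from the {{BaseWork}} name, an optional edge or input label, the partition count, and the cached flag. This is not the code from the patch: the class {{RddNameSketch}}, the method {{rddName}}, and the {{CACHED}} marker are all hypothetical, and the example graph does not show how a cached RDD is actually labeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not the actual Hive implementation. */
final class RddNameSketch {

    private RddNameSketch() {}

    /**
     * Builds a display name such as "Reducer 4 (PARTITION-LEVEL SORT, 166)".
     *
     * @param workName      name of the BaseWork, e.g. "Reducer 4"
     * @param edgeLabel     shuffle type or input table printed before the
     *                      partition count; null or empty if there is none
     * @param numPartitions number of partitions of the RDD
     * @param cached        whether the RDD is cached
     */
    static String rddName(String workName, String edgeLabel,
                          int numPartitions, boolean cached) {
        List<String> details = new ArrayList<>();
        if (edgeLabel != null && !edgeLabel.isEmpty()) {
            details.add(edgeLabel);
        }
        details.add(Integer.toString(numPartitions));
        if (cached) {
            // Assumption: the marker text is made up; the source does not
            // show how a cached RDD is labeled.
            details.add("CACHED");
        }
        return workName + " (" + String.join(", ", details) + ")";
    }

    public static void main(String[] args) {
        // Names matching the shapes seen in the example graph.
        System.out.println(rddName("Reducer 5", "SORT", 1, false));
        System.out.println(rddName("Reducer 4", null, 166, false));
        System.out.println(rddName("Map 1", "store_sales", 0, false));
    }
}
```

In Spark, a string like this would typically be attached to the RDD with {{setName}}; the RDD name is part of what {{RDD#toDebugString}} prints for each RDD, which is why the names show up in the graph.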

> Improve Spark Debug RDD Graph
> -----------------------------
>
>                 Key: HIVE-18368
>                 URL: https://issues.apache.org/jira/browse/HIVE-18368
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: Spark UI - Named RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edges should include information about the number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
