[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

Sahil Takiar (JIRA) Thu, 11 Jan 2018 09:06:13 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322561#comment-16322561
 ]


Sahil Takiar commented on HIVE-18368:
-------------------------------------

{quote} Can we get rid of code reference such as at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57. they don't seem 
useful. {quote} I agree they aren't useful for HoS users. At first I thought 
removing them wasn't possible, but there may be a way to do it, which would 
be pretty cool. I'm working on a fix, but it may take another week to figure 
out because the APIs aren't really documented.

{quote} Can you clarify what's the format of an RDD specification as shown in 
each line of the output. {quote} Take {{Reducer 5 (SORT, 1) ShuffledRDD\[24\] 
at sortByKey at SortByShuffler.java:51 \[\]}} as an example:

* {{Reducer 5}} is the Hive {{BaseWork}} name
* {{SORT}} is the edge type (taken from the {{SparkEdgeProperty}})
* {{1}} is the number of partitions for the stage (taken from the 
{{SparkEdgeProperty}})
* {{ShuffledRDD}} is the RDD type
* {{\[24\]}} is the RDD id
* {{sortByKey}} is the RDD transformation that created this RDD
* {{SortByShuffler.java:51}} is the source file and line number where this RDD 
was created
* {{\[\]}} I'm not sure what this is exactly
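
To make the field breakdown above concrete, here is a minimal sketch of a parser for one such line. It is not part of the patch: the {{RddSpec}} type, the {{parse_rdd_spec}} function, and the regular expression are hypothetical names introduced for illustration, and the pattern assumes every line has exactly the shape of the example (lines without an edge type, for instance, would simply not match). The trailing brackets are kept as an opaque string.

```python
import re
from typing import NamedTuple, Optional


class RddSpec(NamedTuple):
    """Fields of one line of the RDD graph output (hypothetical type)."""
    work_name: str       # Hive BaseWork name, e.g. "Reducer 5"
    edge_type: str       # edge type, e.g. "SORT"
    num_partitions: int  # number of partitions for the stage
    rdd_type: str        # RDD type, e.g. "ShuffledRDD"
    rdd_id: int          # RDD id
    transformation: str  # RDD transformation that created the RDD
    call_site: str       # source file and line number
    trailing: str        # contents of the final brackets, kept opaque


# Assumed layout, inferred only from the single example line:
#   <work name> (<edge type>, <partitions>) <RDD type>[<id>]
#   at <transformation> at <file>:<line> [<trailing>]
_SPEC_PATTERN = re.compile(
    r"^(?P<work>.+?) \((?P<edge>\w+), (?P<parts>\d+)\) "
    r"(?P<rdd_type>\w+)\[(?P<rdd_id>\d+)\] "
    r"at (?P<transformation>\w+) at (?P<call_site>\S+:\d+) "
    r"\[(?P<trailing>.*)\]$"
)


def parse_rdd_spec(line: str) -> Optional[RddSpec]:
    """Parse one RDD specification line; return None if it does not match."""
    match = _SPEC_PATTERN.match(line.strip())
    if match is None:
        return None
    return RddSpec(
        work_name=match.group("work"),
        edge_type=match.group("edge"),
        num_partitions=int(match.group("parts")),
        rdd_type=match.group("rdd_type"),
        rdd_id=int(match.group("rdd_id")),
        transformation=match.group("transformation"),
        call_site=match.group("call_site"),
        trailing=match.group("trailing"),
    )


example = ("Reducer 5 (SORT, 1) ShuffledRDD[24] "
           "at sortByKey at SortByShuffler.java:51 []")
spec = parse_rdd_spec(example)
```

Returning {{None}} for non-matching lines keeps the sketch honest about its scope: it only recognizes the one layout shown in this comment.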

{quote} We can Skip SparkTran entirely, but need to have a clear mapping from 
Work to RDD {quote} Yes, this is the main goal of this patch: an easy way to 
map {{BaseWork}} objects to {{RDD}}s.

{quote} Why the num of partitions of MapInput is 0 {quote} That's just because 
the job I ran didn't have any data in the underlying tables.

{quote} It seems confusing to have 2 RDDs having the same work name {quote} 
Yes, I can play with the names a bit so it's clearer. I'm not sure if 
{{ShuffleTran}} is the best name. All the {{Tran}} objects are internal 
implementation details of HoS that end users probably don't need to know about 
(another reason why I removed the {{Tran}} graph).

I'll continue working on addressing the comments in the RB too. Hope to have an 
updated patch sometime next week.

> Improve Spark Debug RDD Graph
> -----------------------------
>
>                 Key: HIVE-18368
>                 URL: https://issues.apache.org/jira/browse/HIVE-18368
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edges should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
