[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338053#comment-16338053
 ] 

Sahil Takiar commented on HIVE-18368:
-------------------------------------

[~xuefuz], [~lirui] thoughts on the updated patch? The latest screenshots are 
attached.

There is one weird thing I wasn't able to fix. You'll notice that the 
Stage-DAG.png images look like they have duplicate info - e.g.:

{code}
Reducer 2 (400) [7]
Reducer 2
{code}

The first line is the RDD name, the second line is the RDD call site. Living 
with this duplicate metadata is necessary to get the results in 
Completed-Stages.png.

Spark distinguishes between RDD names and RDD call sites. By default, the RDD 
call site is the line of code that created the RDD. However, the call site can 
be overridden for each RDD.

In Spark, each stage is described by the call site of the final RDD in the 
stage (e.g. what you see in Completed-Stages.png).
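
To make the name vs. call site distinction concrete, here is a minimal toy 
model in plain Java. It does not use Spark and is not the code in the patch: 
{{ToyRdd}}, {{describeStage}} and {{describeDagNode}} are hypothetical names 
made up for illustration. It only mirrors the behaviour described above, i.e. a 
DAG node shows both labels while a stage shows the call site of its final RDD. 
(In Spark itself, the name is set with {{RDD.setName}} and the call site with 
{{SparkContext.setCallSite}}.)

```java
import java.util.Arrays;
import java.util.List;

public class StageLabelSketch {

    /** Hypothetical stand-in for an RDD: it carries only the two labels. */
    static final class ToyRdd {
        final String name;      // e.g. "Reducer 2 (400) [7]"
        final String callSite;  // defaults to the creating line of code unless overridden

        ToyRdd(String name, String callSite) {
            this.name = name;
            this.callSite = callSite;
        }
    }

    /** A stage is described by the call site of the final RDD in the stage. */
    static String describeStage(List<ToyRdd> rddsInStage) {
        return rddsInStage.get(rddsInStage.size() - 1).callSite;
    }

    /** A DAG node shows the RDD name on the first line and the call site on the second. */
    static String describeDagNode(ToyRdd rdd) {
        return rdd.name + "\n" + rdd.callSite;
    }

    public static void main(String[] args) {
        // Call site overridden to a readable label instead of a source line.
        ToyRdd reducer = new ToyRdd("Reducer 2 (400) [7]", "Reducer 2");

        System.out.println(describeDagNode(reducer));
        System.out.println(describeStage(Arrays.asList(reducer)));
    }
}
```

If the call site were left at its default, the stage description would be a 
source line rather than "Reducer 2", which is why the override (and the 
resulting duplicate-looking DAG label) is needed.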

> Improve Spark Debug RDD Graph
> -----------------------------
>
>                 Key: HIVE-18368
>                 URL: https://issues.apache.org/jira/browse/HIVE-18368
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edges should include information about the number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
