[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338446#comment-16338446 ]
Sahil Takiar commented on HIVE-18368:
-------------------------------------
[~xuefuz] thanks for the comments.

{quote}As to the duplication, is it possible that we name the call site
differently so it is less confusing, such as "In ReduceTran"{quote}

Yeah, we could, but I'm not sure it's very useful for users. Ideally, they
shouldn't need to understand what a ReduceTran is. The call site is also
displayed on the Completed-Stages.png page, so I think it's useful to have it
set to something like {{Reducer 2}}.

{quote}One thing unclear to me is the reason we changed the test case.{quote}

Whoops, I'll remove that.

> Improve Spark Debug RDD Graph
> -----------------------------
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch,
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between
> different {{SparkTran}}, what shuffle types are used, and what trans are
> cached. However, there is room for improvement.
> When debug logging is enabled, the RDD graph is logged, but there isn't much
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects,
> RDDs, and BaseWorks. Edges should include information about the number of
> partitions, shuffle types, Spark operations used, etc.
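
The kind of annotated edge the description asks for could be sketched roughly as below. This is a minimal, hypothetical model and not the actual {{SparkPlan}} code: the class {{PlanGraphSketch}}, its methods, the shuffle labels ({{GROUP_BY}}, {{SORT}}), and the node names other than {{Reducer 2}} are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names come from Hive's SparkPlan.
public class PlanGraphSketch {

    /** A directed edge between two plan nodes, annotated with shuffle details. */
    static final class Edge {
        final String from;
        final String to;
        final String shuffleType;  // placeholder label, not a real Hive constant
        final int numPartitions;

        Edge(String from, String to, String shuffleType, int numPartitions) {
            this.from = from;
            this.to = to;
            this.shuffleType = shuffleType;
            this.numPartitions = numPartitions;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    /** Adds an annotated edge and returns this graph so calls can be chained. */
    PlanGraphSketch connect(String from, String to, String shuffleType, int numPartitions) {
        edges.add(new Edge(from, to, shuffleType, numPartitions));
        return this;
    }

    /** Renders one line per edge, in insertion order. */
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Edge e : edges) {
            sb.append(String.format("%s --[%s, partitions=%d]--> %s\n",
                    e.from, e.shuffleType, e.numPartitions, e.to));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PlanGraphSketch graph = new PlanGraphSketch()
                .connect("Map 1", "Reducer 2", "GROUP_BY", 4)
                .connect("Reducer 2", "Reducer 3", "SORT", 1);
        // Prints:
        // Map 1 --[GROUP_BY, partitions=4]--> Reducer 2
        // Reducer 2 --[SORT, partitions=1]--> Reducer 3
        System.out.print(graph.render());
    }
}
```

Keeping the shuffle type and partition count on the edge, rather than on either node, matches the wording of the description, where that information belongs to the connection between two stages.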