[
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322561#comment-16322561
]
Sahil Takiar commented on HIVE-18368:
-------------------------------------
{quote} Can we get rid of code reference such as at
repartitionAndSortWithinPartitions at SortByShuffler.java:57. they don't seem
useful. {quote} I agree they aren't useful for HoS users. At first I thought
removing them wasn't possible, but there may be a way to do it, which would
be pretty cool. I'm working on a fix, but it may take another week to figure
out because the relevant APIs aren't really documented.
{quote} Can you clarify what's the format of an RDD specification as shown in
each line of the output. {quote} Take {{Reducer 5 (SORT, 1) ShuffledRDD\[24\]
at sortByKey at SortByShuffler.java:51 \[\]}} as an example:
* {{Reducer 5}} is the Hive {{BaseWork}} name
* {{SORT}} is the edge type (taken from the {{SparkEdgeProperty}})
* {{1}} is the number of partitions for the stage (taken from the
{{SparkEdgeProperty}})
* {{ShuffledRDD}} is the RDD type
* {{\[24\]}} is the RDD id
* {{sortByKey}} is the RDD transformation that created this RDD
* {{SortByShuffler.java:51}} is the source file and line number of the call that created this RDD
* {{\[\]}} I'm not sure what this is exactly
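
To make the field breakdown above concrete, here is a minimal sketch of a parser that splits one such line into its parts. {{RddLineParser}} and {{RddLine}} are hypothetical names used only for illustration and are not part of the patch; the regular expression assumes every line follows the exact layout of the example, which may not hold for all RDD types.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Hive): splits one line of the debug RDD graph into fields. */
public class RddLineParser {

  // Assumed layout: <work name> (<edge type>, <partitions>) <RDD type>[<id>] at <transformation> at <file:line> [<trailing>]
  private static final Pattern LINE = Pattern.compile(
      "^(.+?) \\((\\w+), (\\d+)\\) (\\w+)\\[(\\d+)\\] at (\\w+) at (\\S+:\\d+) \\[(.*)\\]$");

  /** Plain value holder for the parsed fields. */
  public static final class RddLine {
    public final String workName;       // Hive BaseWork name, e.g. "Reducer 5"
    public final String edgeType;       // edge type, e.g. "SORT"
    public final int partitions;        // number of partitions for the stage
    public final String rddType;        // RDD type, e.g. "ShuffledRDD"
    public final int rddId;             // RDD id
    public final String transformation; // RDD transformation that created the RDD
    public final String callSite;       // source file and line, e.g. "SortByShuffler.java:51"
    public final String trailing;       // content of the final brackets, kept verbatim

    RddLine(String workName, String edgeType, int partitions, String rddType,
            int rddId, String transformation, String callSite, String trailing) {
      this.workName = workName;
      this.edgeType = edgeType;
      this.partitions = partitions;
      this.rddType = rddType;
      this.rddId = rddId;
      this.transformation = transformation;
      this.callSite = callSite;
      this.trailing = trailing;
    }
  }

  /** Parses one line; throws if the line does not match the assumed layout. */
  public static RddLine parse(String line) {
    Matcher m = LINE.matcher(line.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Unrecognized RDD line: " + line);
    }
    return new RddLine(
        m.group(1), m.group(2), Integer.parseInt(m.group(3)), m.group(4),
        Integer.parseInt(m.group(5)), m.group(6), m.group(7), m.group(8));
  }

  public static void main(String[] args) {
    RddLine r = parse(
        "Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 []");
    System.out.println(r.workName + " -> " + r.rddType + "[" + r.rddId + "], "
        + r.partitions + " partition(s), edge " + r.edgeType);
  }
}
```

The trailing bracket content is captured as-is rather than interpreted, since its meaning is not established here.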
{quote} We can Skip SparkTran entirely, but need to have a clear mapping from
Work to RDD {quote} Yes, this is the main goal of this patch: an easy way to
map {{BaseWork}} objects to {{RDD}}s.
{quote} Why the num of partitions of MapInput is 0 {quote} That's just because
the job I ran didn't have any data in the underlying tables.
{quote} It seems confusing to have 2 RDDs having the same work name {quote}
Yes, I can play with the names a bit so it's clearer. I'm not sure
{{ShuffleTran}} is the best name. All the {{Tran}} objects are internal
implementation details of HoS that end users probably don't need to know about
(another reason why I removed the {{Tran}} graph).
I'll continue addressing the comments in the RB too. I hope to have an
updated patch sometime next week.
> Improve Spark Debug RDD Graph
> -----------------------------
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between
> different {{SparkTran}}, what shuffle types are used, and what trans are
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects,
> RDDs, and BaseWorks. Edges should include information about the number of
> partitions, shuffle types, Spark operations used, etc.