[ https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106358#comment-14106358 ]

Szehon Ho commented on HIVE-7384:
---------------------------------

1.  I thought that TotalOrderPartitioner was only used in the order-by case, 
when hive.optimize.sampling.orderby=true, and not for joins?  That's just my 
reading of it; I'll take a second look and update this if I'm wrong.  (A 
sketch of how that partitioner is typically set up in plain MapReduce is 
below, after point 2.)

2.  Auto-parallelism looks like a Tez feature that auto-calculates the number 
of reducers based on some input from Hive (upper/lower bounds).

Today in Spark we take numReducers from what the Hive query optimizer gives 
us, and during the shuffle stage we use it as the number of RDD partitions of 
the shuffle output (the reducer input); a sketch of the idea is below.  Spark 
has its own default if we don't set it explicitly (the doc says it's based on 
the number of partitions in the largest parent RDD): 
http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.  I don't 
know off the top of my head of any option for Spark to give us a better 
number, if that's what you're asking?
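
For context on point 1, here's a minimal sketch of how TotalOrderPartitioner 
is typically wired up in a plain MapReduce job for a global sort. This is not 
Hive's actual code path; the class name TotalOrderSketch, the sampler 
parameters, and the Text key/value types are all made up for illustration:

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSketch {
  // Configure a job so that partition i receives only keys that sort before
  // every key in partition i+1; concatenating the reducer outputs then
  // yields one globally sorted result.
  public static void configure(Job job, Path partitionFile) throws Exception {
    // Sample the input to choose range split points (illustrative
    // parameters: 1% sample rate, at most 1000 samples, 10 splits sampled).
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.01, 1000, 10);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    InputSampler.writePartitionFile(job, sampler);
    job.setPartitionerClass(TotalOrderPartitioner.class);
  }
}
{code}

That range-based routing is what makes it a fit for order-by (global sort) 
rather than for joins, which only need all rows with the same key to land on 
the same reducer.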
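
And on the numReducers point, a rough sketch of what I mean, using Spark's 
Java API. The helper class and method names are hypothetical, not actual 
Hive-on-Spark code:

{code}
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class ShuffleSketch {
  // Force the shuffle output into exactly numReducers partitions, so each
  // Spark partition plays the role of one Hive reducer. numReducers is
  // whatever the Hive query optimizer computed for the plan.
  public static <K, V> JavaPairRDD<K, V> shuffleForReducers(
      JavaPairRDD<K, V> mapOutput, int numReducers) {
    // Without an explicit partitioner here, Spark would fall back to its
    // default parallelism (per the tuning guide, derived from the number
    // of partitions in the largest parent RDD).
    return mapOutput.partitionBy(new HashPartitioner(numReducers));
  }
}
{code}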

> Research into reduce-side join [Spark Branch]
> ---------------------------------------------
>
>                 Key: HIVE-7384
>                 URL: https://issues.apache.org/jira/browse/HIVE-7384
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Szehon Ho
>         Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt, 
> sales_products.txt, sales_stores.txt
>
>
> Hive's join operator is very sophisticated, especially for reduce-side join. 
> While we expect that other types of join, such as map-side join and SMB 
> map-side join, will work out of the box with our design, there may be some 
> complications in reduce-side join, which extensively utilizes key tags and 
> shuffle behavior. Our design principle is to make the Hive implementation 
> work out of the box as well, which might require new functionality from 
> Spark. The task is to research this area, identifying requirements for the 
> Spark community and the work to be done on Hive to make reduce-side join 
> work. A design doc might be needed for this. For more information, please 
> refer to the overall design doc on the wiki.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
