[ https://issues.apache.org/jira/browse/HIVE-10989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585436#comment-14585436 ]

Rui Li commented on HIVE-10989:
-------------------------------

Hi [~xuefuz], these flags should only be set for the MapWork that handles the 
big table, i.e. in this case the skewed data. Previously, we set the flags for 
all the MapWorks, including those for the small tables. That behavior was 
copied from MR, where there's only one MapWork (for the big table) and the 
small tables are processed in MapredLocalWork. So the 3rd part of the patch 
brings our implementation in line with the MR version.

Also, some performance data in case you want it: I tested joining the skewed 
data with 6 mappers (configured) vs. 2 mappers (default), and the runtime was 
31s vs. 43s. The improvement should be more pronounced on larger data.

> HoS can't control number of map tasks for runtime skew join [Spark Branch]
> --------------------------------------------------------------------------
>
>                 Key: HIVE-10989
>                 URL: https://issues.apache.org/jira/browse/HIVE-10989
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-10989.1-spark.patch
>
>
> Flags {{hive.skewjoin.mapjoin.map.tasks}} and 
> {{hive.skewjoin.mapjoin.min.split}} are used to control the number of map 
> tasks for the map join of runtime skew join. They work well for MR but have 
> no effect for Spark.
> This makes runtime skew join much less useful: we just end up with slow 
> mappers instead of slow reducers.
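
For context, a minimal session-level sketch of the settings involved in runtime skew join (the values here are illustrative, not recommendations):

```sql
-- Enable runtime skew join handling.
SET hive.optimize.skewjoin=true;
-- A join key is treated as skewed once it exceeds this many rows.
SET hive.skewjoin.key=100000;
-- Desired number of map tasks for the follow-up map join over skewed keys.
SET hive.skewjoin.mapjoin.map.tasks=6;
-- Lower bound on split size (bytes) so splits aren't made too small.
SET hive.skewjoin.mapjoin.min.split=33554432;
```

On MR the last two settings shape the splits fed to the follow-up map join; per this issue, they were ignored on Spark before the patch.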



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
