[ 
https://issues.apache.org/jira/browse/HIVE-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234701#comment-16234701
 ] 

Prasanth Jayachandran commented on HIVE-17935:
----------------------------------------------

bq. Do you think the possible performance regression for some jobs to be large? 
Unfortunately, it is not quantifiable. The overhead is essentially sort + 
shuffle + spinning up new tasks for the reduce phase. If the partition column 
count is low and the data size is small, the regression factor will be 
completely different from the case with a large data set. 

> Turn on hive.optimize.sort.dynamic.partition by default
> -------------------------------------------------------
>
>                 Key: HIVE-17935
>                 URL: https://issues.apache.org/jira/browse/HIVE-17935
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Andrew Sherman
>            Assignee: Andrew Sherman
>            Priority: Major
>         Attachments: HIVE-17935.1.patch, HIVE-17935.2.patch
>
>
> The config option hive.optimize.sort.dynamic.partition is an optimization for 
> Hive’s dynamic partitioning feature. It was originally implemented in 
> [HIVE-6455|https://issues.apache.org/jira/browse/HIVE-6455]. With this 
> optimization, the dynamic partition columns and bucketing columns (in case of 
> bucketed tables) are sorted before being fed to the reducers. Since the 
> partitioning and bucketing columns are sorted, each reducer can keep only one 
> record writer open at any time thereby reducing the memory pressure on the 
> reducers. There were some early problems with this optimization and it was 
> disabled by default in HiveConf in 
> [HIVE-8151|https://issues.apache.org/jira/browse/HIVE-8151]. Since then 
> setting hive.optimize.sort.dynamic.partition=true has been used to solve 
> problems where dynamic partitioning produces (1) too many small files on 
> HDFS, which is bad for the cluster and can increase overhead for future Hive 
> queries over those partitions, and (2) OOM issues in the map tasks because 
> each task is trying to write to 100 different files simultaneously. 
> It now seems that the feature is probably mature enough that it can be 
> enabled by default.
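
The "one record writer open at any time" claim in the description above can be 
illustrated with a toy model. This is only a sketch and not Hive's actual 
code: the function peak_open_writers, the row shape, and the partition names 
are all hypothetical, and the model counts open writers rather than measuring 
memory. 

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, int]  # (partition value, payload); hypothetical row shape


def peak_open_writers(rows: Iterable[Row], close_on_key_change: bool) -> int:
    """Return the largest number of record writers open at once.

    With close_on_key_change=True the writer for the previous partition is
    closed as soon as a new partition value appears, which is only safe
    when the rows arrive sorted by partition value.
    """
    open_writers = set()
    peak = 0
    previous_key = None
    for key, _payload in rows:
        if close_on_key_change and previous_key is not None and key != previous_key:
            open_writers.discard(previous_key)
        open_writers.add(key)
        peak = max(peak, len(open_writers))
        previous_key = key
    return peak


# 100 distinct partition values, interleaved as they might arrive unsorted.
rows: List[Row] = [(f"p{i % 100:03d}", i) for i in range(1000)]

unsorted_peak = peak_open_writers(rows, close_on_key_change=False)
sorted_peak = peak_open_writers(sorted(rows), close_on_key_change=True)
print(unsorted_peak, sorted_peak)
```

In this model the unsorted input needs one writer per distinct partition value, 
while the sorted input never needs more than one. The price of that is the 
extra sort and shuffle discussed in the comment above, which the model does not 
account for. 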



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
