[
https://issues.apache.org/jira/browse/HIVE-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903562#action_12903562
]
Joydeep Sen Sarma commented on HIVE-1467:
-----------------------------------------
@Ning - what about skew?
> dynamic partitioning should cluster by partitions
> -------------------------------------------------
>
> Key: HIVE-1467
> URL: https://issues.apache.org/jira/browse/HIVE-1467
> Project: Hadoop Hive
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Assignee: Namit Jain
>
> (based on internal discussion with Ning). Dynamic partitioning should offer a
> mode where it clusters data by partition before writing out to each
> partition. This will reduce number of files. Details:
> 1. always use reducer stage
> 2. mapper sends to reducer based on partitioning column. ie. reducer =
> f(partition-cols)
> 3. f() can be made somewhat smart to:
> a. spread large partitions across multiple reducers - each mapper can
> maintain row count seen per partition - and then apply (whenever it sees a
> new row for a partition):
> * reducer = (row count / 64k) % numReducers
> Small partitions always go to one reducer. the larger the partition,
> the more the reducers. this prevents one reducer becoming bottleneck writing
> out one partition
> b. this still leaves the issue of very large number of splits. (64K rows
> from 10K mappers is pretty large). for this one can apply one slight
> modification:
> * reducer = (mapper-id/1024 + row-count/64k) % numReducers
> ie. - the first 1000 mappers always send the first 64K rows for one
> partition to the same reducer. the next 1000 send it to the next one. and so
> on.
> the constants 1024 and 64k are used just as an example. i don't know what the
> right numbers are. it's also clear that this is a case where we need hadoop
> to do only partitioning (and no sorting). this will be a useful feature to
> have in hadoop. that will reduce the overhead due to reducers.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.