[ 
https://issues.apache.org/jira/browse/HIVE-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-6455:
-----------------------------

    Attachment: HIVE-6455.7.patch

Added a fix that solved an issue with stats aggregation when stats aggregation 
key exceeds the max key prefix length. Fixed other failing tests.

> Scalable dynamic partitioning and bucketing optimization
> --------------------------------------------------------
>
>                 Key: HIVE-6455
>                 URL: https://issues.apache.org/jira/browse/HIVE-6455
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: optimization
>         Attachments: HIVE-6455.1.patch, HIVE-6455.1.patch, HIVE-6455.2.patch, 
> HIVE-6455.3.patch, HIVE-6455.4.patch, HIVE-6455.4.patch, HIVE-6455.5.patch, 
> HIVE-6455.6.patch, HIVE-6455.7.patch
>
>
> The current implementation of dynamic partition works by keeping at least one 
> record writer open per dynamic partition directory. In case of bucketing 
> there can be multispray file writers which further adds up to the number of 
> open record writers. The record writers of column oriented file format (like 
> ORC, RCFile etc.) keeps some sort of in-memory buffers (value buffer or 
> compression buffers) open all the time to buffer up the rows and compress 
> them before flushing it to disk. Since these buffers are maintained per 
> column basis the amount of constant memory that will required at runtime 
> increases as the number of partitions and number of columns per partition 
> increases. This often leads to OutOfMemory (OOM) exception in mappers or 
> reducers depending on the number of open record writers. Users often tune the 
> JVM heapsize (runtime memory) to get over such OOM issues. 
> With this optimization, the dynamic partition columns and bucketing columns 
> (in case of bucketed tables) are sorted before being fed to the reducers. 
> Since the partitioning and bucketing columns are sorted, each reducers can 
> keep only one record writer open at any time thereby reducing the memory 
> pressure on the reducers. This optimization is highly scalable as the number 
> of partition and number of columns per partition increases at the cost of 
> sorting the columns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to