[
https://issues.apache.org/jira/browse/HIVE-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eugene Koifman updated HIVE-17206:
----------------------------------
Description:
The current Compactor will work, but it is not optimized or flexible enough.
The current compactor is designed to generate a number of splits equal to the
number of buckets in the table; that is the degree of parallelism.
For unbucketed tables the same approach is used, but the "number of buckets" is
derived from the files found in the deltas. For small writes there will likely
be just one bucket_00000 file; for large writes, the parallelism of the write
determines the number of output files.
We need to make sure the Compactor can control parallelism for unbucketed
tables as it wishes, for example by hash partitioning all records (by ROW__ID?)
into N disjoint sets.
was:current Compactor will work but is not optimized/flexible enough
> make a version of Compactor specific to unbucketed tables
> ---------------------------------------------------------
>
> Key: HIVE-17206
> URL: https://issues.apache.org/jira/browse/HIVE-17206
> Project: Hive
> Issue Type: Sub-task
> Components: Transactions
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Major
>
> The current Compactor will work, but it is not optimized or flexible enough.
> The current compactor is designed to generate a number of splits equal to
> the number of buckets in the table; that is the degree of parallelism.
> For unbucketed tables the same approach is used, but the "number of buckets"
> is derived from the files found in the deltas. For small writes there will
> likely be just one bucket_00000 file; for large writes, the parallelism of
> the write determines the number of output files.
> We need to make sure the Compactor can control parallelism for unbucketed
> tables as it wishes, for example by hash partitioning all records (by
> ROW__ID?) into N disjoint sets.
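The hash-partitioning idea above can be sketched as follows. This is a minimal illustration, not the actual Hive implementation: the RowId class is a hypothetical stand-in for Hive's ACID ROW__ID (writeId, bucketId, rowId), and the partition count N is chosen by the caller rather than derived from delta files.

```java
import java.util.Objects;

public class RowIdPartitioner {

    // Hypothetical stand-in for Hive's ROW__ID pseudo-column.
    static final class RowId {
        final long writeId;
        final int bucketId;
        final long rowId;

        RowId(long writeId, int bucketId, long rowId) {
            this.writeId = writeId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(writeId, bucketId, rowId);
        }
    }

    // Map a ROW__ID onto one of numTasks disjoint sets. Every record lands
    // in exactly one set, so the compactor controls its own degree of
    // parallelism independently of how many bucket_N files the deltas hold.
    static int partition(RowId id, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hashes.
        return Math.floorMod(id.hashCode(), numTasks);
    }
}
```

Because the assignment depends only on the ROW__ID hash, the same record always maps to the same task, which keeps the N output sets disjoint regardless of how the input files were laid out.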
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)