[ 
https://issues.apache.org/jira/browse/HIVE-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721921#comment-16721921
 ] 

Eugene Koifman edited comment on HIVE-17206 at 12/15/18 1:01 AM:
-----------------------------------------------------------------

This would mean that either we have to change ROW__IDs during compaction, which 
we cannot do unless compaction is made to run under an X (exclusive) lock, or we 
have to break the relationship between the bucket_N file name and the 
ROW__ID.bucketid property of the rows in the file. Breaking that relationship 
would mean that all delete events have to be localized at the task rather than 
just those in delete_delta/bucket_N.  Since HIVE-19890, we only localize events 
from matching bucket files.
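
To make the trade-off concrete, here is a minimal sketch of the localization 
rule in Java. It is a simplified model, not Hive's actual classes: DeleteEvent, 
fileBucket, and eventsToLocalize are hypothetical names, and bucketId is treated 
as a plain bucket number, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model for illustration only; these are NOT Hive's actual classes.
final class DeleteEventLocalization {

    // A delete event names its target row by a ROW__ID-like triple.
    static final class DeleteEvent {
        final long writeId;
        final int bucketId;   // assumption: a plain bucket number
        final long rowId;
        final int fileBucket; // N of the delete_delta/bucket_N file storing this event

        DeleteEvent(long writeId, int bucketId, long rowId, int fileBucket) {
            this.writeId = writeId;
            this.bucketId = bucketId;
            this.rowId = rowId;
            this.fileBucket = fileBucket;
        }
    }

    // Returns the delete events that a task reading bucket_<taskBucket> must load.
    // If the file name is known to match ROW__ID.bucketid, only events stored in
    // the matching bucket file are needed; otherwise every event must be loaded.
    static List<DeleteEvent> eventsToLocalize(List<DeleteEvent> all,
                                              int taskBucket,
                                              boolean fileNameMatchesBucketId) {
        List<DeleteEvent> out = new ArrayList<>();
        for (DeleteEvent e : all) {
            if (!fileNameMatchesBucketId || e.fileBucket == taskBucket) {
                out.add(e);
            }
        }
        return out;
    }
}
```

With the invariant intact, a task touches only its own bucket's delete events. 
Once it is broken, the filter is no longer safe and the full set has to be read.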


> make a version of Compactor specific to unbucketed tables
> ---------------------------------------------------------
>
>                 Key: HIVE-17206
>                 URL: https://issues.apache.org/jira/browse/HIVE-17206
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Major
>
> The current Compactor will work but is not optimized or flexible enough.
> It is designed to generate a number of splits equal to 
> the number of buckets in the table; that is its degree of parallelism.
> For unbucketed tables, the same approach is used, but the "number of buckets" 
> is derived from the files found in the deltas.  For small writes, there will 
> likely be just one bucket_00000 file.  For large writes, the parallelism of 
> the write determines the number of output files.
> We need to make sure the Compactor can control parallelism for unbucketed 
> tables as it wishes, for example by hash partitioning all records (by 
> ROW__ID?) into N disjoint sets.
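
The hash-partitioning idea in the description could be sketched roughly as 
follows. This is only an illustration under stated assumptions: RowId is a 
hypothetical stand-in for ROW__ID modeled as a (writeId, bucketId, rowId) 
triple, and the hash function is an arbitrary choice, not something the issue 
prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative sketch only; not part of Hive's Compactor.
final class RowIdPartitioner {

    // Hypothetical stand-in for ROW__ID.
    static final class RowId {
        final long writeId;
        final int bucketId;
        final long rowId;

        RowId(long writeId, int bucketId, long rowId) {
            this.writeId = writeId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }
    }

    // Maps a row to one of n partitions. floorMod keeps the result in [0, n)
    // even when the hash code is negative.
    static int partitionOf(RowId id, int n) {
        int h = Objects.hash(id.writeId, id.bucketId, id.rowId);
        return Math.floorMod(h, n);
    }

    // Splits rows into n disjoint lists; every row lands in exactly one list.
    static List<List<RowId>> partition(List<RowId> rows, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<RowId>> parts = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        for (RowId r : rows) {
            parts.get(partitionOf(r, n)).add(r);
        }
        return parts;
    }
}
```

Because the partition depends only on the row's identifier, n can be chosen 
independently of how many bucket files the original writes produced.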



