[
https://issues.apache.org/jira/browse/HIVE-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721921#comment-16721921
]
Eugene Koifman edited comment on HIVE-17206 at 12/15/18 1:01 AM:
-----------------------------------------------------------------
this would mean that either we have to change ROW__IDs during compaction - which
we cannot do unless compaction is made to run under an X lock - or we have to
break the relationship between the bucket_N file name and the ROW__ID.bucketid
property of the rows in that file - which would mean all delete events have to
be localized at each task, rather than just those in the matching
delete_delta/bucket_N. Since HIVE-19890 we only localize events from matching
bucket files
> make a version of Compactor specific to unbucketed tables
> ---------------------------------------------------------
>
> Key: HIVE-17206
> URL: https://issues.apache.org/jira/browse/HIVE-17206
> Project: Hive
> Issue Type: Sub-task
> Components: Transactions
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Priority: Major
>
> The current Compactor will work, but it is not optimized/flexible enough.
> The current compactor is designed to generate the number of splits equal to
> the number of buckets in the table. That is the degree of parallelism.
> For unbucketed tables, the same is used but the "number of buckets" is
> derived from the files found in the deltas. For small writes, there will
> likely be just 1 bucket_00000 file. For large writes, the parallelism of the
> write determines the number of output files.
> Need to make sure the Compactor can control parallelism for unbucketed tables
> as it sees fit; for example, by hash-partitioning all records (by ROW__ID?)
> into N disjoint sets.
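The hash-partitioning idea from the description can be sketched as a minimal Java example. This is an illustration only, not Hive's actual Compactor code: `RowId` is a hypothetical stand-in for Hive's ROW__ID struct (writeId, bucketId, rowId), and the hash function is an assumption.

```java
import java.util.*;

public class RowIdPartitioner {
    // Hypothetical stand-in for Hive's ROW__ID virtual column (not Hive's API).
    record RowId(long writeId, int bucketId, long rowId) {}

    // Assign a record to one of n disjoint partitions by hashing its ROW__ID.
    static int partition(RowId id, int n) {
        // floorMod keeps the result in [0, n) even when the hash is negative.
        return Math.floorMod(Objects.hash(id.writeId(), id.bucketId(), id.rowId()), n);
    }

    public static void main(String[] args) {
        int n = 4; // desired degree of compaction parallelism
        List<RowId> rows = List.of(
                new RowId(1, 0, 0), new RowId(1, 0, 1), new RowId(2, 3, 7));
        // Group rows into n disjoint sets; each set would be one compaction task.
        Map<Integer, List<RowId>> sets = new HashMap<>();
        for (RowId r : rows) {
            sets.computeIfAbsent(partition(r, n), k -> new ArrayList<>()).add(r);
        }
        System.out.println(sets);
    }
}
```

Because the same ROW__ID always hashes to the same partition, every insert event and its matching delete event land in the same set, which is what makes the sets independently compactable.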
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)