[
https://issues.apache.org/jira/browse/HIVE-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Végh resolved HIVE-26674.
--------------------------------
Fix Version/s: 4.0.0
Target Version/s: 4.0.0
Resolution: Fixed
> REBALANCE type compaction
> -------------------------
>
> Key: HIVE-26674
> URL: https://issues.apache.org/jira/browse/HIVE-26674
> Project: Hive
> Issue Type: Improvement
> Reporter: László Végh
> Assignee: László Végh
> Priority: Major
> Labels: compaction
> Fix For: 4.0.0
>
>
> h2. Problem statement:
> Without explicit bucketing defined, the bucket files are very sensitive to
> the amount of data loaded or modified in the table.
> Data loaded into non-explicitly bucketed full-acid ORC tables can lead to
> unbalanced buckets over time when
> * initial or larger time-window loads or reloads run beside smaller load
> schedules (like initial and monthly vs. daily loads),
> * or the load schedule is periodic but the volume of the data changes is
> not constant,
> * or data volume and periodicity are both balanced, but the available
> runtime resources make the loader application run with a different number
> of tasks.
> The number of buckets is calculated from the amount of data to be loaded. If
> the table is created with a huge amount of initial data (which will create
> several buckets), and then only a few records are added to it (which will be
> written only into the first 1-2 buckets), but frequently, the result will be
> that the data is unbalanced within the buckets. The first few buckets will
> contain much more data than the others.
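> The skew described above can be illustrated with a toy model. This is only a
> sketch: the function name, the row counts, and the assumption that every
> small load is written by a single task into the first bucket are
> illustrative and are not taken from Hive.

```python
def simulate_loads(num_buckets, initial_rows, small_load_rows, small_loads):
    """Toy model: row counts per implicit bucket after one big load
    followed by many small loads."""
    # Assumption: the big initial load spreads evenly over all buckets.
    buckets = [initial_rows // num_buckets] * num_buckets
    # Assumption: each small load runs as a single task, so all of its
    # rows land in the first bucket.
    for _ in range(small_loads):
        buckets[0] += small_load_rows
    return buckets


buckets = simulate_loads(
    num_buckets=8, initial_rows=8_000, small_load_rows=100, small_loads=365
)
skew = max(buckets) / min(buckets)
print(buckets, skew)
```

> In this model one year of small daily loads leaves the first bucket many
> times larger than the remaining ones, which is the situation a rebalance is
> meant to repair.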
> h2. Concept:
> h4. Rebalancing compaction
> A new compaction type (‘REBALANCE’) should be created to address badly
> balanced data among buckets. This compaction type would produce a table like
> the one an INSERT-OVERWRITE would lead to: a new base, with bucket indexes
> independent of the previous base or deltas. The new number of buckets can
> optionally be supplied; otherwise the table keeps the same number of
> buckets, but with re-balanced data.
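> The intended effect can be sketched on in-memory lists. This is a minimal
> illustration, not the compactor's implementation: the round-robin
> redistribution and the name rebalance() are assumptions made for the
> example.

```python
def rebalance(buckets, new_bucket_count=None):
    """Redistribute all rows evenly; keep the bucket count unless a new
    one is supplied."""
    target = len(buckets) if new_bucket_count is None else new_bucket_count
    if target < 1:
        raise ValueError("bucket count must be at least 1")
    # Flatten the old buckets, then deal the rows out round-robin so the
    # new bucket sizes differ by at most one row.
    rows = [row for bucket in buckets for row in bucket]
    return [rows[i::target] for i in range(target)]


skewed = [list(range(10)), [10], [11]]
same_count = rebalance(skewed)
four_buckets = rebalance(skewed, new_bucket_count=4)
print([len(b) for b in same_count], [len(b) for b in four_buckets])
```

> No row is lost or duplicated; only the assignment of rows to buckets
> changes.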
> h4. Sorting
> Optionally, a sorting expression can be supplied, to be able to re-sort the
> data during the rebalance.
> The expression can be supplied in two ways:
> * Via the ALTER TABLE COMPACT:
> ALTER TABLE <table> COMPACT 'REBALANCE' ORDER BY <column> ASC|DESC
> h4. Manual rebalance
> The rebalance request can be created by using the ALTER TABLE COMPACT command
> (i.e. as a manual compaction).
> h4. Limitations
> * Rebalancing can be done only within partitions.
> * Rebalancing is not possible on explicitly bucketed (clustered) tables.
> * Rebalancing is not possible via MR based compaction.
> * Rebalancing is not supported on insert-only tables.
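> The table-level limitations above could be checked before a request is
> accepted. The following is a hedged sketch: TableInfo and
> rebalance_rejection_reason() are hypothetical names for illustration and do
> not correspond to Hive classes or methods.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableInfo:
    """Hypothetical table descriptor; not a Hive class."""
    insert_only: bool
    explicitly_bucketed: bool


def rebalance_rejection_reason(table: TableInfo, engine: str) -> Optional[str]:
    """Return why a rebalance request must be rejected, or None if allowed."""
    if table.explicitly_bucketed:
        return "explicitly bucketed (clustered) tables cannot be rebalanced"
    if table.insert_only:
        return "insert-only tables are not supported"
    if engine.upper() == "MR":
        return "MR based compaction cannot rebalance"
    return None


acid_table = TableInfo(insert_only=False, explicitly_bucketed=False)
print(rebalance_rejection_reason(acid_table, "query"))
```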
> h2. Implications
> h4. Compaction request (DB schema) changes
> * A new compaction type (REBALANCE) must be added to the allowed compaction
> TYPES.
> * A new optional field (and nullable DB column) is required to store the
> number of requested implicit buckets.
> h4. ALTER TABLE COMPACT changes
> The ALTER TABLE COMPACT command must accept the
> * ‘REBALANCE’ compaction type,
> * optionally, the new number of required buckets (... INTO {N} BUCKETS),
> * optionally, the sorting expression (ORDER BY column ASC, columnB DESC).
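> How the optional clauses might combine can be sketched with a small
> statement builder. This is an assumption-laden illustration: the helper name
> is hypothetical, the clause order and the INTO ... BUCKETS wording follow
> this proposal together with Hive's usual ALTER TABLE <table> COMPACT
> '<type>' form, and the grammar that is finally released may differ. The
> helper does no identifier escaping.

```python
def build_rebalance_statement(table, buckets=None, order_by=None):
    """Assemble an ALTER TABLE ... COMPACT 'REBALANCE' statement as
    described in this proposal (syntax is an assumption)."""
    parts = [f"ALTER TABLE {table} COMPACT 'REBALANCE'"]
    if buckets is not None:
        if buckets < 1:
            raise ValueError("bucket count must be at least 1")
        parts.append(f"INTO {buckets} BUCKETS")
    if order_by:
        for _, direction in order_by:
            if direction not in ("ASC", "DESC"):
                raise ValueError(f"unknown sort direction: {direction}")
        columns = ", ".join(f"{col} {direction}" for col, direction in order_by)
        parts.append(f"ORDER BY {columns}")
    return " ".join(parts)


plain = build_rebalance_statement("sales")
full = build_rebalance_statement(
    "sales", buckets=4, order_by=[("id", "ASC"), ("ts", "DESC")]
)
print(plain)
print(full)
```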
> h4. Compactor changes
> Both the MR and query based compaction tasks must be enhanced with the
> ability to do a rebalancing compaction.
> h4. Query based compaction changes
> New compactor implementations are required:
> * Query based rebalance compactor for fully acid tables
> h4. MR based compaction changes
> MR is deprecated; rebalancing compaction will be implemented for it only if
> that is really easy to do.
> h2. Open points
--
This message was sent by Atlassian Jira
(v8.20.10#820010)