[
https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730892#comment-17730892
]
Sourabh Badhya commented on HIVE-27332:
---------------------------------------
Thanks [~veghlaci05] and [~dkuzmenko] for the reviews.
> Add retry backoff mechanism for abort cleanup
> ---------------------------------------------
>
> Key: HIVE-27332
> URL: https://issues.apache.org/jira/browse/HIVE-27332
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sourabh Badhya
> Assignee: Sourabh Badhya
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> HIVE-27019 and HIVE-27020 added the functionality to directly clean data
> directories from aborted transactions without using Initiator & Worker.
> However, when cleanup fails continuously, a retry is initiated every single
> time. We need to add a retry backoff mechanism that controls how long to wait
> before retrying again, instead of retrying continuously.
> Broadly, there are 3 cases in which retry due to abort cleanup is impacted -
> *1. Abort cleanup on the table failed + Compaction on the table failed.*
> *2. Abort cleanup on the table failed + Compaction on the table passed.*
> *3. Abort cleanup on the table failed + No compaction on the table.*
> *Solution -*
> *We reuse the COMPACTION_QUEUE table to store the retry metadata -*
> *Advantage: Most of the fields needed for retry are already present in
> COMPACTION_QUEUE, so we can use it to store the retry metadata. A new
> compaction type called ABORT_CLEANUP ('c') is introduced. The CQ_STATE will
> remain "ready for cleaning" for such records.*
> *Actions performed by TaskHandler in the case of failure -*
> *AbortTxnCleaner -*
> Action: Add the retry details to the queue table when abort cleanup fails.
> *CompactionCleaner -*
> Action: If compaction on the same table is successful, delete the retry entry
> in markCleaned when removing any TXN_COMPONENTS entries except when there are
> no uncompacted aborts. We do not want to be in a situation where there is a
> queue entry for a table but there is no record in TXN_COMPONENTS associated
> with the same table.
> *Advantage: No performance issues are expected with this approach, since in
> most cases we delete only 1 record for the associated table/partition.*
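
For readers who want to see the idea in code, the retry backoff and the two
handler actions described above can be sketched roughly as follows. This is
only an illustration, not the code from the patch: the class, method and field
names are hypothetical, the in-memory map merely stands in for the retry rows
in COMPACTION_QUEUE, and the doubling delay with an upper cap is an assumed
backoff policy.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical.
final class AbortCleanupRetrySketch {

    /** Stand-in for a retry row of type ABORT_CLEANUP ('c'). */
    static final class RetryEntry {
        final int attempts;
        final long nextRetryAtMillis;

        RetryEntry(int attempts, long nextRetryAtMillis) {
            this.attempts = attempts;
            this.nextRetryAtMillis = nextRetryAtMillis;
        }
    }

    // In-memory stand-in for the queue table, keyed by table/partition name.
    private final Map<String, RetryEntry> queue = new HashMap<>();
    private final long baseMillis;
    private final long maxMillis;

    AbortCleanupRetrySketch(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Assumed policy: delay doubles per attempt (1-based), capped at maxMillis. */
    static long backoffMillis(long baseMillis, int attempt, long maxMillis) {
        long delay = baseMillis;
        for (int i = 1; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // the next doubling would exceed the cap
            }
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** AbortTxnCleaner path: record the retry details when abort cleanup fails. */
    void onAbortCleanupFailure(String table, long nowMillis) {
        RetryEntry prev = queue.get(table);
        int attempts = (prev == null) ? 1 : prev.attempts + 1;
        long delay = backoffMillis(baseMillis, attempts, maxMillis);
        queue.put(table, new RetryEntry(attempts, nowMillis + delay));
    }

    /** CompactionCleaner path: a successful compaction removes the retry entry. */
    void onCompactionCleaned(String table) {
        queue.remove(table);
    }

    /** A table is retried only once its backoff window has elapsed. */
    boolean isEligibleForCleanup(String table, long nowMillis) {
        RetryEntry entry = queue.get(table);
        return entry == null || nowMillis >= entry.nextRetryAtMillis;
    }

    public static void main(String[] args) {
        AbortCleanupRetrySketch sketch = new AbortCleanupRetrySketch(1000L, 8000L);
        sketch.onAbortCleanupFailure("db.tbl", 0L);
        System.out.println(sketch.isEligibleForCleanup("db.tbl", 500L));  // false
        System.out.println(sketch.isEligibleForCleanup("db.tbl", 1000L)); // true
    }
}
```

With a base delay of 1000 ms and a cap of 8000 ms, a table that keeps failing
would be retried after 1000, 2000, 4000 and then 8000 ms, rather than on every
pass. The sketch drops the retry entry unconditionally after a successful
compaction; the exception for "no uncompacted aborts" mentioned in the
description is not modelled here.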