[ 
https://issues.apache.org/jira/browse/HIVE-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18033689#comment-18033689
 ] 

Dmitriy Fingerman commented on HIVE-29272:
------------------------------------------

Hi [~kuczoram], can you please set affected version(s)?

> Query-based MINOR compaction should not consider minOpenWriteId
> ---------------------------------------------------------------
>
>                 Key: HIVE-29272
>                 URL: https://issues.apache.org/jira/browse/HIVE-29272
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>              Labels: pull-request-available
>
> In certain scenarios the query-based MINOR compaction produces an empty delta 
> directory. On ACID tables it is automatically cleaned up as if no compaction 
> had happened, but on insert-only tables it causes data loss.
> The issue occurs when the compacted table has both an aborted and an open 
> transaction.
> Let's walk through an example:
>  * Run an insert which creates delta_0000001_0000001 (writeId=1)
>  * Start an insert and abort the transaction (writeId=2)
>  * Run an insert which creates delta_0000003_0000003 (writeId=3)
>  * Run an insert which creates delta_0000004_0000004 (writeId=4), but before 
> it finishes, start the MINOR compaction
>  * When the compaction is finished the table will contain the following files:
> delta_0000001_0000003
> delta_0000004_0000004
> delta_0000004_0000004/000000_0
> delta_0000001_0000001
> delta_0000001_0000001/000000_0
> delta_0000003_0000003
> delta_0000003_0000003/000000_0
>  * It can be seen that the delta_0000001_0000003 directory (which was 
> produced by the compactor) is empty.
>  * When the Cleaner runs, it will remove delta_0000001_0000001 and 
> delta_0000003_0000003, so the data in them will be lost.
> This happens because of this check in the MINOR compaction:
>  
> {code:java}
> long minWriteID = validWriteIdList.getMinOpenWriteId() == null ? 1 : validWriteIdList.getMinOpenWriteId();
> long highWatermark = validWriteIdList.getHighWatermark();
> List<AcidUtils.ParsedDelta> deltas = dir.getCurrentDirectories().stream()
>     .filter(delta -> delta.isDeleteDelta() == isDeleteDelta
>         && delta.getMaxWriteId() <= highWatermark
>         && delta.getMinWriteId() >= minWriteID)
>     .collect(Collectors.toList());
> if (deltas.isEmpty()) {
>   query.setLength(0); // no alter query needed; clear the StringBuilder
>   return;
> } {code}
> If the table has both aborted and open transactions, minOpenWriteId will be 
> set; in the example it is 4.
> When the ValidCompactorWriteIdList is created in 
> TxnUtils.createValidCompactWriteIdList, the highWatermark is set to 
> minOpenWriteId - 1, which ensures that the compaction range stays below 
> minOpenWriteId.
> But the minor compaction code treats minOpenWriteId as a lower limit, so it 
> tries to compact deltas that are above this value. This is incorrect and looks 
> like a misunderstanding of what minOpenWriteId means.
> In the example the compaction should consider delta_1 and delta_3, but neither 
> of them satisfies the condition "delta.getMinWriteId() >= minWriteID", as 
> minWriteID = minOpenWriteId = 4 here.
> This check in the MINOR compaction code is not correct; I think it is safe to 
> drop the check against minOpenWriteId, as the highWatermark is already 
> adjusted to it.
>  
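> A minimal standalone sketch of this proposal (the Delta record below is a 
> simplified stand-in for AcidUtils.ParsedDelta, and selectDeltas is a 
> hypothetical helper, not the real Hive API): with the minWriteID lower bound 
> removed, the filter keeps every delta at or below the highWatermark, so 
> delta_1 and delta_3 from the example above would be compacted.
>
> {code:java}
> import java.util.List;
> import java.util.stream.Collectors;
>
> public class MinorCompactionFilterSketch {
>     // Simplified stand-in for AcidUtils.ParsedDelta.
>     record Delta(long minWriteId, long maxWriteId, boolean deleteDelta) {}
>
>     static List<Delta> selectDeltas(List<Delta> current, long highWatermark, boolean isDeleteDelta) {
>         // Proposed check: only the highWatermark upper bound; no lower bound
>         // against minOpenWriteId, since highWatermark is already minOpenWriteId - 1.
>         return current.stream()
>             .filter(d -> d.deleteDelta() == isDeleteDelta && d.maxWriteId() <= highWatermark)
>             .collect(Collectors.toList());
>     }
>
>     public static void main(String[] args) {
>         // Example from the description: writeId 2 aborted, writeId 4 open,
>         // so highWatermark = minOpenWriteId - 1 = 3.
>         List<Delta> dirs = List.of(
>             new Delta(1, 1, false),   // delta_0000001_0000001
>             new Delta(3, 3, false),   // delta_0000003_0000003
>             new Delta(4, 4, false));  // delta_0000004_0000004 (open txn)
>         // delta_1 and delta_3 pass the filter; delta_4 stays out.
>         System.out.println(selectDeltas(dirs, 3, false).size());  // prints 2
>     }
> }
> {code}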



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
