[
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-25258:
----------------------------------
Labels: pull-request-available (was: )
> Incorrect row order after query-based MINOR compaction
> ------------------------------------------------------
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The query based MINOR compaction uses the following sorting order in its
> inner query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in
> the
> [code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].
> But actually the rows should be ordered by originalTransactionId,
> bucketProperty and rowId, otherwise the delete deltas cannot be applied
> correctly. And this is the order what the MR MAJOR and MR MINOR compactions
> write.
> The sorting order used by the query-based MINOR compaction can lead to
> duplicated rows when running the compaction after multiple merge statements.
> This issue can be reproduced for example by running the following queries:
> {noformat}
> CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5,
> 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4,
> 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11,
> 'value_11'),(12, 'value_12');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10,
> 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');
> MERGE INTO transactions AS T
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_3 VALUES
> (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9,
> 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15,
> 'value_15');
> MERGE INTO transactions AS T
> USING merge_source_3 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> {noformat}
> Running a select after the second compaction finished will return duplicated
> rows:
> {noformat}
> select * from transactions order by id;
> +------------------+---------------------+
> | transactions.id | transactions.value |
> +------------------+---------------------+
> | 1 | newvalue_1 |
> | 1 | latestvalue_1 |
> | 2 | newestvalue_2 |
> | 2 | newvalue_2 |
> | 3 | value_03 |
> | 4 | latestvalue_4 |
> | 4 | newvalue_4 |
> | 5 | latestvalue_5 |
> | 6 | newvalue_6 |
> | 6 | newestvalue_6 |
> | 7 | value_07 |
> | 8 | value_08 |
> | 9 | latestvalue_9 |
> | 10 | newestvalue_10 |
> | 11 | latestvalue_11 |
> | 12 | value_12 |
> | 13 | latestvalue_13 |
> | 14 | value_14 |
> | 15 | value_15 |
> +------------------+---------------------+
> {noformat}
> If the same queries are run with MR MINOR compaction, instead of the
> query-based MINOR compaction, the select will return the correct result:
> {noformat}
> +------------------+---------------------+
> | transactions.id | transactions.value |
> +------------------+---------------------+
> | 1 | latestvalue_1 |
> | 2 | newestvalue_2 |
> | 3 | value_03 |
> | 4 | latestvalue_4 |
> | 5 | latestvalue_5 |
> | 6 | newestvalue_6 |
> | 7 | value_07 |
> | 8 | value_08 |
> | 9 | latestvalue_9 |
> | 10 | newestvalue_10 |
> | 11 | latestvalue_11 |
> | 12 | value_12 |
> | 13 | latestvalue_13 |
> | 14 | value_14 |
> | 15 | value_15 |
> +------------------+---------------------+
> {noformat}
> The content of the bucket files in the delta and delete delta directories
> after the query-based and MR compactions look like this.
> Query-based
> {noformat}
> Processing data file
> tmp/transactions/delta_0000001_0000004_v0000429/bucket_00000 [length: 947]
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":1,"row":{"id":1,"value":"value_01"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":1,"row":{"id":2,"value":"value_02"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":2,"currenttransaction":1,"row":{"id":3,"value":"value_03"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_04"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":1,"row":{"id":5,"value":"value_05"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":1,"row":{"id":6,"value":"value_06"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":6,"currenttransaction":1,"row":{"id":7,"value":"value_07"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":7,"currenttransaction":1,"row":{"id":8,"value":"value_08"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":{"id":9,"value":"value_9"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":{"id":10,"value":"value_10"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":2,"row":{"id":11,"value":"value_11"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":{"id":12,"value":"value_12"}}
> {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":3,"row":{"id":13,"value":"value_13"}}
> {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":{"id":14,"value":"value_14"}}
> {"operation":0,"originaltransaction":4,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":{"id":15,"value":"value_15"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":2,"row":{"id":1,"value":"newvalue_1"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":2,"row":{"id":2,"value":"newvalue_2"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":4,"value":"newvalue_4"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":2,"row":{"id":6,"value":"newvalue_6"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":10,"value":"newestvalue_10"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":11,"value":"newestvalue_11"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":4,"value":"newestvalue_4"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":4,"currenttransaction":3,"row":{"id":6,"value":"newestvalue_6"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":{"id":5,"value":"latestvalue_5"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":{"id":9,"value":"latestvalue_9"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":2,"currenttransaction":4,"row":{"id":1,"value":"latestvalue_1"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":{"id":13,"value":"latestvalue_13"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":4,"currenttransaction":4,"row":{"id":11,"value":"latestvalue_11"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":5,"currenttransaction":4,"row":{"id":4,"value":"latestvalue_4"}}
> Processing data file
> tmp/transactions/delete_delta_0000001_0000004_v0000429/bucket_00000 [length:
> 713]
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":null}
> _____________________________________________________________________________________________________________________
> {noformat}
> MR
> {noformat}
> Processing data file
> tmp/transactions/delta_0000001_0000004_v0000479/bucket_00000 [length: 1002]
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"value":"value_01"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":1,"row":{"id":2,"value":"value_02"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":2,"currentTransaction":1,"row":{"id":3,"value":"value_03"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":1,"row":{"id":4,"value":"value_04"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":1,"row":{"id":5,"value":"value_05"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":1,"row":{"id":6,"value":"value_06"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":6,"currentTransaction":1,"row":{"id":7,"value":"value_07"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":7,"currentTransaction":1,"row":{"id":8,"value":"value_08"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"id":9,"value":"value_9"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":{"id":10,"value":"value_10"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":2,"row":{"id":11,"value":"value_11"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":{"id":12,"value":"value_12"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":2,"row":{"id":1,"value":"newvalue_1"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":2,"row":{"id":2,"value":"newvalue_2"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":2,"row":{"id":4,"value":"newvalue_4"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":2,"row":{"id":6,"value":"newvalue_6"}}
> {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":13,"value":"value_13"}}
> {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"id":14,"value":"value_14"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":0,"currentTransaction":3,"row":{"id":10,"value":"newestvalue_10"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":{"id":11,"value":"newestvalue_11"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":{"id":4,"value":"newestvalue_4"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":4,"currentTransaction":3,"row":{"id":6,"value":"newestvalue_6"}}
> {"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":15,"value":"value_15"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":{"id":5,"value":"latestvalue_5"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":{"id":9,"value":"latestvalue_9"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":2,"currentTransaction":4,"row":{"id":1,"value":"latestvalue_1"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":{"id":13,"value":"latestvalue_13"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":4,"currentTransaction":4,"row":{"id":11,"value":"latestvalue_11"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":5,"currentTransaction":4,"row":{"id":4,"value":"latestvalue_4"}}
> _____________________________________________________________________________________________________________________
> Processing data file
> tmp/transactions/delete_delta_0000001_0000004_v0000479/bucket_00000 [length:
> 632]
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":null}
> _____________________________________________________________________________________________________________________
> {noformat}
> It can be seen that when the query-based MINOR compaction was used, the
> sorting order of the rows is "bucket, originalTransactionId, rowId". But
> when the MR MINOR compaction was used, the order is "originalTransactionId,
> bucket, rowId".
> The ordering in the query-based compaction has to be fixed to be aligned with
> the MR compaction's ordering.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)