[
https://issues.apache.org/jira/browse/HIVE-25258?focusedWorklogId=613339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-613339
]
ASF GitHub Bot logged work on HIVE-25258:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 22/Jun/21 08:18
Start Date: 22/Jun/21 08:18
Worklog Time Spent: 10m
Work Description: kuczoram merged pull request #2410:
URL: https://github.com/apache/hive/pull/2410
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 613339)
Time Spent: 20m (was: 10m)
> Incorrect row order after query-based MINOR compaction
> ------------------------------------------------------
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The query based MINOR compaction uses the following sorting order in its
> inner query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in
> the
> [code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].
> But actually the rows should be ordered by originalTransactionId,
> bucketProperty and rowId, otherwise the delete deltas cannot be applied
> correctly. And this is the order what the MR MAJOR and MR MINOR compactions
> write.
> The sorting order used by the query-based MINOR compaction can lead to
> duplicated rows when running the compaction after multiple merge statements.
> This issue can be reproduced for example by running the following queries:
> {noformat}
> CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5,
> 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4,
> 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11,
> 'value_11'),(12, 'value_12');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10,
> 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');
> MERGE INTO transactions AS T
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_3 VALUES
> (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9,
> 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15,
> 'value_15');
> MERGE INTO transactions AS T
> USING merge_source_3 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> {noformat}
> Running a select after the second compaction finished will return duplicated
> rows:
> {noformat}
> select * from transactions order by id;
> +------------------+---------------------+
> | transactions.id | transactions.value |
> +------------------+---------------------+
> | 1 | newvalue_1 |
> | 1 | latestvalue_1 |
> | 2 | newestvalue_2 |
> | 2 | newvalue_2 |
> | 3 | value_03 |
> | 4 | latestvalue_4 |
> | 4 | newvalue_4 |
> | 5 | latestvalue_5 |
> | 6 | newvalue_6 |
> | 6 | newestvalue_6 |
> | 7 | value_07 |
> | 8 | value_08 |
> | 9 | latestvalue_9 |
> | 10 | newestvalue_10 |
> | 11 | latestvalue_11 |
> | 12 | value_12 |
> | 13 | latestvalue_13 |
> | 14 | value_14 |
> | 15 | value_15 |
> +------------------+---------------------+
> {noformat}
> If the same queries are run with MR MINOR compaction, instead of the
> query-based MINOR compaction, the select will return the correct result:
> {noformat}
> +------------------+---------------------+
> | transactions.id | transactions.value |
> +------------------+---------------------+
> | 1 | latestvalue_1 |
> | 2 | newestvalue_2 |
> | 3 | value_03 |
> | 4 | latestvalue_4 |
> | 5 | latestvalue_5 |
> | 6 | newestvalue_6 |
> | 7 | value_07 |
> | 8 | value_08 |
> | 9 | latestvalue_9 |
> | 10 | newestvalue_10 |
> | 11 | latestvalue_11 |
> | 12 | value_12 |
> | 13 | latestvalue_13 |
> | 14 | value_14 |
> | 15 | value_15 |
> +------------------+---------------------+
> {noformat}
> The content of the bucket files in the delta and delete delta directories
> after the query-based and MR compactions look like this.
> Query-based
> {noformat}
> Processing data file
> tmp/transactions/delta_0000001_0000004_v0000429/bucket_00000 [length: 947]
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":1,"row":{"id":1,"value":"value_01"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":1,"row":{"id":2,"value":"value_02"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":2,"currenttransaction":1,"row":{"id":3,"value":"value_03"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_04"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":1,"row":{"id":5,"value":"value_05"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":1,"row":{"id":6,"value":"value_06"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":6,"currenttransaction":1,"row":{"id":7,"value":"value_07"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":7,"currenttransaction":1,"row":{"id":8,"value":"value_08"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":{"id":9,"value":"value_9"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":{"id":10,"value":"value_10"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":2,"row":{"id":11,"value":"value_11"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":{"id":12,"value":"value_12"}}
> {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":3,"row":{"id":13,"value":"value_13"}}
> {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":{"id":14,"value":"value_14"}}
> {"operation":0,"originaltransaction":4,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":{"id":15,"value":"value_15"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":2,"row":{"id":1,"value":"newvalue_1"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":2,"row":{"id":2,"value":"newvalue_2"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":4,"value":"newvalue_4"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":2,"row":{"id":6,"value":"newvalue_6"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":10,"value":"newestvalue_10"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":11,"value":"newestvalue_11"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":4,"value":"newestvalue_4"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":4,"currenttransaction":3,"row":{"id":6,"value":"newestvalue_6"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":{"id":5,"value":"latestvalue_5"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":{"id":9,"value":"latestvalue_9"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":2,"currenttransaction":4,"row":{"id":1,"value":"latestvalue_1"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":{"id":13,"value":"latestvalue_13"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":4,"currenttransaction":4,"row":{"id":11,"value":"latestvalue_11"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":5,"currenttransaction":4,"row":{"id":4,"value":"latestvalue_4"}}
> Processing data file
> tmp/transactions/delete_delta_0000001_0000004_v0000429/bucket_00000 [length:
> 713]
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":null}
> _____________________________________________________________________________________________________________________
> {noformat}
> MR
> {noformat}
> Processing data file
> tmp/transactions/delta_0000001_0000004_v0000479/bucket_00000 [length: 1002]
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"value":"value_01"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":1,"row":{"id":2,"value":"value_02"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":2,"currentTransaction":1,"row":{"id":3,"value":"value_03"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":1,"row":{"id":4,"value":"value_04"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":1,"row":{"id":5,"value":"value_05"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":1,"row":{"id":6,"value":"value_06"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":6,"currentTransaction":1,"row":{"id":7,"value":"value_07"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":7,"currentTransaction":1,"row":{"id":8,"value":"value_08"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"id":9,"value":"value_9"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":{"id":10,"value":"value_10"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":2,"row":{"id":11,"value":"value_11"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":{"id":12,"value":"value_12"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":2,"row":{"id":1,"value":"newvalue_1"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":2,"row":{"id":2,"value":"newvalue_2"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":2,"row":{"id":4,"value":"newvalue_4"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":2,"row":{"id":6,"value":"newvalue_6"}}
> {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":13,"value":"value_13"}}
> {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"id":14,"value":"value_14"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":0,"currentTransaction":3,"row":{"id":10,"value":"newestvalue_10"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":{"id":11,"value":"newestvalue_11"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":{"id":4,"value":"newestvalue_4"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":4,"currentTransaction":3,"row":{"id":6,"value":"newestvalue_6"}}
> {"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":15,"value":"value_15"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":{"id":5,"value":"latestvalue_5"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":{"id":9,"value":"latestvalue_9"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":2,"currentTransaction":4,"row":{"id":1,"value":"latestvalue_1"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":{"id":13,"value":"latestvalue_13"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":4,"currentTransaction":4,"row":{"id":11,"value":"latestvalue_11"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":5,"currentTransaction":4,"row":{"id":4,"value":"latestvalue_4"}}
> _____________________________________________________________________________________________________________________
> Processing data file
> tmp/transactions/delete_delta_0000001_0000004_v0000479/bucket_00000 [length:
> 632]
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":null}
> _____________________________________________________________________________________________________________________
> {noformat}
> It can be seen that when the query-based MINOR compaction was used, the
> sorting order of the rows is "bucket, originalTransactionId, rowId". But
> when the MR MINOR compaction was used, the order is "originalTransactionId,
> bucket, rowId".
> The ordering in the query-based compaction has to be fixed to be aligned with
> the MR compaction's ordering.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)