[
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365753#comment-17365753
]
Marta Kuczora commented on HIVE-25257:
--------------------------------------
Pushed to master! Thanks a lot [~lpinter] for the review!
> Incorrect row order validation for query-based MAJOR compaction
> ---------------------------------------------------------------
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In the insert query of the query-based MAJOR compaction, there is this
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId,
> ROW__ID.rowId)".
> It validates that the rows are in the correct order. The validation is
> done by the GenericUDFValidateAcidSortOrder class, which assumes that the
> rows are in increasing order by bucketProperty, originalTransactionId and
> rowId.
> However, the rows should be ordered by originalTransactionId,
> bucketProperty and rowId; otherwise the delete deltas cannot be applied
> correctly. This is also the order in which the MR MAJOR compaction writes
> the rows, and the order used when the split groups are created for the
> query-based MAJOR compaction. The wrong assumption causes no issue as long
> as there is only one bucketProperty in the files, but as soon as there are
> multiple bucketProperties in the same file, the validation fails. This can
> be reproduced by running multiple merge statements one after another.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
> ID int,
> value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order
> of Acid rows detected for the rows:
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
> and
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
> at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order
> is originalTransactionId, bucketProperty, rowId.
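
The difference between the two orderings described above can be illustrated with a small standalone sketch. It is not the code of GenericUDFValidateAcidSortOrder: the RowKey class, the two comparators, the firstViolation helper and the sample values are all hypothetical and only mirror the description. The sketch also assumes that only a strictly decreasing pair counts as a violation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not Hive's API. */
final class AcidSortOrderSketch {

    /** Stand-in for the (writeId, bucketProperty, rowId) triple taken from ROW__ID. */
    static final class RowKey {
        final long writeId;
        final int bucketProperty;
        final long rowId;

        RowKey(long writeId, int bucketProperty, long rowId) {
            this.writeId = writeId;
            this.bucketProperty = bucketProperty;
            this.rowId = rowId;
        }
    }

    /** Order the validation assumed, per the description: bucketProperty first. */
    static final Comparator<RowKey> BUCKET_FIRST =
        Comparator.<RowKey>comparingInt(k -> k.bucketProperty)
            .thenComparingLong(k -> k.writeId)
            .thenComparingLong(k -> k.rowId);

    /** Order the description says is correct: writeId first. */
    static final Comparator<RowKey> WRITE_ID_FIRST =
        Comparator.<RowKey>comparingLong(k -> k.writeId)
            .thenComparingInt(k -> k.bucketProperty)
            .thenComparingLong(k -> k.rowId);

    /** Returns the index of the first row that sorts before its predecessor, or -1. */
    static int firstViolation(List<RowKey> rows, Comparator<RowKey> order) {
        for (int i = 1; i < rows.size(); i++) {
            if (order.compare(rows.get(i - 1), rows.get(i)) > 0) {
                return i;
            }
        }
        return -1;
    }

    /** Made-up rows, already sorted by writeId, bucketProperty, rowId. */
    static List<RowKey> sampleRows() {
        int bucketA = 536870912; // illustrative bucketProperty values,
        int bucketB = 536870913; // two of them in the same file
        return Arrays.asList(
            new RowKey(1, bucketA, 0),
            new RowKey(2, bucketA, 0),
            new RowKey(2, bucketB, 0),
            new RowKey(3, bucketA, 0),
            new RowKey(3, bucketB, 0));
    }

    public static void main(String[] args) {
        List<RowKey> rows = sampleRows();
        System.out.println("writeId-first check: " + firstViolation(rows, WRITE_ID_FIRST));
        System.out.println("bucket-first check:  " + firstViolation(rows, BUCKET_FIRST));
    }
}
```

With a single bucketProperty both checks accept the rows. With two bucketProperty values interleaved across write IDs, as in the sample, the bucket-first check rejects rows that are correctly ordered, which matches the failure reported in the issue.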