[ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365753#comment-17365753
 ] 

Marta Kuczora commented on HIVE-25257:
--------------------------------------

Pushed to master! Thanks a lot [~lpinter] for the review!

> Incorrect row order validation for query-based MAJOR compaction
> ---------------------------------------------------------------
>
>                 Key: HIVE-25257
>                 URL: https://issues.apache.org/jira/browse/HIVE-25257
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This call validates that the rows are in the correct order. The validation is 
> done by the GenericUDFValidateAcidSortOrder class, which assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> However, the rows should be ordered by originalTransactionId, 
> bucketProperty and rowId; otherwise the delete deltas cannot be applied 
> correctly. This is also the order in which the MR MAJOR compaction writes 
> the rows and in which the split groups are created for the query-based MAJOR 
> compaction. The wrong assumption causes no issue as long as there is only 
> one bucketProperty in a file, but as soon as there are multiple 
> bucketProperties in the same file, the validation fails. This can be 
> reproduced by running multiple MERGE statements one after another.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
>  ID int,
>  value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
>  and 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
>       at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order 
> is originalTransactionId, bucketProperty, rowId.
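
To illustrate the difference between the two orderings described above, here is a minimal standalone sketch in Python. It is not Hive code: the RowId tuple, the key functions, first_violation and the sample values (including the small bucketProperty numbers 1 and 2) are all hypothetical, and it assumes the validation requires each row to be strictly greater than the previous one.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class RowId(NamedTuple):
    # Field names follow the terms used in the issue text; this is an
    # illustrative stand-in, not Hive's own row identifier class.
    original_transaction_id: int
    bucket_property: int
    row_id: int


def old_key(r: RowId) -> Tuple[int, int, int]:
    # Order assumed by the validation before the fix, per the issue text.
    return (r.bucket_property, r.original_transaction_id, r.row_id)


def correct_key(r: RowId) -> Tuple[int, int, int]:
    # Order in which the rows are actually written, per the issue text.
    return (r.original_transaction_id, r.bucket_property, r.row_id)


def first_violation(
    rows: List[RowId], key: Callable[[RowId], Tuple[int, int, int]]
) -> Optional[Tuple[RowId, RowId]]:
    """Return the first adjacent pair that is not strictly increasing
    under `key`, or None if the whole sequence is in order."""
    for prev, cur in zip(rows, rows[1:]):
        if key(cur) <= key(prev):
            return prev, cur
    return None


# Made-up rows of one file, written in (originalTransactionId,
# bucketProperty, rowId) order, with two different bucketProperty values.
rows = [
    RowId(1, 1, 0),
    RowId(1, 1, 1),
    RowId(2, 1, 0),
    RowId(2, 2, 0),
    RowId(3, 1, 0),
]

print(first_violation(rows, correct_key))
print(first_violation(rows, old_key))
```

With a single bucketProperty value both keys accept the sequence. In this sketch the old key only reports a violation once a second bucketProperty value appears in the same file, which matches the behaviour described in the issue.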



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
