[
https://issues.apache.org/jira/browse/HIVE-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983503#comment-16983503
]
Peter Vary commented on HIVE-22538:
-----------------------------------
[~jcamachorodriguez]: Checked one of the failures: TestTxnCommands.testDeleteIn
The problem is with the contents of the delete_delta directories.
After the patch:
{code:java}
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d
[..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00000
Processing data file file:/tmp/bucket_00000 [length: 686]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d
[..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00001
Processing data file file:/tmp/bucket_00001 [length: 698]
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":1,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
{code}
Before the patch:
{code:java}
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d
[..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00000
Processing data file file:/tmp/bucket_00000 [length: 686]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d
[..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00001
Processing data file file:/tmp/bucket_00001 [length: 698]
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":1,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
{code}
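For readers unfamiliar with the {{bucket}} field in these dumps, here is a minimal sketch of how the two values can be decoded. It assumes the version 1 bit layout used by Hive's BucketCodec (top 3 bits for the codec version, 12 bits for the bucket id starting at bit 16, low 12 bits for the statement id). The helper name {{decode_bucket}} is made up for illustration and is not part of Hive.

```python
def decode_bucket(bucket_property: int) -> dict:
    """Split the encoded 'bucket' field into its parts (assumed V1 layout)."""
    return {
        # Top 3 bits: codec version.
        "version": (bucket_property >> 29) & 0b111,
        # 12 bits starting at bit 16: bucket (writer) id.
        "bucket_id": (bucket_property >> 16) & 0xFFF,
        # Low 12 bits: statement id within the transaction.
        "statement_id": bucket_property & 0xFFF,
    }


# The two values that appear in the dumps above.
for value in (536870912, 536936448):
    print(value, decode_bucket(value))
```

Under that assumption the two values correspond to bucket 0 and bucket 1, which matches the file names bucket_00000 and bucket_00001.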
*Notice the difference in the ordering of the rows!* After the patch, rowId 1
comes before rowId 0 in both bucket files. That is what causes the problem.
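A rough way to check the ordering mechanically is sketched below, using the bucket_00000 rows copied from the dumps above. It assumes that delete events are expected in ascending (originalTransaction, bucket, rowId) order; that key is my assumption about what the reader relies on, and the names {{sort_key}} and {{is_sorted}} are hypothetical helpers, not Hive code.

```python
import json

# Rows of delete_delta_0000003_0000003_0000/bucket_00000, copied from the dumps.
AFTER_PATCH = [
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}',
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}',
]
BEFORE_PATCH = [
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}',
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}',
]


def sort_key(event: dict) -> tuple:
    # Assumed ordering key for delete events.
    return (event["originalTransaction"], event["bucket"], event["rowId"])


def is_sorted(lines: list) -> bool:
    """Return True if the dumped rows are in ascending key order."""
    keys = [sort_key(json.loads(line)) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))


print("before patch sorted:", is_sorted(BEFORE_PATCH))
print("after patch sorted:", is_sorted(AFTER_PATCH))
```

The same check could be run on the bucket_00001 rows by pasting them in instead.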
The table is created by this command:
{code:java}
create table acidTbl(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true')
{code}
My understanding is that forcing a single reducer ensures that the rows are
ordered globally inside a bucket. With the patch, when there are multiple
reducers, we might end up concatenating the outputs of different reducers and
thus lose the ordering. (This last part of the sentence is more like a
question :))
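To make the suspicion concrete, here is a toy model of it, not Hive code: each reducer sorts only its own share of the rows, and the bucket file is the concatenation of the reducer outputs. The function {{write_bucket_file}} and the reducer assignment are invented for illustration, and whether Hive actually behaves like this after the patch is exactly the open question.

```python
def write_bucket_file(row_ids, assign_reducer, num_reducers):
    """Toy model: partition rows across reducers, sort per reducer, concatenate."""
    partitions = [[] for _ in range(num_reducers)]
    for row_id in row_ids:
        partitions[assign_reducer(row_id) % num_reducers].append(row_id)
    output = []
    for part in partitions:
        # Each reducer emits its own rows in sorted order.
        output.extend(sorted(part))
    return output


row_ids = [3, 0, 2, 1]

# One reducer: everything is sorted together, so the order is global.
single = write_bucket_file(row_ids, lambda r: 0, 1)

# Two reducers with a hypothetical assignment: each part is sorted,
# but the concatenation is not.
multi = write_bucket_file(row_ids, lambda r: (r + 1) % 2, 2)

print("single reducer:", single)
print("two reducers:  ", multi)
```

With two rows and two reducers this model can produce rowId 1 followed by rowId 0, which is the pattern seen in the after-patch dumps.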
Any other ideas on how to set the RS traits to force the desired behavior?
Thanks,
Peter
> RS deduplication does not always enforce
> hive.optimize.reducededuplication.min.reducer
> --------------------------------------------------------------------------------------
>
> Key: HIVE-22538
> URL: https://issues.apache.org/jira/browse/HIVE-22538
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22538.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> For transactional tables, that property might be overridden to 1, which can
> lead to merging final aggregation into a single stage (hence leading to
> performance degradation). For instance, when autogather column stats is
> enabled, this can happen for the following query:
> {code}
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> EXPLAIN
> CREATE TABLE x STORED AS ORC TBLPROPERTIES('transactional'='true') AS
> SELECT * FROM SRC x CLUSTER BY x.key;
> {code}