[jira] [Commented] (HIVE-22538) RS deduplication does not always enforce hive.optimize.reducededuplication.min.reducer

Peter Vary (Jira) Wed, 27 Nov 2019 05:29:07 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983503#comment-16983503
 ]


Peter Vary commented on HIVE-22538:
-----------------------------------

[~jcamachorodriguez]: I checked one of the failures, TestTxnCommands.testDeleteIn.

The problem is with the contents of the delete_delta directories.

After the patch:
{code:java}
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00000
Processing data file file:/tmp/bucket_00000 [length: 686]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________

[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00001
Processing data file file:/tmp/bucket_00001 [length: 698]
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":1,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
 {code}
Before the patch:
{code:java}
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00000
Processing data file file:/tmp/bucket_00000 [length: 686]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________

[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00001
Processing data file file:/tmp/bucket_00001 [length: 698]
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":1,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
 {code}
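*Notice the difference in the ordering of the rows!* Before the patch, each 
bucket file lists rowId 0 and then rowId 1; after the patch, rowId 1 comes 
first. That is what causes the problem.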
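
To make the difference concrete, here is a minimal Java sketch that checks the 
two bucket_00000 dumps above for ascending order. It assumes that readers of a 
delete delta expect events sorted ascending by (originalTransaction, bucket, 
rowId); that sort key, the class DeleteEventOrderSketch, and all of its members 
are illustrative assumptions and not Hive code.

```java
import java.util.Comparator;
import java.util.List;

final class DeleteEventOrderSketch {
    /** One delete event, using the field names printed by the orcfiledump output. */
    static final class DeleteEvent {
        final long originalTransaction;
        final int bucket;
        final long rowId;

        DeleteEvent(long originalTransaction, int bucket, long rowId) {
            this.originalTransaction = originalTransaction;
            this.bucket = bucket;
            this.rowId = rowId;
        }
    }

    // Assumed sort key: (originalTransaction, bucket, rowId), all ascending.
    static final Comparator<DeleteEvent> KEY =
        Comparator.comparingLong((DeleteEvent e) -> e.originalTransaction)
                  .thenComparingInt(e -> e.bucket)
                  .thenComparingLong(e -> e.rowId);

    // bucket_00000 as dumped after the patch: rowId 1, then rowId 0.
    static final List<DeleteEvent> AFTER_PATCH = List.of(
        new DeleteEvent(2, 536870912, 1),
        new DeleteEvent(2, 536870912, 0));

    // bucket_00000 as dumped before the patch: rowId 0, then rowId 1.
    static final List<DeleteEvent> BEFORE_PATCH = List.of(
        new DeleteEvent(2, 536870912, 0),
        new DeleteEvent(2, 536870912, 1));

    /** Returns true if no event sorts before its predecessor under KEY. */
    static boolean isSorted(List<DeleteEvent> events) {
        for (int i = 1; i < events.size(); i++) {
            if (KEY.compare(events.get(i - 1), events.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("before patch sorted: " + isSorted(BEFORE_PATCH));
        System.out.println("after patch sorted:  " + isSorted(AFTER_PATCH));
    }
}
```

Under that assumed key, only the before-the-patch file passes the check.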

The table is created by this command:
{code:java}
create table acidTbl(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true')
{code}
My understanding is that forcing a single reducer ensures that the rows are 
ordered globally within each bucket. With the patch, when there are multiple 
reducers, we might end up concatenating the results of different reducers and 
thus losing that ordering. (That last part is more of a question than a 
statement :))
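
If that guess is right, the effect can be illustrated with a small Java sketch. 
It is only a model of the suspected behavior, not Hive code: the class 
ReducerConcatSketch, its methods, and the way rowIds are split between two 
reducers are all hypothetical. Each reducer sorts its own rows, but writing 
the per-reducer outputs one after the other does not give a globally sorted 
file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReducerConcatSketch {
    // Hypothetical split: two reducers each receive one rowId of the same bucket.
    static final List<List<Long>> PER_REDUCER = List.of(List.of(1L), List.of(0L));

    /** Sorts each reducer's rows separately, then appends the outputs in reducer order. */
    static List<Long> concatSortedOutputs(List<List<Long>> perReducer) {
        List<Long> file = new ArrayList<>();
        for (List<Long> part : perReducer) {
            List<Long> sortedPart = new ArrayList<>(part);
            Collections.sort(sortedPart);
            file.addAll(sortedPart);
        }
        return file;
    }

    /** Models a single reducer: all rows are gathered first and sorted once. */
    static List<Long> singleReducer(List<List<Long>> perReducer) {
        List<Long> all = new ArrayList<>();
        for (List<Long> part : perReducer) {
            all.addAll(part);
        }
        Collections.sort(all);
        return all;
    }

    /** Returns true if the rowIds never decrease. */
    static boolean isAscending(List<Long> rowIds) {
        for (int i = 1; i < rowIds.size(); i++) {
            if (rowIds.get(i - 1) > rowIds.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("multiple reducers: " + concatSortedOutputs(PER_REDUCER));
        System.out.println("single reducer:    " + singleReducer(PER_REDUCER));
    }
}
```

In this model the concatenated output has rowId 1 before rowId 0, which matches 
the after-the-patch dump, while the single-reducer output is ascending.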

Any other ideas on how to set the RS traits to force the desired behavior?

Thanks,

Peter

> RS deduplication does not always enforce 
> hive.optimize.reducededuplication.min.reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-22538
>                 URL: https://issues.apache.org/jira/browse/HIVE-22538
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-22538.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For transactional tables, that property might be overridden to 1, which can 
> lead to merging final aggregation into a single stage (hence leading to 
> performance degradation). For instance, when autogather column stats is 
> enabled, this can happen for the following query:
> {code}
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> EXPLAIN
> CREATE TABLE x STORED AS ORC TBLPROPERTIES('transactional'='true') AS
> SELECT * FROM SRC x CLUSTER BY x.key;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
