[ 
https://issues.apache.org/jira/browse/HIVE-27527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa updated HIVE-27527:
----------------------------------
    Summary: Order of records is not ensured in delete delta files when reduce 
deduplication is off  (was: Order of records are not ensured in delete delta 
files when reduce deduplication is off)

> Order of records is not ensured in delete delta files when reduce 
> deduplication is off
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-27527
>                 URL: https://issues.apache.org/jira/browse/HIVE-27527
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krisztian Kasa
>            Assignee: Krisztian Kasa
>            Priority: Major
>
> When 
> {code}
> set hive.optimize.reducededuplication=false;
> {code}
> Reduce sink operators in delete statements are not merged. Delete delta files
> must be sorted by RowId, and this is ensured by the parent Reduce sink
> operators. In this case the child Reduce sink operator has only a partition
> key column, {{UDFToInteger(_col0)}}, so the sort order may be broken and
> invalid delete delta files may be written.
> The {{Reduce Output Operator}} in {{Map 1}} has sort keys defined (RowId), but
> the one in {{Reducer 2}} has only Map-reduce partition columns.
> {code}
> POSTHOOK: query: explain
> delete from t1 where a = 3
> POSTHOOK: type: QUERY
> POSTHOOK: Input: default@t1
> POSTHOOK: Output: default@t1
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1
>   Stage-0 depends on stages: Stage-2
>   Stage-3 depends on stages: Stage-0
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
> #### A masked pattern was here ####
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
> #### A masked pattern was here ####
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: t1
>                   filterExpr: (a = 3) (type: boolean)
>                   Statistics: Num rows: 30 Data size: 120 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (a = 3) (type: boolean)
>                     Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                         null sort order: z
>                         sort order: +
>                         Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized, llap
>             LLAP IO: may be used (ACID table)
>         Reducer 2 
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                 Reduce Output Operator
>                   null sort order: 
>                   sort order: 
>                   Map-reduce partition columns: UDFToInteger(_col0) (type: int)
>                   Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                   value expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>         Reducer 3 
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: VALUE._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.t1
>                   Write Type: DELETE
>   Stage: Stage-2
>     Dependency Collection
>   Stage: Stage-0
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>               name: default.t1
>           Write Type: DELETE
>   Stage: Stage-3
>     Stats Work
>       Basic Stats Work:
> {code}
> Normally the reduce sink deduplication optimization merges these Reduce Sink
> operators. This Jira covers the case when this optimization is turned off.
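
To illustrate why the missing sort keys in Reducer 2 matter, here is a minimal
Python sketch of the two shuffles in the plan above. It is a toy model, not
Hive code: the names RowId, first_shuffle, second_shuffle and is_sorted are
hypothetical, the partitioner used for the first edge (row id modulo the
reducer count) is only a stand-in, and the order in which upstream task
outputs arrive at the second edge is assumed to be task by task.

```python
from typing import List, NamedTuple


class RowId(NamedTuple):
    """Toy stand-in for ROW__ID: struct<writeid, bucketid, rowid>."""
    write_id: int
    bucket_id: int
    row_id: int


def is_sorted(rows: List[RowId]) -> bool:
    """True if rows are in non-decreasing RowId order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


def first_shuffle(rows: List[RowId], num_reducers: int) -> List[List[RowId]]:
    """Models Map 1 -> Reducer 2: sort keys are defined on RowId,
    so every reducer task sees its share of the rows sorted.
    The modulo partitioner is an assumption made for illustration."""
    partitions: List[List[RowId]] = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[row.row_id % num_reducers].append(row)
    return [sorted(p) for p in partitions]


def second_shuffle(task_outputs: List[List[RowId]],
                   num_buckets: int) -> List[List[RowId]]:
    """Models Reducer 2 -> Reducer 3: rows are partitioned by bucket only
    and no sort keys are defined, so each bucket simply receives rows in
    arrival order (assumed here: one upstream task after another)."""
    buckets: List[List[RowId]] = [[] for _ in range(num_buckets)]
    for output in task_outputs:
        for row in output:
            buckets[row.bucket_id % num_buckets].append(row)
    return buckets


if __name__ == "__main__":
    deleted = [RowId(1, 0, i) for i in range(6)]
    stage1 = first_shuffle(deleted, num_reducers=2)
    stage2 = second_shuffle(stage1, num_buckets=1)
    print("sorted after first shuffle: ", all(is_sorted(p) for p in stage1))
    print("sorted after second shuffle:", is_sorted(stage2[0]))
```

In this toy run each Reducer 2 task holds a sorted run, but the single bucket
written by Reducer 3 interleaves those runs, so the rows are no longer in
RowId order. With the Reduce Sink operators merged, only the sorted shuffle
remains and the problem does not arise in this model.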



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
