[
https://issues.apache.org/jira/browse/HIVE-27527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krisztian Kasa updated HIVE-27527:
----------------------------------
Summary: Order of records is not ensured in delete delta files when reduce
deduplication is off (was: Order of records are not ensured in delete delta
files when reduce deduplication is off)
> Order of records is not ensured in delete delta files when reduce
> deduplication is off
> --------------------------------------------------------------------------------------
>
> Key: HIVE-27527
> URL: https://issues.apache.org/jira/browse/HIVE-27527
> Project: Hive
> Issue Type: Bug
> Reporter: Krisztian Kasa
> Assignee: Krisztian Kasa
> Priority: Major
>
> When
> {code}
> set hive.optimize.reducededuplication=false;
> {code}
> Reduce sink operators in delete statements are not merged. Delete delta files
> must be sorted by RowID, and this is ensured by the parent Reduce sink
> operator. In this case the child Reduce sink operator has only a partition
> key column, {{UDFToInteger(_col0)}}, and no sort key, so the sort order may be
> broken and invalid delete delta files may be written.
> The {{Reduce Output Operator}} in {{Map 1}} has sort keys defined (RowId), but
> the one in {{Reducer 2}} has only Map-reduce partition columns.
> {code}
> POSTHOOK: query: explain
> delete from t1 where a = 3
> POSTHOOK: type: QUERY
> POSTHOOK: Input: default@t1
> POSTHOOK: Output: default@t1
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1
>   Stage-0 depends on stages: Stage-2
>   Stage-3 depends on stages: Stage-0
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
> #### A masked pattern was here ####
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
> #### A masked pattern was here ####
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: t1
>                   filterExpr: (a = 3) (type: boolean)
>                   Statistics: Num rows: 30 Data size: 120 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (a = 3) (type: boolean)
>                     Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                         null sort order: z
>                         sort order: +
>                         Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized, llap
>             LLAP IO: may be used (ACID table)
>         Reducer 2
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                 Reduce Output Operator
>                   null sort order:
>                   sort order:
>                   Map-reduce partition columns: UDFToInteger(_col0) (type: int)
>                   Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                   value expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>         Reducer 3
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: VALUE._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.t1
>                   Write Type: DELETE
>   Stage: Stage-2
>     Dependency Collection
>   Stage: Stage-0
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>               name: default.t1
>           Write Type: DELETE
>   Stage: Stage-3
>     Stats Work
>       Basic Stats Work:
> {code}
> Normally the reduce sink deduplication optimization merges these Reduce Sink
> operators. This Jira covers the case when the optimization is turned off.
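
The ordering problem in this issue can be illustrated with a toy model. The sketch below is not Hive code: {{RowId}}, {{shuffle}} and {{is_sorted}} are hypothetical names, and routing rows by bucket id is an assumption about what {{UDFToInteger(_col0)}} does to a RowId. It only shows that rows which were sorted by each upstream task can arrive out of order after a second shuffle that has partition columns but no sort key.

```python
from collections import defaultdict
from typing import NamedTuple


class RowId(NamedTuple):
    """Toy stand-in for Hive's ROW__ID struct; tuples compare field by field."""
    writeid: int
    bucketid: int
    rowid: int


def is_sorted(rows):
    """Return True if every record is <= its successor."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


def shuffle(upstream_outputs, num_partitions, sort_key=False):
    """Route rows to partitions by bucket id (assumed partitioning rule).

    With sort_key=False rows are appended in arrival order, which models a
    Reduce Output Operator that has partition columns but an empty sort order.
    With sort_key=True each partition is sorted by RowId before it is returned.
    """
    partitions = defaultdict(list)
    for task_rows in upstream_outputs:
        for row in task_rows:
            partitions[row.bucketid % num_partitions].append(row)
    if sort_key:
        for rows in partitions.values():
            rows.sort()
    return dict(partitions)


# Two hypothetical upstream tasks, each emitting rows already sorted by RowId.
task_a = [RowId(1, 0, 2), RowId(1, 0, 5)]
task_b = [RowId(1, 0, 1), RowId(1, 0, 3)]

without_sort = shuffle([task_a, task_b], num_partitions=1)
with_sort = shuffle([task_a, task_b], num_partitions=1, sort_key=True)

print(is_sorted(without_sort[0]))  # False: rows from task_b land after task_a
print(is_sorted(with_sort[0]))     # True: a sort key on the shuffle restores order
```

In this model, ordering within each upstream task is not enough. The last shuffle before the writer also needs the RowId sort key, or the two shuffles need to be merged into one.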