[
https://issues.apache.org/jira/browse/HIVE-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560876#comment-17560876
]
Stamatis Zampetakis commented on HIVE-26365:
--------------------------------------------
I will try to reformulate to make sure I understand the problem. [~kkasa], let
me know if I got it right.
At the moment we collect column statistics while executing a MERGE statement,
but they can never be used, since they are explicitly marked invalid at the end
of the execution. To save resources, we would like to drop automatic column
stats collection entirely from MERGE statement plans. All in all, this is a
performance improvement.
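A minimal sketch of the observable behavior (illustrative only; the table name
is reused from the example below, and the exact parameter output varies by
version):
{code}
-- After a MERGE touches acidTbl_n0, the COLUMN_STATS_ACCURATE entry in the
-- table parameters no longer covers columns a and b, so any column stats
-- gathered during the MERGE itself are never usable.
DESCRIBE FORMATTED acidTbl_n0;

-- The only way to get valid column stats back is an explicit rescan:
ANALYZE TABLE acidTbl_n0 COMPUTE STATISTICS FOR COLUMNS;
{code}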
> Remove column statistics collection task from merge statement plan
> -------------------------------------------------------------------
>
> Key: HIVE-26365
> URL: https://issues.apache.org/jira/browse/HIVE-26365
> Project: Hive
> Issue Type: Sub-task
> Reporter: Krisztian Kasa
> Assignee: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Merge statements may contain delete and update branches. An update is
> technically a delete followed by an insert. For delete operations, column
> statistics such as min and max cannot be recalculated from the changed
> records alone.
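> For instance (hypothetical data, not from this ticket), suppose column a
> holds the values 1, 5 and 9, so the stored max(a) is 9:
> {code}
> -- Delete the row that carries the current maximum.
> DELETE FROM t WHERE a = 9;
> -- The delete branch only sees the removed record (a = 9). The new max(a) = 5
> -- exists only in the untouched rows, so the stored stat can be invalidated
> -- but not recomputed without rescanning the table.
> {code}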
> Currently Hive marks the column stats of the target table invalid after an
> Update/Delete/Merge, yet for merge it still generates extra GBY operators and
> reducers on the insert branches to calculate column stats, and the Stats Work
> stages collect column stats as well:
> {code}
> POSTHOOK: query: explain
> merge into acidTbl_n0 as t using nonAcidOrcTbl_n0 s ON t.a = s.a
> WHEN MATCHED AND s.a > 8 THEN DELETE
> WHEN MATCHED THEN UPDATE SET b = 7
> WHEN NOT MATCHED THEN INSERT VALUES(s.a, s.b)
> POSTHOOK: type: QUERY
> POSTHOOK: Input: default@acidtbl_n0
> POSTHOOK: Input: default@nonacidorctbl_n0
> POSTHOOK: Output: default@acidtbl_n0
> POSTHOOK: Output: default@acidtbl_n0
> POSTHOOK: Output: default@merge_tmp_table
> STAGE DEPENDENCIES:
>   Stage-5 is a root stage
>   Stage-6 depends on stages: Stage-5
>   Stage-0 depends on stages: Stage-6
>   Stage-7 depends on stages: Stage-0
>   Stage-1 depends on stages: Stage-6
>   Stage-8 depends on stages: Stage-1
>   Stage-2 depends on stages: Stage-6
>   Stage-9 depends on stages: Stage-2
>   Stage-3 depends on stages: Stage-6
>   Stage-10 depends on stages: Stage-3
>   Stage-4 depends on stages: Stage-6
>   Stage-11 depends on stages: Stage-4
> STAGE PLANS:
>   Stage: Stage-5
>     Tez
> #### A masked pattern was here ####
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 10 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>         Reducer 4 <- Reducer 2 (SIMPLE_EDGE)
>         Reducer 5 <- Reducer 2 (SIMPLE_EDGE)
>         Reducer 6 <- Reducer 5 (CUSTOM_SIMPLE_EDGE)
>         Reducer 7 <- Reducer 2 (SIMPLE_EDGE)
>         Reducer 8 <- Reducer 7 (CUSTOM_SIMPLE_EDGE)
>         Reducer 9 <- Reducer 2 (SIMPLE_EDGE)
> #### A masked pattern was here ####
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: s
>                   Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: a (type: int), b (type: int)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       key expressions: _col0 (type: int)
>                       null sort order: z
>                       sort order: +
>                       Map-reduce partition columns: _col0 (type: int)
>                       Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: _col1 (type: int)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>         Map 10
>             Map Operator Tree:
>                 TableScan
>                   alias: t
>                   filterExpr: a is not null (type: boolean)
>                   Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: a is not null (type: boolean)
>                     Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: a (type: int), ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 2 Data size: 160 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: z
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 2 Data size: 160 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col1 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>             Execution mode: vectorized, llap
>             LLAP IO: may be used (ACID table)
>         Reducer 2
>             Execution mode: llap
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Left Outer Join 0 to 1
>                 keys:
>                   0 _col0 (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col0, _col1, _col2, _col3
>                 Statistics: Num rows: 6 Data size: 288 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: _col3 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>), _col1 (type: int), _col2 (type: int), _col0 (type: int)
>                   outputColumnNames: _col0, _col1, _col2, _col3
>                   Statistics: Num rows: 6 Data size: 288 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((_col2 = _col3) and (_col3 > 8)) (type: boolean)
>                     Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                         null sort order: z
>                         sort order: +
>                         Map-reduce partition columns: UDFToInteger(_col0) (type: int)
>                         Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((_col2 = _col3) and (_col3 <= 8)) (type: boolean)
>                     Statistics: Num rows: 2 Data size: 176 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                         null sort order: z
>                         sort order: +
>                         Map-reduce partition columns: UDFToInteger(_col0) (type: int)
>                         Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ((_col2 = _col3) and (_col3 <= 8)) (type: boolean)
>                     Statistics: Num rows: 2 Data size: 176 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col2 (type: int), 7 (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col1 (type: int)
>                   Filter Operator
>                     predicate: _col2 is null (type: boolean)
>                     Statistics: Num rows: 4 Data size: 192 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col3 (type: int), _col1 (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col1 (type: int)
>                   Filter Operator
>                     predicate: (_col2 = _col3) (type: boolean)
>                     Statistics: Num rows: 3 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 3 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
>                       Group By Operator
>                         aggregations: count()
>                         keys: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                         minReductionHashAggr: 0.4
>                         mode: hash
>                         outputColumnNames: _col0, _col1
>                         Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE
>                         Reduce Output Operator
>                           key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                           null sort order: z
>                           sort order: +
>                           Map-reduce partition columns: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                           Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE
>                           value expressions: _col1 (type: bigint)
>         Reducer 3
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.acidtbl_n0
>                   Write Type: DELETE
>         Reducer 4
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 2 Data size: 152 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.acidtbl_n0
>                   Write Type: DELETE
>         Reducer 5
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.acidtbl_n0
>                   Write Type: INSERT
>                 Select Operator
>                   expressions: _col0 (type: int), _col1 (type: int)
>                   outputColumnNames: a, b
>                   Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                   Group By Operator
>                     aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b)
>                     minReductionHashAggr: 0.5
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                     Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       null sort order:
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary)
>         Reducer 6
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary)
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
>                   Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 7
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.acidtbl_n0
>                   Write Type: INSERT
>                 Select Operator
>                   expressions: _col0 (type: int), _col1 (type: int)
>                   outputColumnNames: a, b
>                   Statistics: Num rows: 4 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
>                   Group By Operator
>                     aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b)
>                     minReductionHashAggr: 0.75
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                     Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       null sort order:
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary)
>         Reducer 8
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary)
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
>                   Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 9
>             Execution mode: llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 2 Data size: 168 Basic stats: COMPLETE Column stats: COMPLETE
>                 Filter Operator
>                   predicate: (_col1 > 1L) (type: boolean)
>                   Statistics: Num rows: 1 Data size: 84 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: cardinality_violation(_col0) (type: int)
>                     outputColumnNames: _col0
>                     Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
>                       table:
>                           input format: org.apache.hadoop.mapred.TextInputFormat
>                           output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                           name: default.merge_tmp_table
>   Stage: Stage-6
>     Dependency Collection
>   Stage: Stage-0
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>               name: default.acidtbl_n0
>           Write Type: DELETE
>   Stage: Stage-7
>     Stats Work
>       Basic Stats Work:
>   Stage: Stage-1
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>               name: default.acidtbl_n0
>           Write Type: DELETE
>   Stage: Stage-8
>     Stats Work
>       Basic Stats Work:
>   Stage: Stage-2
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>               name: default.acidtbl_n0
>           Write Type: INSERT
>   Stage: Stage-9
>     Stats Work
>       Basic Stats Work:
>   Stage: Stage-3
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>               name: default.acidtbl_n0
>           Write Type: INSERT
>   Stage: Stage-10
>     Stats Work
>       Basic Stats Work:
>       Column Stats Desc:
>           Columns: a, b
>           Column Types: int, int
>           Table: default.acidtbl_n0
>   Stage: Stage-4
>     Move Operator
>       tables:
>           replace: false
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: default.merge_tmp_table
>   Stage: Stage-11
>     Stats Work
>       Basic Stats Work:
> {code}
> Here is one of the insert reducers together with the follow-up reducer that collects column stats:
> {code}
>         Reducer 5
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>                       serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
>                       name: default.acidtbl_n0
>                   Write Type: INSERT
>                 Select Operator
>                   expressions: _col0 (type: int), _col1 (type: int)
>                   outputColumnNames: a, b
>                   Statistics: Num rows: 2 Data size: 16 Basic stats: COMPLETE Column stats: COMPLETE
>                   Group By Operator
>                     aggregations: min(a), max(a), count(1), count(a), compute_bit_vector_hll(a), min(b), max(b), count(b), compute_bit_vector_hll(b)
>                     minReductionHashAggr: 0.5
>                     mode: hash
>                     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                     Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       null sort order:
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                       value expressions: _col0 (type: int), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: binary), _col5 (type: int), _col6 (type: int), _col7 (type: bigint), _col8 (type: binary)
>         Reducer 6
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: min(VALUE._col0), max(VALUE._col1), count(VALUE._col2), count(VALUE._col3), compute_bit_vector_hll(VALUE._col4), min(VALUE._col5), max(VALUE._col6), count(VALUE._col7), compute_bit_vector_hll(VALUE._col8)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                 Statistics: Num rows: 1 Data size: 328 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: 'LONG' (type: string), UDFToLong(_col0) (type: bigint), UDFToLong(_col1) (type: bigint), (_col2 - _col3) (type: bigint), COALESCE(ndv_compute_bit_vector(_col4),0) (type: bigint), _col4 (type: binary), 'LONG' (type: string), UDFToLong(_col5) (type: bigint), UDFToLong(_col6) (type: bigint), (_col2 - _col7) (type: bigint), COALESCE(ndv_compute_bit_vector(_col8),0) (type: bigint), _col8 (type: binary)
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
>                   Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 528 Basic stats: COMPLETE Column stats: COMPLETE
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> {code}
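> A rough way to preview the effect of removing the column stats collection
> task (sketch only; hive.stats.column.autogather is Hive's standard toggle for
> automatic column stats gathering, not the patch itself) is to disable
> autogather and compare the plans:
> {code}
> SET hive.stats.column.autogather=false;
> EXPLAIN
> merge into acidTbl_n0 as t using nonAcidOrcTbl_n0 s ON t.a = s.a
> WHEN MATCHED AND s.a > 8 THEN DELETE
> WHEN MATCHED THEN UPDATE SET b = 7
> WHEN NOT MATCHED THEN INSERT VALUES(s.a, s.b);
> -- With autogather off, the insert branches lose the extra Group By operators
> -- and the follow-up stats reducers (Reducer 6 / Reducer 8 above).
> {code}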