[ 
https://issues.apache.org/jira/browse/HIVE-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt McCline updated HIVE-11394:
--------------------------------
    Description: 
Add detail to the EXPLAIN output showing why Map and Reduce work is not 
vectorized.

New syntax is: EXPLAIN VECTORIZATION \[ONLY\] 
\[SUMMARY|OPERATOR|EXPRESSION|DETAIL\]

The ONLY option suppresses most non-vectorization elements.

SUMMARY shows vectorization information for the PLAN (whether vectorization is 
enabled) and a summary of Map and Reduce work.

OPERATOR shows vectorization information for operators, e.g. Filter 
Vectorization.  It also includes all of the SUMMARY information.

EXPRESSION shows vectorization information for expressions, e.g. 
predicateExpression.  It also includes all of the SUMMARY and OPERATOR 
information.

DETAIL shows very detailed vectorization information.
It also includes all of the SUMMARY, OPERATOR, and EXPRESSION information.

The defaults for the optional clauses are non-ONLY and SUMMARY.
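For example, the variants can be combined like this (the query itself is only 
an illustration; the alltypesorc table is the one used in the examples below):

{code}
-- Default: equivalent to EXPLAIN VECTORIZATION SUMMARY.
EXPLAIN VECTORIZATION
SELECT cint, COUNT(*) FROM alltypesorc GROUP BY cint;

-- Operator-level detail, suppressing most non-vectorization output.
EXPLAIN VECTORIZATION ONLY OPERATOR
SELECT cint, COUNT(*) FROM alltypesorc GROUP BY cint;
{code}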

---------------------------------------------------------------------------------------------------

Here are some examples:

EXPLAIN VECTORIZATION example:

(Note the PLAN VECTORIZATION, Map Vectorization, Reduce Vectorization sections)

Since SUMMARY is the default, this is also the output of EXPLAIN VECTORIZATION 
SUMMARY.

Under Reducer 3’s "Reduce Vectorization:" you’ll see
notVectorizedReason: Aggregation Function UDF avg parameter expression for 
GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of 
Column\[VALUE._col2\] not supported

For Reducer 2’s "Reduce Vectorization:" you’ll see "groupByVectorOutput:": 
"false", which indicates the node has a GROUP BY with AVG or some other 
aggregation function that outputs a non-PRIMITIVE type (e.g. STRUCT), so all 
downstream operators run in row mode, i.e. without vector output.

If "usesVectorUDFAdaptor:" were "true" instead of "false", it would indicate 
that at least one vectorized expression uses VectorUDFAdaptor.

And "allNative:": "false" would be "true" if all operators were native.  Today, 
GROUP BY and FILE SINK are not native.  MAP JOIN and REDUCE SINK are 
conditionally native.  FILTER and SELECT are native.

{code}
PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
...
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
...
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: alltypesorc
                  Statistics: Num rows: 12288 Data size: 36696 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: cint (type: int)
                    outputColumnNames: cint
                    Statistics: Num rows: 12288 Data size: 36696 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Group By Operator
                      keys: cint (type: int)
                      mode: hash
                      outputColumnNames: _col0
                      Statistics: Num rows: 5775 Data size: 17248 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 5775 Data size: 17248 Basic 
stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: 
hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 2 
            Execution mode: vectorized, llap
            Reduce Vectorization:
                enabled: true
                enableConditionsMet: hive.vectorized.execution.reduce.enabled 
IS true, hive.execution.engine tez IN [tez, spark] IS true
                groupByVectorOutput: false
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 5775 Data size: 17248 Basic stats: 
COMPLETE Column stats: COMPLETE
                Group By Operator
                  aggregations: sum(_col0), count(_col0), avg(_col0), std(_col0)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 1 Data size: 172 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 172 Basic stats: 
COMPLETE Column stats: COMPLETE
                    value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: struct<count:bigint,sum:double,input:int>), _col3 (type: 
struct<count:bigint,sum:double,variance:double>)
        Reducer 3 
            Execution mode: llap
            Reduce Vectorization:
                enabled: true
                enableConditionsMet: hive.vectorized.execution.reduce.enabled 
IS true, hive.execution.engine tez IN [tez, spark] IS true
                notVectorizedReason: Aggregation Function UDF avg parameter 
expression for GROUPBY operator: Data type 
struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported
                vectorized: false
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0), count(VALUE._col1), 
avg(VALUE._col2), std(VALUE._col3)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE 
Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE 
Column stats: COMPLETE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink 
{code}


EXPLAIN VECTORIZATION OPERATOR

Notice the added TableScan Vectorization, Select Vectorization, Group By 
Vectorization, Map Join Vectorization, and Reduce Sink Vectorization sections 
in this example.

Notice the nativeConditionsMet detail showing why Reduce Sink Vectorization is 
native.

{code}
PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Map 2 <- Map 1 (BROADCAST_EDGE)
        Reducer 3 <- Map 2 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  Statistics: Num rows: 3 Data size: 294 Basic stats: COMPLETE 
Column stats: NONE
                  TableScan Vectorization:
                      native: true
                      projectedOutputColumns: [0, 1]
                  Filter Operator
                    Filter Vectorization:
                        className: VectorFilterOperator
                        native: true
                    predicate: c2 is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 294 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: c1 (type: int), c2 (type: char(10))
                      outputColumnNames: _col0, _col1
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0, 1]
                      Statistics: Num rows: 3 Data size: 294 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col1 (type: char(20))
                        sort order: +
                        Map-reduce partition columns: _col1 (type: char(20))
                        Reduce Sink Vectorization:
                            className: VectorReduceSinkStringOperator
                            native: true
                            nativeConditionsMet: 
hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine 
tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS 
true, No TopN IS true, Uniform Hash IS true, No DISTINCT columns IS true, 
BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
                        Statistics: Num rows: 3 Data size: 294 Basic stats: 
COMPLETE Column stats: NONE
                        value expressions: _col0 (type: int)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: 
hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: true
                usesVectorUDFAdaptor: false
                vectorized: true
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 3 Data size: 324 Basic stats: COMPLETE 
Column stats: NONE
                  TableScan Vectorization:
                      native: true
                      projectedOutputColumns: [0, 1]
                  Filter Operator
                    Filter Vectorization:
                        className: VectorFilterOperator
                        native: true
                    predicate: c2 is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 324 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: c1 (type: int), c2 (type: char(20))
                      outputColumnNames: _col0, _col1
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0, 1]
                      Statistics: Num rows: 3 Data size: 324 Basic stats: 
COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col1 (type: char(20))
                          1 _col1 (type: char(20))
                        Map Join Vectorization:
                            className: VectorMapJoinInnerStringOperator
                            native: true
                            nativeConditionsMet: 
hive.vectorized.execution.mapjoin.native.enabled IS true, hive.execution.engine 
tez IN [tez, spark] IS true, One MapJoin Condition IS true, No nullsafe IS 
true, Supports Key Types IS true, Not empty key IS true, When Fast Hash Table, 
then requires no Hybrid Hash Join IS true, Small table vectorizes IS true
                        outputColumnNames: _col0, _col1, _col2, _col3
                        input vertices:
                          0 Map 1
                        Statistics: Num rows: 3 Data size: 323 Basic stats: 
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          sort order: +
                          Reduce Sink Vectorization:
                              className: VectorReduceSinkOperator
                              native: false
                              nativeConditionsMet: 
hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine 
tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS 
true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for 
keys IS true, LazyBinarySerDe for values IS true
                              nativeConditionsNotMet: Uniform Hash IS false
                          Statistics: Num rows: 3 Data size: 323 Basic stats: 
COMPLETE Column stats: NONE
                          value expressions: _col1 (type: char(10)), _col2 
(type: int), _col3 (type: char(20))
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: 
hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 3 
            Execution mode: vectorized, llap
            Reduce Vectorization:
                enabled: true
                enableConditionsMet: hive.vectorized.execution.reduce.enabled 
IS true, hive.execution.engine tez IN [tez, spark] IS true
                groupByVectorOutput: true
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: 
char(10)), VALUE._col1 (type: int), VALUE._col2 (type: char(20))
                outputColumnNames: _col0, _col1, _col2, _col3
                Select Vectorization:
                    className: VectorSelectOperator
                    native: true
                    projectedOutputColumns: [0, 1, 2, 3]
                Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE 
Column stats: NONE
                File Output Operator
                  compressed: false
                  File Sink Vectorization:
                      className: VectorFileSinkOperator
                      native: false
                  Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

EXPLAIN VECTORIZATION EXPRESSION

Notice the predicateExpression in this example.

{code}
PLAN VECTORIZATION:
  enabled: true
  enabledConditionsMet: [hive.vectorized.execution.enabled IS true]

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: vector_interval_2
                  Statistics: Num rows: 2 Data size: 788 Basic stats: COMPLETE 
Column stats: NONE
                  TableScan Vectorization:
                      native: true
                      projectedOutputColumns: [0, 1, 2, 3, 4, 5]
                  Filter Operator
                    Filter Vectorization:
                        className: VectorFilterOperator
                        native: true
                        predicateExpression: FilterExprAndExpr(children: 
FilterTimestampScalarEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 
6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) 
-> 6:timestamp) -> boolean, FilterTimestampScalarNotEqualTimestampColumn(val 
2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, 
val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampScalarLessEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 
6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) 
-> 6:timestamp) -> boolean, FilterTimestampScalarLessTimestampColumn(val 
2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 1, 
val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampScalarGreaterEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 
6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
01:02:03.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampScalarGreaterTimestampColumn(val 2001-01-01 01:02:03.0, col 
6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
01:02:04.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColEqualTimestampScalar(col 6, val 2001-01-01 
01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
01:02:03.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColNotEqualTimestampScalar(col 6, val 2001-01-01 
01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
01:02:04.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColGreaterEqualTimestampScalar(col 6, val 2001-01-01 
01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
01:02:03.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColGreaterTimestampScalar(col 6, val 2001-01-01 
01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
01:02:04.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColLessEqualTimestampScalar(col 6, val 2001-01-01 
01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
01:02:03.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColLessTimestampScalar(col 6, val 2001-01-01 
01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
01:02:04.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColEqualTimestampColumn(col 0, col 6)(children: 
DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 
6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampColumn(col 0, col 
6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) 
-> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampColumn(col 0, 
col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
01:02:03.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColLessTimestampColumn(col 0, col 6)(children: 
DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 
6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampColumn(col 0, 
col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
01:02:03.000000000) -> 6:timestamp) -> boolean, 
FilterTimestampColGreaterTimestampColumn(col 0, col 6)(children: 
DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 
6:timestamp) -> boolean) -> boolean
                    predicate: ((2001-01-01 01:02:03.0 = (dt + 0 
01:02:03.000000000)) and (2001-01-01 01:02:03.0 <> (dt + 0 01:02:04.000000000)) 
and (2001-01-01 01:02:03.0 <= (dt + 0 01:02:03.000000000)) and (2001-01-01 
01:02:03.0 < (dt + 0 01:02:04.000000000)) and (2001-01-01 01:02:03.0 >= (dt - 0 
01:02:03.000000000)) and (2001-01-01 01:02:03.0 > (dt - 0 01:02:04.000000000)) 
and ((dt + 0 01:02:03.000000000) = 2001-01-01 01:02:03.0) and ((dt + 0 
01:02:04.000000000) <> 2001-01-01 01:02:03.0) and ((dt + 0 01:02:03.000000000) 
>= 2001-01-01 01:02:03.0) and ((dt + 0 01:02:04.000000000) > 2001-01-01 
01:02:03.0) and ((dt - 0 01:02:03.000000000) <= 2001-01-01 01:02:03.0) and ((dt 
- 0 01:02:04.000000000) < 2001-01-01 01:02:03.0) and (ts = (dt + 0 
01:02:03.000000000)) and (ts <> (dt + 0 01:02:04.000000000)) and (ts <= (dt + 0 
01:02:03.000000000)) and (ts < (dt + 0 01:02:04.000000000)) and (ts >= (dt - 0 
01:02:03.000000000)) and (ts > (dt - 0 01:02:04.000000000))) (type: boolean)
                    Statistics: Num rows: 1 Data size: 394 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ts (type: timestamp)
                      outputColumnNames: _col0
                      Select Vectorization:
                          className: VectorSelectOperator
                          native: true
                          projectedOutputColumns: [0]
                      Statistics: Num rows: 1 Data size: 394 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: timestamp)
                        sort order: +
                        Reduce Sink Vectorization:
                            className: VectorReduceSinkOperator
                            native: false
                            nativeConditionsMet: 
hive.vectorized.execution.reducesink.new.enabled IS true, hive.execution.engine 
tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE IS true, No buckets IS 
true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for 
keys IS true, LazyBinarySerDe for values IS true
                            nativeConditionsNotMet: Uniform Hash IS false
                        Statistics: Num rows: 1 Data size: 394 Basic stats: 
COMPLETE Column stats: NONE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
            Map Vectorization:
                enabled: true
                enabledConditionsMet: 
hive.vectorized.use.vectorized.input.format IS true
                groupByVectorOutput: true
                inputFileFormats: 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                allNative: false
                usesVectorUDFAdaptor: false
                vectorized: true
        Reducer 2 
... 
{code}


The standard @Explain Annotation Type is used.  A new 'vectorization' 
annotation marks each new class and method.

It works for FORMATTED, like the other non-vectorization EXPLAIN variations.


  was:
Add detail to the EXPLAIN output showing why Map and Reduce work is not 
vectorized.

New syntax is: EXPLAIN VECTORIZATION \[ONLY\] 
\[SUMMARY|OPERATOR|EXPRESSION|DETAIL\]

The ONLY option suppresses most non-vectorization elements.

SUMMARY shows vectorization information for the PLAN (is vectorization enabled) 
and a summary of Map and Reduce work.

OPERATOR shows vectorization information for operators.  E.g. Filter 
Vectorization.  It includes all information of SUMMARY, too.

EXPRESSION shows vectorization information for expressions.  E.g. 
predicateExpression.  It includes all information of SUMMARY and OPERATOR, too.

DETAIL shows very detailed vectorization information.
It includes all information of SUMMARY, OPERATOR, and EXPRESSION too.

The defaults for the optional clauses are non-ONLY and SUMMARY.

Here are some examples:

EXPLAIN VECTORIZATION example:

(Note the PLAN VECTORIZATION, Map Vectorization, Reduce Vectorization sections)

Since SUMMARY is the default, it is the output of EXPLAIN VECTORIZATION SUMMARY.

{code}
coming soon…
{code}


EXPLAIN VECTORIZATION OPERATOR

Notice the added  Select Vectorization, Group By Vectorization, Reduce Sink 
Vectorization sections in this example.

{code}
coming soon…
{code}

EXPLAIN VECTORIZATION EXPRESSION

Notice the aaaaa in this example.

{code}
coming soon…
{code}


EXPLAIN VECTORIZATION DETAIL

Notice the aaaaa in this example.

{code}
coming soon…
{code}


EXPLAIN VECTORIZATION ONLY example:

{code}
coming soon…
{code}

EXPLAIN VECTORIZATION ONLY OPERATOR example:

{code}
coming soon…
{code}

EXPLAIN VECTORIZATION ONLY EXPRESSION example:

{code}
{code}

EXPLAIN VECTORIZATION ONLY DETAIL example:

{code}
coming soon…
{code}


The standard @Explain Annotation Type is used.  A new 'vectorization' 
annotation marks each new class and method.

Works for FORMATTED, like other non-vectorization EXPLAIN variations.

EXPLAIN VECTORIZATION FORMATTED example:

{code}
coming soon…
{code}

or pretty printed:

{code}
coming soon…
{code}



> Enhance EXPLAIN display for vectorization
> -----------------------------------------
>
>                 Key: HIVE-11394
>                 URL: https://issues.apache.org/jira/browse/HIVE-11394
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>            Priority: Critical
>         Attachments: HIVE-11394.01.patch, HIVE-11394.02.patch, 
> HIVE-11394.03.patch, HIVE-11394.04.patch, HIVE-11394.05.patch, 
> HIVE-11394.06.patch, HIVE-11394.07.patch, HIVE-11394.08.patch, 
> HIVE-11394.09.patch
>
>
> Add detail to the EXPLAIN output showing why a Map and Reduce work is not 
> vectorized.
> New syntax is: EXPLAIN VECTORIZATION \[ONLY\] 
> \[SUMMARY|OPERATOR|EXPRESSION|DETAIL\]
> The ONLY option suppresses most non-vectorization elements.
> SUMMARY shows vectorization information for the PLAN (is vectorization 
> enabled) and a summary of Map and Reduce work.
> OPERATOR shows vectorization information for operators.  E.g. Filter 
> Vectorization.  It includes all information of SUMMARY, too.
> EXPRESSION shows vectorization information for expressions.  E.g. 
> predicateExpression.  It includes all information of SUMMARY and OPERATOR, 
> too.
> DETAIL shows very vectorization information.
> It includes all information of SUMMARY, OPERATOR, and EXPRESSION too.
> The optional clause defaults are not ONLY and SUMMARY.
> ---------------------------------------------------------------------------------------------------
> Here are some examples:
> EXPLAIN VECTORIZATION example:
> (Note the PLAN VECTORIZATION, Map Vectorization, Reduce Vectorization 
> sections)
> Since SUMMARY is the default, it is the output of EXPLAIN VECTORIZATION 
> SUMMARY.
> Under Reducer 3’s "Reduce Vectorization:" you’ll see
> notVectorizedReason: Aggregation Function UDF avg parameter expression for 
> GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of 
> Column\[VALUE._col2\] not supported
> For Reducer 2’s "Reduce Vectorization:" you’ll see "groupByVectorOutput:": 
> "false" which says a node has a GROUP BY with an AVG or some other aggregator 
> that outputs a non-PRIMITIVE type (e.g. STRUCT) and all downstream operators 
> are row-mode.  I.e. not vector output.
> If "usesVectorUDFAdaptor:": "false" were true, it would say there was at 
> least one vectorized expression is using VectorUDFAdaptor.
> And, "allNative:": "false" will be true when all operators are native.  
> Today, GROUP BY and FILE SINK are not native.  MAP JOIN and REDUCE SINK are 
> conditionally native.  FILTER and SELECT are native.
> {code}
> PLAN VECTORIZATION:
>   enabled: true
>   enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
> ...
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
> ...
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: alltypesorc
>                   Statistics: Num rows: 12288 Data size: 36696 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: cint (type: int)
>                     outputColumnNames: cint
>                     Statistics: Num rows: 12288 Data size: 36696 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       keys: cint (type: int)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5775 Data size: 17248 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 5775 Data size: 17248 Basic 
> stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>             Map Vectorization:
>                 enabled: true
>                 enabledConditionsMet: 
> hive.vectorized.use.vectorized.input.format IS true
>                 groupByVectorOutput: true
>                 inputFileFormats: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                 allNative: false
>                 usesVectorUDFAdaptor: false
>                 vectorized: true
>         Reducer 2 
>             Execution mode: vectorized, llap
>             Reduce Vectorization:
>                 enabled: true
>                 enableConditionsMet: hive.vectorized.execution.reduce.enabled 
> IS true, hive.execution.engine tez IN [tez, spark] IS true
>                 groupByVectorOutput: false
>                 allNative: false
>                 usesVectorUDFAdaptor: false
>                 vectorized: true
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 5775 Data size: 17248 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                 Group By Operator
>                   aggregations: sum(_col0), count(_col0), avg(_col0), 
> std(_col0)
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2, _col3
>                   Statistics: Num rows: 1 Data size: 172 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                   Reduce Output Operator
>                     sort order: 
>                     Statistics: Num rows: 1 Data size: 172 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                     value expressions: _col0 (type: bigint), _col1 (type: 
> bigint), _col2 (type: struct<count:bigint,sum:double,input:int>), _col3 
> (type: struct<count:bigint,sum:double,variance:double>)
>         Reducer 3 
>             Execution mode: llap
>             Reduce Vectorization:
>                 enabled: true
>                 enableConditionsMet: hive.vectorized.execution.reduce.enabled 
> IS true, hive.execution.engine tez IN [tez, spark] IS true
>                 notVectorizedReason: Aggregation Function UDF avg parameter 
> expression for GROUPBY operator: Data type 
> struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported
>                 vectorized: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: sum(VALUE._col0), count(VALUE._col1), 
> avg(VALUE._col2), std(VALUE._col3)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1, _col2, _col3
>                 Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink 
> {code}
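> For reference, SUMMARY output like the above comes from a statement of this 
> shape (the table and column names here are illustrative, not from the original 
> test):
> {code}
> SET hive.vectorized.execution.enabled = true;
> 
> -- SUMMARY is the default detail level, so these two forms are equivalent:
> EXPLAIN VECTORIZATION
> SELECT SUM(c1), COUNT(c1), AVG(c1), STD(c1) FROM t1;
> 
> EXPLAIN VECTORIZATION SUMMARY
> SELECT SUM(c1), COUNT(c1), AVG(c1), STD(c1) FROM t1;
> {code}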
> EXPLAIN VECTORIZATION OPERATOR
> Notice the added TableScan Vectorization, Select Vectorization, Group By 
> Vectorization, Map Join Vectorization, and Reduce Sink Vectorization sections in 
> this example.
> Notice the nativeConditionsMet detail on why a Reduce Sink Vectorization is native.
> {code}
> PLAN VECTORIZATION:
>   enabled: true
>   enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
> #### A masked pattern was here ####
>       Edges:
>         Map 2 <- Map 1 (BROADCAST_EDGE)
>         Reducer 3 <- Map 2 (SIMPLE_EDGE)
> #### A masked pattern was here ####
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: a
>                   Statistics: Num rows: 3 Data size: 294 Basic stats: 
> COMPLETE Column stats: NONE
>                   TableScan Vectorization:
>                       native: true
>                       projectedOutputColumns: [0, 1]
>                   Filter Operator
>                     Filter Vectorization:
>                         className: VectorFilterOperator
>                         native: true
>                     predicate: c2 is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 294 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: c1 (type: int), c2 (type: char(10))
>                       outputColumnNames: _col0, _col1
>                       Select Vectorization:
>                           className: VectorSelectOperator
>                           native: true
>                           projectedOutputColumns: [0, 1]
>                       Statistics: Num rows: 3 Data size: 294 Basic stats: 
> COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col1 (type: char(20))
>                         sort order: +
>                         Map-reduce partition columns: _col1 (type: char(20))
>                         Reduce Sink Vectorization:
>                             className: VectorReduceSinkStringOperator
>                             native: true
>                             nativeConditionsMet: 
> hive.vectorized.execution.reducesink.new.enabled IS true, 
> hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE 
> IS true, No buckets IS true, No TopN IS true, Uniform Hash IS true, No 
> DISTINCT columns IS true, BinarySortableSerDe for keys IS true, 
> LazyBinarySerDe for values IS true
>                         Statistics: Num rows: 3 Data size: 294 Basic stats: 
> COMPLETE Column stats: NONE
>                         value expressions: _col0 (type: int)
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>             Map Vectorization:
>                 enabled: true
>                 enabledConditionsMet: 
> hive.vectorized.use.vectorized.input.format IS true
>                 groupByVectorOutput: true
>                 inputFileFormats: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                 allNative: true
>                 usesVectorUDFAdaptor: false
>                 vectorized: true
>         Map 2 
>             Map Operator Tree:
>                 TableScan
>                   alias: b
>                   Statistics: Num rows: 3 Data size: 324 Basic stats: 
> COMPLETE Column stats: NONE
>                   TableScan Vectorization:
>                       native: true
>                       projectedOutputColumns: [0, 1]
>                   Filter Operator
>                     Filter Vectorization:
>                         className: VectorFilterOperator
>                         native: true
>                     predicate: c2 is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 324 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: c1 (type: int), c2 (type: char(20))
>                       outputColumnNames: _col0, _col1
>                       Select Vectorization:
>                           className: VectorSelectOperator
>                           native: true
>                           projectedOutputColumns: [0, 1]
>                       Statistics: Num rows: 3 Data size: 324 Basic stats: 
> COMPLETE Column stats: NONE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col1 (type: char(20))
>                           1 _col1 (type: char(20))
>                         Map Join Vectorization:
>                             className: VectorMapJoinInnerStringOperator
>                             native: true
>                             nativeConditionsMet: 
> hive.vectorized.execution.mapjoin.native.enabled IS true, 
> hive.execution.engine tez IN [tez, spark] IS true, One MapJoin Condition IS 
> true, No nullsafe IS true, Supports Key Types IS true, Not empty key IS true, 
> When Fast Hash Table, then requires no Hybrid Hash Join IS true, Small table 
> vectorizes IS true
>                         outputColumnNames: _col0, _col1, _col2, _col3
>                         input vertices:
>                           0 Map 1
>                         Statistics: Num rows: 3 Data size: 323 Basic stats: 
> COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: _col0 (type: int)
>                           sort order: +
>                           Reduce Sink Vectorization:
>                               className: VectorReduceSinkOperator
>                               native: false
>                               nativeConditionsMet: 
> hive.vectorized.execution.reducesink.new.enabled IS true, 
> hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE 
> IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, 
> BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
>                               nativeConditionsNotMet: Uniform Hash IS false
>                           Statistics: Num rows: 3 Data size: 323 Basic stats: 
> COMPLETE Column stats: NONE
>                           value expressions: _col1 (type: char(10)), _col2 
> (type: int), _col3 (type: char(20))
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>             Map Vectorization:
>                 enabled: true
>                 enabledConditionsMet: 
> hive.vectorized.use.vectorized.input.format IS true
>                 groupByVectorOutput: true
>                 inputFileFormats: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                 allNative: false
>                 usesVectorUDFAdaptor: false
>                 vectorized: true
>         Reducer 3 
>             Execution mode: vectorized, llap
>             Reduce Vectorization:
>                 enabled: true
>                 enableConditionsMet: hive.vectorized.execution.reduce.enabled 
> IS true, hive.execution.engine tez IN [tez, spark] IS true
>                 groupByVectorOutput: true
>                 allNative: false
>                 usesVectorUDFAdaptor: false
>                 vectorized: true
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 
> (type: char(10)), VALUE._col1 (type: int), VALUE._col2 (type: char(20))
>                 outputColumnNames: _col0, _col1, _col2, _col3
>                 Select Vectorization:
>                     className: VectorSelectOperator
>                     native: true
>                     projectedOutputColumns: [0, 1, 2, 3]
>                 Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE 
> Column stats: NONE
>                 File Output Operator
>                   compressed: false
>                   File Sink Vectorization:
>                       className: VectorFileSinkOperator
>                       native: false
>                   Statistics: Num rows: 3 Data size: 323 Basic stats: 
> COMPLETE Column stats: NONE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>  {code}
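> The ONLY modifier combines with any detail level to suppress most of the 
> non-vectorization plan elements.  For example (the table names here are 
> hypothetical):
> {code}
> EXPLAIN VECTORIZATION ONLY OPERATOR
> SELECT a.c1, a.c2, b.c1, b.c2
> FROM small_table a JOIN big_table b ON a.c2 = b.c2
> ORDER BY a.c1;
> {code}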
> EXPLAIN VECTORIZATION EXPRESSION
> Notice the predicateExpression in this example.
> {code}
> PLAN VECTORIZATION:
>   enabled: true
>   enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
> #### A masked pattern was here ####
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
> #### A masked pattern was here ####
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: vector_interval_2
>                   Statistics: Num rows: 2 Data size: 788 Basic stats: 
> COMPLETE Column stats: NONE
>                   TableScan Vectorization:
>                       native: true
>                       projectedOutputColumns: [0, 1, 2, 3, 4, 5]
>                   Filter Operator
>                     Filter Vectorization:
>                         className: VectorFilterOperator
>                         native: true
>                         predicateExpression: FilterExprAndExpr(children: 
> FilterTimestampScalarEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 
> 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) 
> -> 6:timestamp) -> boolean, FilterTimestampScalarNotEqualTimestampColumn(val 
> 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 
> 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampScalarLessEqualTimestampColumn(val 2001-01-01 01:02:03.0, col 
> 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) 
> -> 6:timestamp) -> boolean, FilterTimestampScalarLessTimestampColumn(val 
> 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col 
> 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampScalarGreaterEqualTimestampColumn(val 2001-01-01 01:02:03.0, 
> col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
> 01:02:03.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampScalarGreaterTimestampColumn(val 2001-01-01 01:02:03.0, col 
> 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
> 01:02:04.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColEqualTimestampScalar(col 6, val 2001-01-01 
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
> 01:02:03.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColNotEqualTimestampScalar(col 6, val 2001-01-01 
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
> 01:02:04.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColGreaterEqualTimestampScalar(col 6, val 2001-01-01 
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
> 01:02:03.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColGreaterTimestampScalar(col 6, val 2001-01-01 
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
> 01:02:04.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColLessEqualTimestampScalar(col 6, val 2001-01-01 
> 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
> 01:02:03.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColLessTimestampScalar(col 6, val 2001-01-01 
> 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
> 01:02:04.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColEqualTimestampColumn(col 0, col 6)(children: 
> DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) -> 
> 6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampColumn(col 0, col 
> 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) 
> -> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampColumn(col 0, 
> col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 
> 01:02:03.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColLessTimestampColumn(col 0, col 6)(children: 
> DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 
> 6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampColumn(col 0, 
> col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0 
> 01:02:03.000000000) -> 6:timestamp) -> boolean, 
> FilterTimestampColGreaterTimestampColumn(col 0, col 6)(children: 
> DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) -> 
> 6:timestamp) -> boolean) -> boolean
>                     predicate: ((2001-01-01 01:02:03.0 = (dt + 0 
> 01:02:03.000000000)) and (2001-01-01 01:02:03.0 <> (dt + 0 
> 01:02:04.000000000)) and (2001-01-01 01:02:03.0 <= (dt + 0 
> 01:02:03.000000000)) and (2001-01-01 01:02:03.0 < (dt + 0 
> 01:02:04.000000000)) and (2001-01-01 01:02:03.0 >= (dt - 0 
> 01:02:03.000000000)) and (2001-01-01 01:02:03.0 > (dt - 0 
> 01:02:04.000000000)) and ((dt + 0 01:02:03.000000000) = 2001-01-01 
> 01:02:03.0) and ((dt + 0 01:02:04.000000000) <> 2001-01-01 01:02:03.0) and 
> ((dt + 0 01:02:03.000000000) >= 2001-01-01 01:02:03.0) and ((dt + 0 
> 01:02:04.000000000) > 2001-01-01 01:02:03.0) and ((dt - 0 01:02:03.000000000) 
> <= 2001-01-01 01:02:03.0) and ((dt - 0 01:02:04.000000000) < 2001-01-01 
> 01:02:03.0) and (ts = (dt + 0 01:02:03.000000000)) and (ts <> (dt + 0 
> 01:02:04.000000000)) and (ts <= (dt + 0 01:02:03.000000000)) and (ts < (dt + 
> 0 01:02:04.000000000)) and (ts >= (dt - 0 01:02:03.000000000)) and (ts > (dt 
> - 0 01:02:04.000000000))) (type: boolean)
>                     Statistics: Num rows: 1 Data size: 394 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: ts (type: timestamp)
>                       outputColumnNames: _col0
>                       Select Vectorization:
>                           className: VectorSelectOperator
>                           native: true
>                           projectedOutputColumns: [0]
>                       Statistics: Num rows: 1 Data size: 394 Basic stats: 
> COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: timestamp)
>                         sort order: +
>                         Reduce Sink Vectorization:
>                             className: VectorReduceSinkOperator
>                             native: false
>                             nativeConditionsMet: 
> hive.vectorized.execution.reducesink.new.enabled IS true, 
> hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE 
> IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, 
> BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
>                             nativeConditionsNotMet: Uniform Hash IS false
>                         Statistics: Num rows: 1 Data size: 394 Basic stats: 
> COMPLETE Column stats: NONE
>             Execution mode: vectorized, llap
>             LLAP IO: all inputs
>             Map Vectorization:
>                 enabled: true
>                 enabledConditionsMet: 
> hive.vectorized.use.vectorized.input.format IS true
>                 groupByVectorOutput: true
>                 inputFileFormats: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
>                 allNative: false
>                 usesVectorUDFAdaptor: false
>                 vectorized: true
>         Reducer 2 
> ... 
> {code}
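> The remaining level, DETAIL, is requested the same way and includes all of the 
> SUMMARY, OPERATOR, and EXPRESSION information.  A sketch (the WHERE clause is an 
> abbreviated, illustrative version of the interval test query above):
> {code}
> EXPLAIN VECTORIZATION ONLY DETAIL
> SELECT ts FROM vector_interval_2
> WHERE ts = dt + INTERVAL '0 01:02:03' DAY TO SECOND;
> {code}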
> The standard @Explain annotation type is used.  A new 'vectorization' 
> annotation marks each new class and method.
> EXPLAIN VECTORIZATION also works with FORMATTED, like the other 
> non-vectorization EXPLAIN variations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
