[
https://issues.apache.org/jira/browse/HIVE-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matt McCline updated HIVE-11394:
--------------------------------
Attachment: HIVE-11394.094.patch
Warm this patch back up.
> Enhance EXPLAIN display for vectorization
> -----------------------------------------
>
> Key: HIVE-11394
> URL: https://issues.apache.org/jira/browse/HIVE-11394
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: Matt McCline
> Assignee: Matt McCline
> Priority: Critical
> Fix For: 2.2.0
>
> Attachments: HIVE-11394.01.patch, HIVE-11394.02.patch,
> HIVE-11394.03.patch, HIVE-11394.04.patch, HIVE-11394.05.patch,
> HIVE-11394.06.patch, HIVE-11394.07.patch, HIVE-11394.08.patch,
> HIVE-11394.09.patch, HIVE-11394.091.patch, HIVE-11394.092.patch,
> HIVE-11394.093.patch, HIVE-11394.094.patch
>
>
> Add detail to the EXPLAIN output showing why Map and Reduce work is not
> vectorized.
> The new syntax is: EXPLAIN VECTORIZATION [ONLY]
> [SUMMARY|OPERATOR|EXPRESSION|DETAIL]
> The ONLY option suppresses most non-vectorization elements.
> SUMMARY shows vectorization information for the PLAN (whether vectorization
> is enabled) and a summary of Map and Reduce work.
> OPERATOR shows vectorization information for operators (e.g. Filter
> Vectorization), in addition to everything in SUMMARY.
> EXPRESSION shows vectorization information for expressions (e.g.
> predicateExpression), in addition to everything in SUMMARY and OPERATOR.
> DETAIL shows very detailed vectorization information, in addition to
> everything in SUMMARY, OPERATOR, and EXPRESSION.
> When the optional clauses are omitted, ONLY is off and the level defaults to
> SUMMARY.
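> For example, with a placeholder query:
> {code}
> -- Default: SUMMARY level with the full EXPLAIN output:
> EXPLAIN VECTORIZATION SELECT ...;
> -- Maximum detail, suppressing most non-vectorization elements:
> EXPLAIN VECTORIZATION ONLY DETAIL SELECT ...;
> {code}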
> ---------------------------------------------------------------------------------------------------
> Here are some examples:
> EXPLAIN VECTORIZATION example:
> (Note the PLAN VECTORIZATION, Map Vectorization, and Reduce Vectorization
> sections.)
> Since SUMMARY is the default, this is also the output of EXPLAIN
> VECTORIZATION SUMMARY.
> Under Reducer 3's "Reduce Vectorization:" you'll see
> notVectorizedReason: Aggregation Function UDF avg parameter expression for
> GROUPBY operator: Data type struct<count:bigint,sum:double,input:int> of
> Column[VALUE._col2] not supported
> Under Reducer 2's "Reduce Vectorization:" you'll see "groupByVectorOutput:":
> "false", which means the node has a GROUP BY with an AVG or some other
> aggregator that outputs a non-PRIMITIVE type (e.g. STRUCT), so all downstream
> operators run in row mode, i.e. without vector output.
> If "usesVectorUDFAdaptor:" were "true", it would mean at least one vectorized
> expression is using VectorUDFAdaptor.
> "allNative:" will be "true" when all operators are native. Today, GROUP BY
> and FILE SINK are not native, MAP JOIN and REDUCE SINK are conditionally
> native, and FILTER and SELECT are native.
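> One query shape that could produce the plan below (reconstructed from the
> operator tree; the actual source query is not quoted in this issue) is an
> aggregation over the distinct cint values of alltypesorc:
> {code}
> EXPLAIN VECTORIZATION
> SELECT SUM(cint), COUNT(cint), AVG(cint), STD(cint)
> FROM (SELECT cint FROM alltypesorc GROUP BY cint) t;
> {code}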
> {code}
> PLAN VECTORIZATION:
> enabled: true
> enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> ...
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
> ...
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: alltypesorc
> Statistics: Num rows: 12288 Data size: 36696 Basic stats:
> COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: cint (type: int)
> outputColumnNames: cint
> Statistics: Num rows: 12288 Data size: 36696 Basic stats:
> COMPLETE Column stats: COMPLETE
> Group By Operator
> keys: cint (type: int)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 5775 Data size: 17248 Basic
> stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 5775 Data size: 17248 Basic
> stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized, llap
> LLAP IO: all inputs
> Map Vectorization:
> enabled: true
> enabledConditionsMet:
> hive.vectorized.use.vectorized.input.format IS true
> groupByVectorOutput: true
> inputFileFormats:
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> allNative: false
> usesVectorUDFAdaptor: false
> vectorized: true
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Vectorization:
> enabled: true
> enableConditionsMet: hive.vectorized.execution.reduce.enabled
> IS true, hive.execution.engine tez IN [tez, spark] IS true
> groupByVectorOutput: false
> allNative: false
> usesVectorUDFAdaptor: false
> vectorized: true
> Reduce Operator Tree:
> Group By Operator
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 5775 Data size: 17248 Basic stats:
> COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: sum(_col0), count(_col0), avg(_col0),
> std(_col0)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 1 Data size: 172 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 172 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint), _col1 (type:
> bigint), _col2 (type: struct<count:bigint,sum:double,input:int>), _col3
> (type: struct<count:bigint,sum:double,variance:double>)
> Reducer 3
> Execution mode: llap
> Reduce Vectorization:
> enabled: true
> enableConditionsMet: hive.vectorized.execution.reduce.enabled
> IS true, hive.execution.engine tez IN [tez, spark] IS true
> notVectorizedReason: Aggregation Function UDF avg parameter
> expression for GROUPBY operator: Data type
> struct<count:bigint,sum:double,input:int> of Column[VALUE._col2] not supported
> vectorized: false
> Reduce Operator Tree:
> Group By Operator
> aggregations: sum(VALUE._col0), count(VALUE._col1),
> avg(VALUE._col2), std(VALUE._col3)
> mode: mergepartial
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE
> Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE
> Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {code}
> EXPLAIN VECTORIZATION OPERATOR
> Notice the added TableScan Vectorization, Filter Vectorization, Select
> Vectorization, Map Join Vectorization, and Reduce Sink Vectorization sections
> in this example.
> Notice the nativeConditionsMet detail on why the Map Join and Reduce Sink
> operators are native (and nativeConditionsNotMet when they are not).
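> A join of roughly this shape (reconstructed from the plan; the actual query
> is not quoted in this issue) would produce the example:
> {code}
> EXPLAIN VECTORIZATION OPERATOR
> SELECT a.c1, a.c2, b.c1, b.c2
> FROM a JOIN b ON (a.c2 = b.c2)
> ORDER BY a.c1;
> {code}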
> {code}
> PLAN VECTORIZATION:
> enabled: true
> enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> #### A masked pattern was here ####
> Edges:
> Map 2 <- Map 1 (BROADCAST_EDGE)
> Reducer 3 <- Map 2 (SIMPLE_EDGE)
> #### A masked pattern was here ####
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: a
> Statistics: Num rows: 3 Data size: 294 Basic stats:
> COMPLETE Column stats: NONE
> TableScan Vectorization:
> native: true
> projectedOutputColumns: [0, 1]
> Filter Operator
> Filter Vectorization:
> className: VectorFilterOperator
> native: true
> predicate: c2 is not null (type: boolean)
> Statistics: Num rows: 3 Data size: 294 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: c1 (type: int), c2 (type: char(10))
> outputColumnNames: _col0, _col1
> Select Vectorization:
> className: VectorSelectOperator
> native: true
> projectedOutputColumns: [0, 1]
> Statistics: Num rows: 3 Data size: 294 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col1 (type: char(20))
> sort order: +
> Map-reduce partition columns: _col1 (type: char(20))
> Reduce Sink Vectorization:
> className: VectorReduceSinkStringOperator
> native: true
> nativeConditionsMet:
> hive.vectorized.execution.reducesink.new.enabled IS true,
> hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE
> IS true, No buckets IS true, No TopN IS true, Uniform Hash IS true, No
> DISTINCT columns IS true, BinarySortableSerDe for keys IS true,
> LazyBinarySerDe for values IS true
> Statistics: Num rows: 3 Data size: 294 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: int)
> Execution mode: vectorized, llap
> LLAP IO: all inputs
> Map Vectorization:
> enabled: true
> enabledConditionsMet:
> hive.vectorized.use.vectorized.input.format IS true
> groupByVectorOutput: true
> inputFileFormats:
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> allNative: true
> usesVectorUDFAdaptor: false
> vectorized: true
> Map 2
> Map Operator Tree:
> TableScan
> alias: b
> Statistics: Num rows: 3 Data size: 324 Basic stats:
> COMPLETE Column stats: NONE
> TableScan Vectorization:
> native: true
> projectedOutputColumns: [0, 1]
> Filter Operator
> Filter Vectorization:
> className: VectorFilterOperator
> native: true
> predicate: c2 is not null (type: boolean)
> Statistics: Num rows: 3 Data size: 324 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: c1 (type: int), c2 (type: char(20))
> outputColumnNames: _col0, _col1
> Select Vectorization:
> className: VectorSelectOperator
> native: true
> projectedOutputColumns: [0, 1]
> Statistics: Num rows: 3 Data size: 324 Basic stats:
> COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col1 (type: char(20))
> 1 _col1 (type: char(20))
> Map Join Vectorization:
> className: VectorMapJoinInnerStringOperator
> native: true
> nativeConditionsMet:
> hive.vectorized.execution.mapjoin.native.enabled IS true,
> hive.execution.engine tez IN [tez, spark] IS true, One MapJoin Condition IS
> true, No nullsafe IS true, Supports Key Types IS true, Not empty key IS true,
> When Fast Hash Table, then requires no Hybrid Hash Join IS true, Small table
> vectorizes IS true
> outputColumnNames: _col0, _col1, _col2, _col3
> input vertices:
> 0 Map 1
> Statistics: Num rows: 3 Data size: 323 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> sort order: +
> Reduce Sink Vectorization:
> className: VectorReduceSinkOperator
> native: false
> nativeConditionsMet:
> hive.vectorized.execution.reducesink.new.enabled IS true,
> hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE
> IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true,
> BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
> nativeConditionsNotMet: Uniform Hash IS false
> Statistics: Num rows: 3 Data size: 323 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col1 (type: char(10)), _col2
> (type: int), _col3 (type: char(20))
> Execution mode: vectorized, llap
> LLAP IO: all inputs
> Map Vectorization:
> enabled: true
> enabledConditionsMet:
> hive.vectorized.use.vectorized.input.format IS true
> groupByVectorOutput: true
> inputFileFormats:
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> allNative: false
> usesVectorUDFAdaptor: false
> vectorized: true
> Reducer 3
> Execution mode: vectorized, llap
> Reduce Vectorization:
> enabled: true
> enableConditionsMet: hive.vectorized.execution.reduce.enabled
> IS true, hive.execution.engine tez IN [tez, spark] IS true
> groupByVectorOutput: true
> allNative: false
> usesVectorUDFAdaptor: false
> vectorized: true
> Reduce Operator Tree:
> Select Operator
> expressions: KEY.reducesinkkey0 (type: int), VALUE._col0
> (type: char(10)), VALUE._col1 (type: int), VALUE._col2 (type: char(20))
> outputColumnNames: _col0, _col1, _col2, _col3
> Select Vectorization:
> className: VectorSelectOperator
> native: true
> projectedOutputColumns: [0, 1, 2, 3]
> Statistics: Num rows: 3 Data size: 323 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
> compressed: false
> File Sink Vectorization:
> className: VectorFileSinkOperator
> native: false
> Statistics: Num rows: 3 Data size: 323 Basic stats:
> COMPLETE Column stats: NONE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {code}
> EXPLAIN VECTORIZATION EXPRESSION
> Notice the predicateExpression in this example.
> {code}
> PLAN VECTORIZATION:
> enabled: true
> enabledConditionsMet: [hive.vectorized.execution.enabled IS true]
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> #### A masked pattern was here ####
> Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> #### A masked pattern was here ####
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: vector_interval_2
> Statistics: Num rows: 2 Data size: 788 Basic stats:
> COMPLETE Column stats: NONE
> TableScan Vectorization:
> native: true
> projectedOutputColumns: [0, 1, 2, 3, 4, 5]
> Filter Operator
> Filter Vectorization:
> className: VectorFilterOperator
> native: true
> predicateExpression: FilterExprAndExpr(children:
> FilterTimestampScalarEqualTimestampColumn(val 2001-01-01 01:02:03.0, col
> 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000)
> -> 6:timestamp) -> boolean, FilterTimestampScalarNotEqualTimestampColumn(val
> 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col
> 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampScalarLessEqualTimestampColumn(val 2001-01-01 01:02:03.0, col
> 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000)
> -> 6:timestamp) -> boolean, FilterTimestampScalarLessTimestampColumn(val
> 2001-01-01 01:02:03.0, col 6)(children: DateColAddIntervalDayTimeScalar(col
> 1, val 0 01:02:04.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampScalarGreaterEqualTimestampColumn(val 2001-01-01 01:02:03.0,
> col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0
> 01:02:03.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampScalarGreaterTimestampColumn(val 2001-01-01 01:02:03.0, col
> 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0
> 01:02:04.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColEqualTimestampScalar(col 6, val 2001-01-01
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0
> 01:02:03.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColNotEqualTimestampScalar(col 6, val 2001-01-01
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0
> 01:02:04.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColGreaterEqualTimestampScalar(col 6, val 2001-01-01
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0
> 01:02:03.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColGreaterTimestampScalar(col 6, val 2001-01-01
> 01:02:03.0)(children: DateColAddIntervalDayTimeScalar(col 1, val 0
> 01:02:04.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColLessEqualTimestampScalar(col 6, val 2001-01-01
> 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0
> 01:02:03.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColLessTimestampScalar(col 6, val 2001-01-01
> 01:02:03.0)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0
> 01:02:04.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColEqualTimestampColumn(col 0, col 6)(children:
> DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:03.000000000) ->
> 6:timestamp) -> boolean, FilterTimestampColNotEqualTimestampColumn(col 0, col
> 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000)
> -> 6:timestamp) -> boolean, FilterTimestampColLessEqualTimestampColumn(col 0,
> col 6)(children: DateColAddIntervalDayTimeScalar(col 1, val 0
> 01:02:03.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColLessTimestampColumn(col 0, col 6)(children:
> DateColAddIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) ->
> 6:timestamp) -> boolean, FilterTimestampColGreaterEqualTimestampColumn(col 0,
> col 6)(children: DateColSubtractIntervalDayTimeScalar(col 1, val 0
> 01:02:03.000000000) -> 6:timestamp) -> boolean,
> FilterTimestampColGreaterTimestampColumn(col 0, col 6)(children:
> DateColSubtractIntervalDayTimeScalar(col 1, val 0 01:02:04.000000000) ->
> 6:timestamp) -> boolean) -> boolean
> predicate: ((2001-01-01 01:02:03.0 = (dt + 0
> 01:02:03.000000000)) and (2001-01-01 01:02:03.0 <> (dt + 0
> 01:02:04.000000000)) and (2001-01-01 01:02:03.0 <= (dt + 0
> 01:02:03.000000000)) and (2001-01-01 01:02:03.0 < (dt + 0
> 01:02:04.000000000)) and (2001-01-01 01:02:03.0 >= (dt - 0
> 01:02:03.000000000)) and (2001-01-01 01:02:03.0 > (dt - 0
> 01:02:04.000000000)) and ((dt + 0 01:02:03.000000000) = 2001-01-01
> 01:02:03.0) and ((dt + 0 01:02:04.000000000) <> 2001-01-01 01:02:03.0) and
> ((dt + 0 01:02:03.000000000) >= 2001-01-01 01:02:03.0) and ((dt + 0
> 01:02:04.000000000) > 2001-01-01 01:02:03.0) and ((dt - 0 01:02:03.000000000)
> <= 2001-01-01 01:02:03.0) and ((dt - 0 01:02:04.000000000) < 2001-01-01
> 01:02:03.0) and (ts = (dt + 0 01:02:03.000000000)) and (ts <> (dt + 0
> 01:02:04.000000000)) and (ts <= (dt + 0 01:02:03.000000000)) and (ts < (dt +
> 0 01:02:04.000000000)) and (ts >= (dt - 0 01:02:03.000000000)) and (ts > (dt
> - 0 01:02:04.000000000))) (type: boolean)
> Statistics: Num rows: 1 Data size: 394 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: ts (type: timestamp)
> outputColumnNames: _col0
> Select Vectorization:
> className: VectorSelectOperator
> native: true
> projectedOutputColumns: [0]
> Statistics: Num rows: 1 Data size: 394 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: timestamp)
> sort order: +
> Reduce Sink Vectorization:
> className: VectorReduceSinkOperator
> native: false
> nativeConditionsMet:
> hive.vectorized.execution.reducesink.new.enabled IS true,
> hive.execution.engine tez IN [tez, spark] IS true, Not ACID UPDATE or DELETE
> IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true,
> BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true
> nativeConditionsNotMet: Uniform Hash IS false
> Statistics: Num rows: 1 Data size: 394 Basic stats:
> COMPLETE Column stats: NONE
> Execution mode: vectorized, llap
> LLAP IO: all inputs
> Map Vectorization:
> enabled: true
> enabledConditionsMet:
> hive.vectorized.use.vectorized.input.format IS true
> groupByVectorOutput: true
> inputFileFormats:
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> allNative: false
> usesVectorUDFAdaptor: false
> vectorized: true
> Reducer 2
> ...
> {code}
> The standard @Explain annotation type is used. A new 'vectorization'
> annotation marks each new class and method.
> This works for FORMATTED, like the other non-vectorization EXPLAIN
> variations.
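Since FORMATTED emits the plan as JSON, the new vectorization fields can be collected programmatically. A minimal sketch (the nested input below is a synthetic stand-in, not actual Hive FORMATTED output):

```python
import json

def collect_vectorization(node, path=""):
    """Recursively gather vectorization-related fields from a parsed
    EXPLAIN FORMATTED plan (any nesting of dicts and lists)."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}" if path else key
            # Scalar fields whose names report vectorization status.
            if key in ("vectorized", "allNative", "usesVectorUDFAdaptor",
                       "notVectorizedReason", "enabled"):
                found[child] = value
            else:
                found.update(collect_vectorization(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            found.update(collect_vectorization(item, f"{path}[{i}]"))
    return found

# Synthetic example input, NOT real Hive output.
plan = json.loads("""
{"Reducer 3": {"Reduce Vectorization": {
    "enabled": true,
    "notVectorizedReason": "Data type struct not supported",
    "vectorized": false}}}
""")
print(collect_vectorization(plan))
```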
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)