[
https://issues.apache.org/jira/browse/HIVE-18524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375309#comment-16375309
]
Ke Jia commented on HIVE-18524:
-------------------------------
[~mmccline]:
HIVE-17139 mainly optimizes the vector and row expressions.
For the vector expression (for example IfExprDoubleColumnDoubleColumn.java),
If(expr1, expr2, expr3): when evaluating the child expressions (expr1, expr2 and
expr3), we first compute expr1 and store the result in
batch.cols[arg1Column], where the value is 1 if expr1 is true and 0 otherwise.
Then we compute expr2 if batch.cols[arg1Column] is 1, or expr3 otherwise. After
evaluating the child expressions, the value of the If expression is computed
from the result of expr1: if expr1 is 1, the value is expr2; otherwise the value
is expr3. I think this path will not hit an NPE like the one in HIVE-18524. If
my understanding is wrong, please tell me, thanks.
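The three steps described above can be sketched roughly as follows. This is a simplified illustration, not the Hive code: the class IfExprVectorSketch, its evaluate method, and the use of plain arrays in place of VectorizedRowBatch and its column vectors are all assumptions made for the example.

```java
import java.util.function.IntPredicate;
import java.util.function.IntToDoubleFunction;

// Hypothetical, simplified stand-in for a vectorized IF over double columns.
// Plain arrays replace Hive's batch and column vector classes.
final class IfExprVectorSketch {

    /** Evaluates IF(expr1, expr2, expr3) for rows 0..size-1. */
    static double[] evaluate(int size, IntPredicate expr1,
                             IntToDoubleFunction expr2, IntToDoubleFunction expr3) {
        long[] cond = new long[size];        // plays the role of batch.cols[arg1Column]
        double[] thenCol = new double[size]; // plays the role of batch.cols[arg2Column]
        double[] elseCol = new double[size]; // plays the role of batch.cols[arg3Column]
        double[] out = new double[size];

        // Step 1: evaluate the condition; 1 means true, 0 means false.
        for (int i = 0; i < size; i++) {
            cond[i] = expr1.test(i) ? 1 : 0;
        }

        // Step 2: evaluate only the branch selected by the condition.
        for (int i = 0; i < size; i++) {
            if (cond[i] == 1) {
                thenCol[i] = expr2.applyAsDouble(i);
            } else {
                elseCol[i] = expr3.applyAsDouble(i);
            }
        }

        // Step 3: pick the result per row based on the condition.
        for (int i = 0; i < size; i++) {
            out[i] = (cond[i] == 1) ? thenCol[i] : elseCol[i];
        }
        return out;
    }
}
```

In this sketch the skipped slots are primitive doubles, so they hold a default value rather than null, and reading them cannot throw an NPE.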
For the row expression (for example VectorUDFAdaptor.java):
We evaluate the child expressions the same way as for the vector expression
above. After evaluating the child expressions, the current implementation in
VectorUDFAdaptor gets the i-th row values batch.cols[arg1Column][i],
batch.cols[arg2Column][i] and batch.cols[arg3Column][i], wraps each of them in a
GenericUDF.DeferredObject, and passes them to GenericUDFIf.java, which evaluates
the final value of the If expression from the passed GenericUDF.DeferredObject
arguments. The exception in HIVE-18524 occurs in the phase that wraps the values
in GenericUDF.DeferredObject. For example, when the value of the If expression
is a BytesColumnVector and expr1 is 1 in the i-th row, we skip computing expr3
while evaluating the child expressions, so batch.cols[arg3Column][i] is null and
wrapping it throws an NPE. Our solution is to wrap only the satisfied value and
skip the unsatisfied one. For example, if batch.cols[arg1Column][i] is 1, we
wrap only batch.cols[arg2Column][i] and do not wrap batch.cols[arg3Column][i].
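A minimal sketch of the difference between wrapping every argument and wrapping only the selected one. The class IfAdaptorSketch and its methods are hypothetical; byte[][] arrays stand in for BytesColumnVector, and the wrap helper only imitates the copy step seen in the reported stack trace rather than reproducing VectorUDFArgDesc or GenericUDF.DeferredObject.

```java
import java.util.Arrays;

// Hypothetical stand-in for the row-mode adaptor path; not the Hive classes.
final class IfAdaptorSketch {

    /** Copies the row's bytes; throws NullPointerException if the slot was never filled. */
    static byte[] wrap(byte[][] col, int row) {
        return Arrays.copyOfRange(col[row], 0, col[row].length);
    }

    /** Wraps both branch values before choosing, so a skipped branch triggers an NPE. */
    static byte[] evalRowWrapAll(long[] cond, byte[][] thenCol, byte[][] elseCol, int row) {
        byte[] thenValue = wrap(thenCol, row);
        byte[] elseValue = wrap(elseCol, row);
        return cond[row] == 1 ? thenValue : elseValue;
    }

    /** Wraps only the branch selected by the condition and never touches the other one. */
    static byte[] evalRowWrapSelected(long[] cond, byte[][] thenCol, byte[][] elseCol, int row) {
        return cond[row] == 1 ? wrap(thenCol, row) : wrap(elseCol, row);
    }
}
```

With a row whose condition is 1 and whose else slot is null, evalRowWrapAll fails while evalRowWrapSelected returns the then value.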
This optimization gives a 17% improvement in Q06 on TPCx-BB and a +40%
improvement in complex String operations. I think this optimization is
necessary.
> Vectorization: Execution failure related to non-standard embedding of
> IfExprConditionalFilter inside VectorUDFAdaptor (Revert HIVE-17139)
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-18524
> URL: https://issues.apache.org/jira/browse/HIVE-18524
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 3.0.0
> Reporter: Matt McCline
> Assignee: Matt McCline
> Priority: Critical
> Fix For: 3.0.0
>
> Attachments: HIVE-18524.01.patch, HIVE-18524.02.patch
>
>
> {noformat}
> insert overwrite table insert_10_1
> select cast(gpa as float),
> age,
> IF(age>40,cast('2011-01-01 01:01:01' as timestamp),NULL),
> IF(LENGTH(name)>10,cast(name as binary),NULL)
> from studentnull10k
> vectorizationSchemaColumns: [0:name:string, 1:age:int, 2:gpa:double]
> ExprNodeDescs:
> UDFToFloat(gpa) (type: float),
> age (type: int),
> if((age > 40), 2011-01-01 01:01:01.0, null) (type: timestamp),
> if((length(name) > 10), CAST( name AS BINARY), null) (type: binary)
> selectExpressions:
> VectorUDFAdaptor(if((age > 40), 2011-01-01 01:01:01.0, null))
> (children: LongColGreaterLongScalar(col 1:int, val 40) -> 4:boolean)
> -> 5:timestamp,
> VectorUDFAdaptor(if((length(name) > 10), CAST( name AS BINARY), null))
> (children: LongColGreaterLongScalar(col 4:int, val 10)(children:
> StringLength(col 0:string) -> 4:int) -> 6:boolean,
> VectorUDFAdaptor(CAST( name AS BINARY)) -> 7:binary) -> 8:binary
> {noformat}
> *// Notice there is no vector expression shown for the last IF stmt.* It has
> been magically embedded inside the VectorUDFAdaptor object...
> Execution results in this call stack.
> {nocode}
> Caused by: java.lang.NullPointerException
> at java.util.Arrays.copyOfRange(Arrays.java:3521)
> at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpressionWriterFactory$9.writeValue(VectorExpressionWriterFactory.java:1101)
> at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpressionWriterFactory$VectorExpressionWriterBytes.writeValue(VectorExpressionWriterFactory.java:343)
> at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFArgDesc.getDeferredJavaObject(VectorUDFArgDesc.java:123)
> at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:211)
> at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:177)
> at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:145)
> ... 22 more
> {nocode}
> Change is due to:
> HIVE-17139: Conditional expressions optimization: skip the expression
> evaluation if the condition is not satisfied for vectorization engine. (Jia
> Ke, reviewed by Ferdinand Xu)
> Embedding a raw vector expression outside of VectorizationContext is quite
> non-standard and evidently buggy.
> [~Ferd] [~Ke Jia] I am inclined to revert this change. Comments? CC:
> [~ashutoshc] [~hagleitn]