Rajkumar Singh created HIVE-21935: ------------------------------------- Summary: Hive Vectorization : Server performance issue with vectorize UDF Key: HIVE-21935 URL: https://issues.apache.org/jira/browse/HIVE-21935 Project: Hive Issue Type: Bug Components: Vectorization Affects Versions: 3.1.1 Environment: Hive-3, JDK-8 Reporter: Rajkumar Singh
with vectorization turned on and hive.vectorized.adaptor.usage.mode=all we were seeing severe performance degradation. looking at the task jstacks it seems that it is running the code which vectorizes UDF and stuck in some loop. {code:java} jstack -l 14954 | grep 0x3af0 -A20 "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:573) at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150) at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271) at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) [yarn@hdp32b ~]$ jstack -l 14954 | grep 0x3af0 -A20 "TezChild" #15 daemon prio=5 os_prio=0 tid=0x00007f157538d800 nid=0x3af0 runnable [0x00007f1547581000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector.ensureSize(BytesColumnVector.java:554) at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:570) at org.apache.hadoop.hive.ql.exec.vector.VectorAssignRow.assignRowColumn(VectorAssignRow.java:350) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.setResult(VectorUDFAdaptor.java:205) at org.apache.hadoop.hive.ql.exec.vector.udf.VectorUDFAdaptor.evaluate(VectorUDFAdaptor.java:150) at org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression.evaluateChildren(VectorExpression.java:271) at org.apache.hadoop.hive.ql.exec.vector.expressions.ListIndexColScalar.evaluate(ListIndexColScalar.java:59) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:146) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:965) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:938) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:125) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:889) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:92) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:76) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:426) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) {code} after setting the hive.vectorized.adaptor.usage.mode=none query did complete much faster. Steps To Reproduce: 1. Create Table: {code} +----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE EXTERNAL TABLE `splittestloc`( | | `id` int, | | `value` string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | | WITH SERDEPROPERTIES ( | | 'field.delim'=',', | | 'serialization.format'=',') | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.mapred.TextInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | | LOCATION | | 'hdfs://hdp31a.hdp.local:8020/tmp/splittableloc' | | TBLPROPERTIES ( | | 'bucketing_version'='2', | | 'transient_lastDdlTime'='1561482451') | +----------------------------------------------------+ {code} 2. Sample data: table has some 40M rows and sample data is generated using following script. {code} for i in {1..40000000} ; do echo $i,"start#mid#"$i >> data.log ; done {code} 3. I believe this should be reproducible with hive generic split but I am attaching the custom UDF to split the string. 4. create a function {code} add jar /tmp/CustomSplit-1.0-SNAPSHOT.jar; create temporary function mysplit as 'com.rajkrrsingh.split.test.CustomSplit' {code} 5. run the following query which will reproduce the issue if vectorization turned on. {code} create temporary table tmp2 as select id,mysplit(value,"#")[2] from splittestloc {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)