[
https://issues.apache.org/jira/browse/HIVE-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates updated HIVE-8474:
-----------------------------
Attachment: HIVE-8474.patch
This patch makes several changes in vectorization. [~mmccline] and
[~ashutoshc], as I am not very familiar with this code and as I know the code
is very performance sensitive I would appreciate your feedback on the patch.
The issue causing problems was that VectorizedBatchUtil.addRowToBatchFrom is
used by VectorizedOrcAcidRowReader to take the merged rows from an acid read
and put them in a vector batch. But this method appears to have been built for
use by vector operators, not by file formats, where columns may be missing
because they have been projected out or may already have values set because
they are partition columns. So I made the following changes:
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that
are null. This handles the case where columns have been projected out and thus
the ColumnVector is null.
# I changed VectorizedRowBatch to have a boolean array that tracks which
columns are partition columns, and VectorizedRowBatchCtx.createVectorizedRowBatch
to populate this array.
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that
are marked in VectorizedRowBatch as partition columns, since writing to them
would overwrite the values already put there by
VectorizedRowBatchCtx.addPartitionColsToBatch. (See the sketch after this list
for the combined effect of changes 1 and 3.)
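To make the combined effect of changes 1 and 3 concrete, the guard at the top
of the per-column loop in addRowToBatchFrom ends up looking roughly like this.
This is a sketch only; the name of the new boolean array (partitionCols here)
is illustrative, and the attached patch is authoritative:
{code}
// Sketch of the per-column guard in VectorizedBatchUtil.addRowToBatchFrom.
// "partitionCols" stands in for whatever the new boolean array on
// VectorizedRowBatch ends up being called.
for (int i = 0; i < fieldRefs.size(); i++) {
  int colIndex = i + colOffset;
  ColumnVector colVector = batch.cols[colIndex];
  if (colVector == null) {
    // Column was projected out, so there is no vector to write into.
    continue;
  }
  if (batch.partitionCols != null && batch.partitionCols[colIndex]) {
    // Value was already filled in by
    // VectorizedRowBatchCtx.addPartitionColsToBatch; don't overwrite it.
    continue;
  }
  // ... existing per-type code that copies the field value into colVector ...
}
{code}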
My concern is whether it is appropriate to mix this functionality for skipping
projected-out and partition columns into addRowToBatchFrom. If you think it
doesn't belong there, I can write a new method to do this, but that will
involve a fair amount of duplicated code.
[~owen.omalley], I also changed VectorizedOrcAcidRowReader to set the partition
column values after every call to VectorizedRowBatch.reset in next. Without
this, the code was NPEing later in the pipeline because the partition column
had been set to null. It appears the code was copied from
VectorizedOrcInputFormat, which only calls addPartitionColsToBatch once but
also never calls reset. I tried removing the call to reset, but that caused
other issues.
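For clarity, the ordering in VectorizedOrcAcidRowReader.next after this change
is roughly the following. This is a simplified sketch: the first-row check and
error handling are elided, and field names such as rowBatchCtx, innerReader,
and buffer follow the existing class but may not match the patch exactly:
{code}
// Simplified sketch of VectorizedOrcAcidRowReader.next after the change.
vectorizedRowBatch.reset();
// reset() clears every column, including the partition columns, so they
// must be repopulated here; otherwise downstream operators NPE on the
// nulled-out partition ColumnVector.
rowBatchCtx.addPartitionColsToBatch(vectorizedRowBatch);
while (vectorizedRowBatch.size < vectorizedRowBatch.selected.length
    && innerReader.next(key, value)) {
  // Copy the merged acid row into the batch, skipping projected-out and
  // partition columns as described above.
  VectorizedBatchUtil.addRowToBatch(value, (StructObjectInspector) objectInspector,
      vectorizedRowBatch.size++, vectorizedRowBatch, buffer);
}
{code}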
> Vectorized reads of transactional tables fail when not all columns are selected
> -------------------------------------------------------------------------------
>
> Key: HIVE-8474
> URL: https://issues.apache.org/jira/browse/HIVE-8474
> Project: Hive
> Issue Type: Bug
> Components: Transactions, Vectorization
> Affects Versions: 0.14.0
> Reporter: Alan Gates
> Assignee: Alan Gates
> Priority: Critical
> Fix For: 0.14.0
>
> Attachments: HIVE-8474.patch
>
>
> {code}
> create table concur_orc_tab(name varchar(50), age int, gpa decimal(3, 2))
> clustered by (age) into 2 buckets stored as orc TBLPROPERTIES
> ('transactional'='true');
> select name, age from concur_orc_tab order by name;
> {code}
> results in
> {code}
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.NullPointerException
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
> at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
> at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
> at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setNullColIsNullValue(VectorizedBatchUtil.java:63)
> at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:443)
> at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatch(VectorizedBatchUtil.java:214)
> at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:95)
> at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:43)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
> ... 13 more
> {code}
> The issue is that the object inspector passed to VectorizedOrcAcidRowReader
> has all of the columns in the file rather than only the projected columns.