[ 
https://issues.apache.org/jira/browse/HIVE-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated HIVE-8474:
-----------------------------
    Attachment: HIVE-8474.patch

This patch makes several changes in vectorization.  [~mmccline] and 
[~ashutoshc], I am not very familiar with this code and I know it is very 
performance sensitive, so I would appreciate your feedback on the patch.

The issue causing problems was that VectorizedBatchUtil.addRowToBatchFrom is 
used by VectorizedOrcAcidRowReader to take the merged rows from an ACID read 
and put them in a vector batch.  But this method appears to have been built 
for vector operators, not for file formats, where columns may be missing 
because they have been projected out or may already have values set because 
they are partition columns.  So I made the following changes:
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that 
are null.  This handles the case where columns have been projected out and thus 
the ColumnVector is null.
# I changed VectorizedRowBatch to have a boolean array to track which columns 
are partition columns and VectorizedRowBatchCtx.createVectorizedRowBatch to 
populate this array.
# I changed addRowToBatchFrom to skip writing values into ColumnVectors that 
are marked in VectorizedRowBatch as partition columns, since writing them 
would overwrite the values that have already been put there by 
VectorizedRowBatchCtx.addPartitionColsToBatch.
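
As a rough illustration of the skip logic described in the list above (this is 
not the code in the patch), here is a self-contained sketch.  LongColumn, 
Batch, and addRow are hypothetical stand-ins for ColumnVector, 
VectorizedRowBatch, and addRowToBatchFrom, and the row is simplified to carry 
one slot per batch column.

```java
import java.util.Arrays;

public class SkipColumnsSketch {

    // Hypothetical stand-in for a column of long values; not Hive's ColumnVector.
    public static final class LongColumn {
        public final long[] values;

        public LongColumn(int capacity) {
            values = new long[capacity];
        }
    }

    // Hypothetical stand-in for a row batch. A null entry in cols models a
    // column that was projected out; isPartitionCol marks columns whose
    // values were filled in before any rows were added.
    public static final class Batch {
        public final LongColumn[] cols;
        public final boolean[] isPartitionCol;
        public int size;

        public Batch(LongColumn[] cols, boolean[] isPartitionCol) {
            this.cols = cols;
            this.isPartitionCol = isPartitionCol;
        }
    }

    // Copies one row into the batch, skipping projected-out and partition
    // columns. Simplification: the row has one slot per batch column, and
    // the slots for skipped columns are ignored.
    public static void addRow(long[] row, Batch batch) {
        int rowIndex = batch.size;
        for (int i = 0; i < batch.cols.length; i++) {
            if (batch.cols[i] == null) {
                continue; // projected out: there is no vector to write into
            }
            if (batch.isPartitionCol[i]) {
                continue; // partition column: keep the value already set
            }
            batch.cols[i].values[rowIndex] = row[i];
        }
        batch.size++;
    }

    public static void main(String[] args) {
        // Column 0 is selected, column 1 is projected out,
        // column 2 is a partition column that was populated up front.
        LongColumn partition = new LongColumn(2);
        Arrays.fill(partition.values, 2014L);
        Batch batch = new Batch(
            new LongColumn[] {new LongColumn(2), null, partition},
            new boolean[] {false, false, true});

        addRow(new long[] {10L, 20L, 0L}, batch);
        addRow(new long[] {11L, 21L, 0L}, batch);

        System.out.println("rows=" + batch.size
            + " col0=" + Arrays.toString(batch.cols[0].values)
            + " partition=" + Arrays.toString(batch.cols[2].values));
    }
}
```

Without the null check, the write into column 1 would dereference a null 
vector, which is the same kind of failure as the NullPointerException in the 
issue description.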

My concern is whether it is appropriate to mix the skipping of projected-out 
and partition columns into addRowToBatchFrom.  If you think it isn't, I can 
write a new method to do this, but that will involve a fair amount of 
duplicate code.

[~owen.omalley], I also changed VectorizedOrcAcidRowReader to set the partition 
column values after every call to VectorizedRowBatch.reset in next().  Without 
this the code was throwing a NullPointerException later in the pipeline 
because the partition column had been set to null.  It appears that you had 
copied the code from VectorizedOrcInputFormat, which called 
addPartitionColsToBatch only once but never called reset.  I tried removing 
the call to reset but that caused other issues.
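
To make the reset ordering concrete, here is a toy model (not the patch, and 
not Hive's classes).  Batch, Reader, and the partition value "p1" are 
hypothetical, and the toy reset() is assumed to clear the partition value, 
mirroring the behavior described above.

```java
public class ResetThenRefillSketch {

    // Hypothetical batch: one data column plus a single partition value.
    public static final class Batch {
        public final long[] data = new long[4];
        public String partitionValue; // null means "not set"
        public int size;

        // Toy reset: clears the row count and, by assumption, the partition value.
        public void reset() {
            size = 0;
            partitionValue = null;
        }
    }

    // Hypothetical reader that fills a batch from an in-memory array.
    public static final class Reader {
        private final long[] source;
        private final String partitionValue;
        private int pos;

        public Reader(long[] source, String partitionValue) {
            this.source = source;
            this.partitionValue = partitionValue;
        }

        // Fills the batch with the next rows; returns false once no rows remain.
        public boolean next(Batch batch) {
            batch.reset();
            // Re-apply the partition value on every call, because reset() wiped it.
            batch.partitionValue = partitionValue;
            while (pos < source.length && batch.size < batch.data.length) {
                batch.data[batch.size++] = source[pos++];
            }
            return batch.size > 0;
        }
    }

    public static void main(String[] args) {
        Reader reader = new Reader(new long[] {1, 2, 3, 4, 5}, "p1");
        Batch batch = new Batch();
        while (reader.next(batch)) {
            System.out.println(batch.size + " rows, partition=" + batch.partitionValue);
        }
    }
}
```

Setting the partition value once before the loop would leave it null from the 
first reset onward in this model, which is why the assignment sits inside 
next().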

> Vectorized reads of transactional tables fail when not all columns are 
> selected
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-8474
>                 URL: https://issues.apache.org/jira/browse/HIVE-8474
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions, Vectorization
>    Affects Versions: 0.14.0
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8474.patch
>
>
> {code}
> create table concur_orc_tab(name varchar(50), age int, gpa decimal(3, 2)) 
> clustered by (age) into 2 buckets stored as orc TBLPROPERTIES 
> ('transactional'='true');
> select name, age from concur_orc_tab order by name;
> {code}
> results in
> {code}
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.NullPointerException
>         at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>         at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:352)
>         at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>         at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:115)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setNullColIsNullValue(VectorizedBatchUtil.java:63)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatchFrom(VectorizedBatchUtil.java:443)
>         at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.addRowToBatch(VectorizedBatchUtil.java:214)
>         at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:95)
>         at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:43)
>         at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:347)
>         ... 13 more
> {code}
> The issue is that the object inspector passed to VectorizedOrcAcidRowReader 
> has all of the columns in the file rather than only the projected columns.



