Re: Review Request: Change ORC tree readers to return batches of rows instead of a row

Eric Hanson Tue, 23 Apr 2013 12:33:31 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10712/#review19568
-----------------------------------------------------------



I didn't get a chance to go through the whole thing -- here are just some 
initial comments.


ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java
<https://reviews.apache.org/r/10712/#comment40418>

    should be totalLength not totalLenght



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java
<https://reviews.apache.org/r/10712/#comment40420>

    the previous argument and the return value are always of class 
VectorizedRowBatch so why not declare them to be VectorizedRowBatch instead of 
Object?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40461>

    put parens around expression on right side of =, for clarity



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40462>

    if noNulls is false, that means there are nulls, so you can't set noNulls 
back to true, can you?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40423>

    You should never have to create a new column vector here because there 
should always be one passed down into this code as part of the batch.
    
    Consider assuming there will always be a vector. The code will just fail if 
there isn't one.



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40463>

    It appears you are replacing one isNull array (the one passed in) with 
another. We should be writing null information into the same Boolean array over 
and over. Am I missing something?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40424>

    this code for IntTreeReader appears identical to the ShortTreeReader case. 
Can the code be shared, e.g. by inheriting from super-class?
    
    Since this is a vector, consider calling it nextVector



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40425>

    same comment -- can the code be shared for all int types?



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40474>

    for float/double, vector entries are supposed to be set to NaN for null 
values.



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40478>

    I think this could incorrectly decide something isRepeating if there are 
nulls, the way the code is now. But using NaN for nulls should prevent this. 
    
    I think this isRepeating check is probably worth it for performance but it 
might not really pay off. It might be worth a performance test sometime in the 
future to determine this.



ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
<https://reviews.apache.org/r/10712/#comment40486>

    I'd said that NULL string entries should be set to the empty string in the 
vector. I'm not sure if any code depends on that though. So this is probably 
okay to not set null entries.



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40427>

    Please add comment to say what is the purpose of the file created



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40489>

    Should create the batch up here, not use a null batch. 
    
    Or at least have a second test that creates the batch outside and passes it 
down.



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40490>

    style guidelines say put a blank line in front of all comments



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40445>

    I think this should take and return VectorizedRowBatch since it would not 
make sense for it to be another type -- the method name even says "Batch"
    



ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java
<https://reviews.apache.org/r/10712/#comment40488>

    should add another test with some null values


- Eric Hanson


On April 22, 2013, 10:26 p.m., Sarvesh Sakalanaga wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10712/
> -----------------------------------------------------------
> 
> (Updated April 22, 2013, 10:26 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Description
> -------
> 
> The patch contains changes to ORC reader to return a batch of rows instead of 
> a row. A new method called nextBatch() is added to ORC reader and tree 
> readers of ORC. Currently only int,long,short,double,float,string and struct 
> support batch processing.
> 
> 
> This addresses bug HIVE-4370.
>     https://issues.apache.org/jira/browse/HIVE-4370
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/vector/BytesColumnVector.java 
> 246170d 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicByteArray.java fc4e53b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReader.java 05240ce 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java d044cd8 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java 
> 2825c64 
>   ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedORCReader.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/10712/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sarvesh Sakalanaga
> 
>

Re: Review Request: Change ORC tree readers to return batches of rows instead of a row

Reply via email to