Github user rdblue commented on a diff in the pull request:
    --- Diff: 
    @@ -63,115 +58,139 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
    +  private ByteBuffer getBuffer(int length) {
    +    try {
    +      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
    +    } catch (IOException e) {
      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
    +    }
    +  }
       public final void readIntegers(int total, WritableColumnVector c, int rowId) {
    -    c.putIntsLittleEndian(rowId, total, buffer, offset - 
    -    offset += 4 * total;
    +    int requiredBytes = total * 4;
    +    ByteBuffer buffer = getBuffer(requiredBytes);
    +    for (int i = 0; i < total; i += 1) {
    --- End diff ---
    The reason I moved the loop out is that I didn't think using the byte[] API would actually be better. Parquet doesn't guarantee that these byte buffers are on the heap and backed by byte arrays, so we would need to copy the bytes out of the buffer into an array only to copy them again in the column vector call. Two copies (and possibly an allocation) seemed worse than moving the loop.
    We could catch the case where the buffers are on-heap and make the bulk call. The drawback is that this would be temporary and would be removed once ColumnVector supports ByteBuffer. And it only matters when Parquet uses on-heap buffers and Spark uses off-heap buffers. Is that worth adding to this PR? I can take a shot at it if you think it is, but I'd rather update ColumnVector and then follow up.
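    The on-heap fast path described above can be sketched roughly as follows. This is a minimal illustration, not Spark's actual reader: the class and method names are made up, and the manual little-endian decode stands in for a bulk putIntsLittleEndian-style call. The `hasArray()` / `array()` / `arrayOffset()` calls are the standard `java.nio.ByteBuffer` API for detecting a heap-backed buffer; the buffer is assumed to already be in little-endian order, as in the diff.

    ```java
    import java.nio.ByteBuffer;

    public class BufferCopyDemo {
        // Reads 'total' little-endian ints from the buffer's current position.
        static int[] readInts(ByteBuffer buffer, int total) {
            int[] out = new int[total];
            if (buffer.hasArray()) {
                // On-heap buffer: the backing byte[] is directly accessible, so a
                // bulk byte[]-based call could consume it without an extra copy.
                byte[] bytes = buffer.array();
                int base = buffer.arrayOffset() + buffer.position();
                for (int i = 0; i < total; i++) {
                    out[i] = (bytes[base + 4 * i] & 0xFF)
                           | (bytes[base + 4 * i + 1] & 0xFF) << 8
                           | (bytes[base + 4 * i + 2] & 0xFF) << 16
                           | (bytes[base + 4 * i + 3] & 0xFF) << 24;
                }
                buffer.position(buffer.position() + 4 * total);
            } else {
                // Direct (off-heap) buffer: no backing array, so either read one
                // value at a time (the per-element loop the PR uses) or copy the
                // bytes into a temporary array first -- the extra copy at issue.
                for (int i = 0; i < total; i++) {
                    out[i] = buffer.getInt();
                }
            }
            return out;
        }
    }
    ```

    Both branches produce the same values; the point is that the bulk byte[] path is only available when `hasArray()` is true, which Parquet does not guarantee.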