Github user henryr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181882476
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
@@ -63,115 +58,139 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
     }
   }

+  private ByteBuffer getBuffer(int length) {
+    try {
+      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
+    }
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int rowId) {
-    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    for (int i = 0; i < total; i += 1) {
--- End diff --
Isn't that what `hasArray()` is for though? If the buffers are backed by a
byte array, `hasArray()` returns true and accessing the byte array via
`array()` should be 0 cost. (If `array()` actually copies any data, that would
invalidate this line of reasoning but would also be unexpected).
So for example, here you'd have:
public final void readIntegers(int total, WritableColumnVector c, int rowId) {
  int requiredBytes = total * 4;
  ByteBuffer buffer = getBuffer(requiredBytes);

  if (buffer.hasArray()) {
    // srcIndex has to account for where the slice starts in its backing array.
    c.putIntsLittleEndian(rowId, total, buffer.array(),
        buffer.arrayOffset() + buffer.position());
  } else {
    for (int i = 0; i < total; i += 1) {
      c.putInt(rowId + i, buffer.getInt());
    }
  }
}
This seems to be the same pattern that's in `readBinary()`, below. Let me
know if I'm missing something!
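
For what it's worth, here's a minimal standalone sketch of that assumption (a hypothetical `HeapBufferCheck` class, not part of this PR): for a heap buffer, `hasArray()` is true and `array()` returns the backing array itself without copying, but a slice's data starts at `arrayOffset() + position()` rather than index 0, which is why the `srcIndex` in the snippet above accounts for it.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeapBufferCheck {
  public static void main(String[] args) {
    byte[] backing = new byte[16];
    ByteBuffer whole = ByteBuffer.wrap(backing).order(ByteOrder.LITTLE_ENDIAN);

    // Slice off everything after the first 4 bytes, as a stream-backed reader might.
    whole.position(4);
    ByteBuffer slice = whole.slice().order(ByteOrder.LITTLE_ENDIAN);

    System.out.println(slice.hasArray());                       // true: heap-backed
    System.out.println(slice.array() == backing);               // true: same array, no copy
    System.out.println(slice.arrayOffset() + slice.position()); // 4: data does not start at index 0
  }
}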
---