Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181864492
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
@@ -63,115 +58,139 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
    }
  }
+  private ByteBuffer getBuffer(int length) {
+    try {
+      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
+    }
+  }
+
+
  @Override
  public final void readIntegers(int total, WritableColumnVector c, int rowId) {
-    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    for (int i = 0; i < total; i += 1) {
--- End diff ---
The reason I moved the loop out was that I didn't think using the byte[] API would actually be better. Parquet doesn't guarantee that these byte buffers are on-heap and backed by byte arrays, so we would need to copy the bytes out of the buffer into an array only to copy them again in the column vector call. Two copies (and possibly an allocation) seemed worse than moving the loop.
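
To make the trade-off concrete, here is a rough sketch of the two options (not the exact PR code; it assumes the existing WritableColumnVector putInt/putIntsLittleEndian calls and the `buffer`/`requiredBytes` from the diff above):

    // Option A: loop over the ByteBuffer directly (works for on-heap and off-heap buffers).
    for (int i = 0; i < total; i += 1) {
      c.putInt(rowId + i, buffer.getInt());
    }

    // Option B: use the bulk byte[] API, which needs a temporary array when the
    // buffer is not heap-backed.
    byte[] tmp = new byte[requiredBytes];          // possible allocation
    buffer.get(tmp);                               // copy 1: ByteBuffer -> tmp
    c.putIntsLittleEndian(rowId, total, tmp, 0);   // copy 2: tmp -> column vector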
We could detect the case where the buffers are on-heap and make the bulk call. The drawback is that it would be temporary and would be removed once ColumnVector supports ByteBuffer, and it only helps when Parquet uses on-heap buffers and Spark uses off-heap buffers. Is that worth adding to this PR? I can take a shot at it if you think it is, but I'd rather update ColumnVector and then follow up.
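
For reference, the special case I have in mind would look roughly like this (an untested sketch, relying only on ByteBuffer.hasArray()/array()/arrayOffset(); the slice from getBuffer is exactly requiredBytes long, so its position doesn't need updating afterwards):

    if (buffer.hasArray()) {
      // On-heap buffer: hand the backing array straight to the existing bulk call.
      int arrayOffset = buffer.arrayOffset() + buffer.position();
      c.putIntsLittleEndian(rowId, total, buffer.array(), arrayOffset);
    } else {
      // Off-heap buffer: fall back to the per-element loop.
      for (int i = 0; i < total; i += 1) {
        c.putInt(rowId + i, buffer.getInt());
      }
    }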
---