Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21070#discussion_r181864492

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
@@ -63,115 +58,139 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
     }
   }

+  private ByteBuffer getBuffer(int length) {
+    try {
+      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
+    }
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int rowId) {
-    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    for (int i = 0; i < total; i += 1) {
--- End diff --

The reason I moved the loop out is that I didn't think using the byte[] API would actually be better. Parquet doesn't guarantee that these byte buffers are on the heap and backed by byte arrays, so we would need to copy the bytes out of the buffer into an array only to copy them again in the column vector call. Two copies (and possibly an allocation) seemed worse than moving the loop.

We could special-case buffers that are on-heap and make the bulk call. The drawback is that this would be temporary and would be removed once ColumnVector supports ByteBuffer. And it only works and matters when Parquet uses on-heap buffers and Spark uses off-heap buffers. Is that worth the change to this PR? I can take a shot at it if you think it is, though I'd rather update ColumnVector first and follow up.
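If we did go that route, the special case might look roughly like this (an untested sketch for discussion, not code from this PR; it relies on the existing byte[]-based putIntsLittleEndian and per-value putInt methods on WritableColumnVector):

    @Override
    public final void readIntegers(int total, WritableColumnVector c, int rowId) {
      int requiredBytes = total * 4;
      ByteBuffer buffer = getBuffer(requiredBytes);

      if (buffer.hasArray()) {
        // On-heap buffer: one bulk copy straight from the backing array.
        int srcIndex = buffer.arrayOffset() + buffer.position();
        c.putIntsLittleEndian(rowId, total, buffer.array(), srcIndex);
      } else {
        // Off-heap buffer: a bulk call would mean copying into a temporary
        // byte[] first (two copies plus an allocation), so read each
        // little-endian int directly from the buffer instead.
        for (int i = 0; i < total; i += 1) {
          c.putInt(rowId + i, buffer.getInt());
        }
      }
    }

Since getBuffer slices a fresh buffer per call, neither branch needs to restore the buffer's position afterward.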