Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181864492
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
@@ -63,115 +58,139 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
    }
  }
+  private ByteBuffer getBuffer(int length) {
+    try {
+      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
+    }
+  }
+
+
  @Override
  public final void readIntegers(int total, WritableColumnVector c, int rowId) {
-    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    for (int i = 0; i < total; i += 1) {
--- End diff ---
The reason I moved the loop out was that I didn't think using the byte[] API would actually be better. Parquet doesn't guarantee that these byte buffers are on-heap and backed by byte arrays, so we would need to copy the bytes out of the buffer into an array only to copy them again in the column vector call. Two copies (and possibly an allocation) seemed worse than moving the loop.
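
To make the trade-off concrete, here is a rough sketch of the two options (not the exact PR code; it assumes the existing WritableColumnVector putInt/putIntsLittleEndian calls and the `buffer`/`requiredBytes` from the diff above):

    // Option A: loop over the ByteBuffer directly (works for on-heap and off-heap buffers).
    for (int i = 0; i < total; i += 1) {
      c.putInt(rowId + i, buffer.getInt());
    }

    // Option B: use the bulk byte[] API, which needs a temporary array when the
    // buffer is not heap-backed.
    byte[] tmp = new byte[requiredBytes];          // possible allocation
    buffer.get(tmp);                               // copy 1: ByteBuffer -> tmp
    c.putIntsLittleEndian(rowId, total, tmp, 0);   // copy 2: tmp -> column vector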
We could detect the case where the buffers are on-heap and make the bulk call. The drawback is that it would be temporary and would be removed once ColumnVector supports ByteBuffer, and it only helps when Parquet uses on-heap buffers and Spark uses off-heap buffers. Is that worth adding to this PR? I can take a shot at it if you think it is, but I'd rather update ColumnVector and then follow up.
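
For reference, the special case I have in mind would look roughly like this (an untested sketch, relying only on ByteBuffer.hasArray()/array()/arrayOffset(); the slice from getBuffer is exactly requiredBytes long, so its position doesn't need updating afterwards):

    if (buffer.hasArray()) {
      // On-heap buffer: hand the backing array straight to the existing bulk call.
      int arrayOffset = buffer.arrayOffset() + buffer.position();
      c.putIntsLittleEndian(rowId, total, buffer.array(), arrayOffset);
    } else {
      // Off-heap buffer: fall back to the per-element loop.
      for (int i = 0; i < total; i += 1) {
        c.putInt(rowId + i, buffer.getInt());
      }
    }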
---