Github user henryr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181882476
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java ---
@@ -63,115 +58,139 @@ public final void readBooleans(int total, WritableColumnVector c, int rowId) {
     }
   }

+  private ByteBuffer getBuffer(int length) {
+    try {
+      return in.slice(length).order(ByteOrder.LITTLE_ENDIAN);
+    } catch (IOException e) {
+      throw new ParquetDecodingException("Failed to read " + length + " bytes", e);
+    }
+  }
+
   @Override
   public final void readIntegers(int total, WritableColumnVector c, int rowId) {
-    c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET);
-    offset += 4 * total;
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+
+    for (int i = 0; i < total; i += 1) {
--- End diff --
Isn't that what `hasArray()` is for though? If the buffers are backed by a
byte array, `hasArray()` returns true and accessing the byte array via
`array()` should be 0 cost. (If `array()` actually copies any data, that would
invalidate this line of reasoning but would also be unexpected).
So for example, here you'd have:
public final void readIntegers(int total, WritableColumnVector c, int rowId) {
  int requiredBytes = total * 4;
  ByteBuffer buffer = getBuffer(requiredBytes);

  if (buffer.hasArray()) {
    // srcIndex has to account for where the slice starts in its backing array.
    c.putIntsLittleEndian(rowId, total, buffer.array(),
        buffer.arrayOffset() + buffer.position());
  } else {
    for (int i = 0; i < total; i += 1) {
      c.putInt(rowId + i, buffer.getInt());
    }
  }
}
This seems to be the same pattern that's in `readBinary()`, below. Let me
know if I'm missing something!
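
For what it's worth, here's a minimal standalone sketch of that assumption (a hypothetical `HeapBufferCheck` class, not part of this PR): for a heap buffer, `hasArray()` is true and `array()` returns the backing array itself without copying, but a slice's data starts at `arrayOffset() + position()` rather than index 0, which is why the `srcIndex` in the snippet above accounts for it.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeapBufferCheck {
  public static void main(String[] args) {
    byte[] backing = new byte[16];
    ByteBuffer whole = ByteBuffer.wrap(backing).order(ByteOrder.LITTLE_ENDIAN);

    // Slice off everything after the first 4 bytes, as a stream-backed reader might.
    whole.position(4);
    ByteBuffer slice = whole.slice().order(ByteOrder.LITTLE_ENDIAN);

    System.out.println(slice.hasArray());                       // true: heap-backed
    System.out.println(slice.array() == backing);               // true: same array, no copy
    System.out.println(slice.arrayOffset() + slice.position()); // 4: data does not start at index 0
  }
}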
---