Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r215860694
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
I like this direction to reduce overhead. I confirmed that the current code
calls `initBatch()` before calling `nextKeyValue()`, too.
My question is that is calling sequence is mandatory or optional. In the
future, I wonder a developer may call this method thru `hasNext()` before
calling `initBatch()`.
How about moving `if-statement` into here like? Does this help performance
improvement?
```
if (columnarBatch == null) resultBatch();
```
I am curious how orc reader handle this.
cc @dongjoon-hyun @cloud-fan
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]