Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22348#discussion_r215860694
  
    --- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ---
    @@ -154,8 +159,6 @@ public void close() throws IOException {
     
       @Override
       public boolean nextKeyValue() throws IOException {
    -    resultBatch();
    --- End diff --
    
    I like this direction to reduce overhead. I confirmed that the current code 
calls `initBatch()` before calling `nextKeyValue()`, too.  
    My question is that is calling sequence is mandatory or optional. In the 
future, I wonder a developer may call this method thru `hasNext()` before 
calling `initBatch()`.
    
    How about moving `if-statement` into here like? Does this help performance 
improvement?
    ```
    if (columnarBatch == null) resultBatch();
    ```
    I am curious how orc reader handle this.
    cc @dongjoon-hyun @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to