SongYadong created SPARK-25354:
----------------------------------

             Summary: Parquet vectorized record reader has unneeded operation 
in several methods
                 Key: SPARK-25354
                 URL: https://issues.apache.org/jira/browse/SPARK-25354
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: SongYadong


The VectorizedParquetRecordReader class performs unneeded operations in its 
nextKeyValue() method and in the methods called from it:

1. In nextKeyValue(), resultBatch() is called only to initialize the columnar 
batch if it is not yet initialized; the returned batch is not used. So we can 
move the initBatch() operation to nextBatch().

2. In nextBatch(), we need not reset columnVectors every time. When 
rowsReturned >= totalRowCount, the method returns immediately and the reset 
cost is wasted. So we can put "if (rowsReturned >= totalRowCount) return false;" 
before the columnVectors reset for performance.

3. In nextBatch(), we need not call checkEndOfRowGroup() every time. When 
rowsReturned != totalCountLoadedSoFar, checkEndOfRowGroup() does nothing but 
return, so we can call it only when rowsReturned == totalCountLoadedSoFar to 
reduce function calls.

4. In checkEndOfRowGroup(), we need not get the columns of requestedSchema 
every time. We can get the columns once, on the first call, and save them for 
later use.
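
The four points above can be sketched with a simplified stand-in class. This 
is not Spark's VectorizedParquetRecordReader: the class name 
BatchReaderSketch, the int[] batch, the column names, and the call counters 
are all hypothetical and exist only to make the saved work visible. It also 
assumes every row group holds at least one row.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a vectorized reader. This is NOT Spark's
// VectorizedParquetRecordReader; names and counters are illustrative only.
class BatchReaderSketch {
    private final long[] rowGroupSizes; // rows per row group (assumed all > 0)
    private final int capacity;         // maximum rows per batch
    private final long totalRowCount;

    private long rowsReturned = 0;
    private long totalCountLoadedSoFar = 0;
    private int nextRowGroup = 0;
    private int[] batch;                // stands in for the columnar batch
    private List<String> columns;       // stands in for requestedSchema.getColumns()

    // Counters that make the saved work observable.
    int initBatchCalls = 0, resetCalls = 0, rowGroupLoads = 0, columnLookups = 0;
    int numRowsInBatch = 0;

    BatchReaderSketch(long[] rowGroupSizes, int capacity) {
        this.rowGroupSizes = rowGroupSizes.clone();
        this.capacity = capacity;
        this.totalRowCount = Arrays.stream(rowGroupSizes).sum();
    }

    private void initBatch() { batch = new int[capacity]; initBatchCalls++; }

    private void resetBatch() { Arrays.fill(batch, 0); resetCalls++; }

    private void loadNextRowGroup() {
        // Point 4: look up the columns once and reuse them afterwards.
        if (columns == null) {
            columns = Arrays.asList("a", "b"); // hypothetical column names
            columnLookups++;
        }
        totalCountLoadedSoFar += rowGroupSizes[nextRowGroup++];
        rowGroupLoads++;
    }

    boolean nextBatch() {
        // Point 2: test for end of input before paying for a reset.
        if (rowsReturned >= totalRowCount) return false;
        // Point 1: lazy batch initialization lives here, not in nextKeyValue().
        if (batch == null) initBatch();
        resetBatch();
        // Point 3: only advance to a new row group when the current one is used up.
        if (rowsReturned == totalCountLoadedSoFar) loadNextRowGroup();
        int num = (int) Math.min(capacity, totalCountLoadedSoFar - rowsReturned);
        rowsReturned += num;
        numRowsInBatch = num;
        return true;
    }

    public static void main(String[] args) {
        BatchReaderSketch reader = new BatchReaderSketch(new long[] {5, 3}, 4);
        while (reader.nextBatch()) {
            System.out.println("batch of " + reader.numRowsInBatch + " rows");
        }
        System.out.println("resets=" + reader.resetCalls
            + ", rowGroupLoads=" + reader.rowGroupLoads
            + ", columnLookups=" + reader.columnLookups);
    }
}
```

In this sketch the final nextBatch() call, which returns false, performs no 
reset, and the column lookup runs once no matter how many row groups are read.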

According to analysis of a Spark application with the JMC tool, the Parquet 
vectorized record reader calls nextKeyValue() and the functions beneath it very 
frequently, so the performance gain from optimizing this path is worthwhile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
