[GitHub] spark pull request #22348: Reduce unneeded operation in nextKeyValue process...

SongYadong Thu, 06 Sep 2018 01:32:37 -0700

GitHub user SongYadong opened a pull request:

    https://github.com/apache/spark/pull/22348


    Reduce unneeded operation in nextKeyValue process of parquet vectorized 
record reader

    ## What changes were proposed in this pull request?
    
    This PR makes the following changes in the VectorizedParquetRecordReader class:
    
    1. In the nextKeyValue() method, resultBatch() is called only to initialize
    the columnar batch if it is not yet initialized; the returned batch is not
    used. So we move the initBatch() operation to nextBatch().
    
    2. In the nextBatch() method, we need not reset columnVectors every time.
    When rowsReturned >= totalRowCount, the function returns immediately and the
    cost of the reset is wasted. So we put
    "if (rowsReturned >= totalRowCount) return false;" before the columnVectors
    reset for performance.
    
    3. In the nextBatch() method, we need not call checkEndOfRowGroup() every
    time. When rowsReturned != totalCountLoadedSoFar, checkEndOfRowGroup() does
    nothing but return, so we call it only when
    rowsReturned == totalCountLoadedSoFar to avoid unnecessary function calls.
    
    4. In the checkEndOfRowGroup() function, we need not get the columns of
    requestedSchema every time. We get the columns only the first time and save
    them for later use.
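
    The control flow described in items 1-4 can be sketched as below. This is
    not the actual Spark code: BatchReaderSketch, its int[] "batch", the
    rowGroupSizes array and the counter fields are hypothetical stand-ins so
    that the example runs without Parquet, and the exact position of the lazy
    initialization inside nextBatch() is an assumption. Only the names
    nextKeyValue, nextBatch, initBatch, checkEndOfRowGroup, rowsReturned,
    totalRowCount and totalCountLoadedSoFar come from the description above.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for the reader; models control flow only, no Parquet I/O. */
class BatchReaderSketch {
    private final int[] rowGroupSizes;      // hypothetical: row count of each row group
    private final int capacity;             // maximum rows per batch
    private final long totalRowCount;

    private long rowsReturned = 0;
    private long totalCountLoadedSoFar = 0;
    private int nextRowGroup = 0;

    private int[] batch;                    // stands in for the columnar batch
    private List<String> columns;           // stands in for the requestedSchema columns
    int numValid = 0;                       // rows in the current batch

    // Counters that exist only to make the saved work observable.
    int initCount = 0, resetCount = 0, rowGroupLoads = 0, columnLookups = 0;

    BatchReaderSketch(int[] rowGroupSizes, int capacity) {
        this.rowGroupSizes = rowGroupSizes.clone();
        this.capacity = capacity;
        this.totalRowCount = Arrays.stream(rowGroupSizes).asLongStream().sum();
    }

    boolean nextKeyValue() {
        // Item 1: no resultBatch() call here; initialization happens in nextBatch().
        return nextBatch();
    }

    boolean nextBatch() {
        if (batch == null) initBatch();                   // item 1: lazy, one-time init
        if (rowsReturned >= totalRowCount) return false;  // item 2: return before the reset
        Arrays.fill(batch, 0);                            // stands in for the columnVectors reset
        resetCount++;
        // Item 3: only enter checkEndOfRowGroup() when the loaded rows are used up.
        if (rowsReturned == totalCountLoadedSoFar) checkEndOfRowGroup();
        int num = (int) Math.min(capacity, totalCountLoadedSoFar - rowsReturned);
        rowsReturned += num;
        numValid = num;
        return true;
    }

    private void initBatch() {
        batch = new int[capacity];
        initCount++;
    }

    private void checkEndOfRowGroup() {
        if (nextRowGroup >= rowGroupSizes.length) {
            throw new IllegalStateException("no more row groups to read");
        }
        if (columns == null) {                            // item 4: look up columns once, then reuse
            columns = Arrays.asList("col_a", "col_b");    // hypothetical column names
            columnLookups++;
        }
        totalCountLoadedSoFar += rowGroupSizes[nextRowGroup++];
        rowGroupLoads++;
    }
}
```

    With two row groups of 3 and 2 rows and a batch capacity of 2, the sketch
    returns three batches, resets the batch three times (not on the final call
    that returns false), enters checkEndOfRowGroup() twice and looks up the
    columns once.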
    
    According to an analysis of a Spark application with the JMC tool, the
    Parquet vectorized record reader calls nextKeyValue() and the functions
    beneath it very frequently, so the performance gain from optimizing this
    path is worth pursuing.
    
    
    ## How was this patch tested?
    
    1. Existing Unit Tests
    2. A test run of 4885 Spark applications.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SongYadong/spark parquet_vectorized_read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22348.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22348
    
----
commit db5799b3eed27103bdaf50ac10b1da6758987619
Author: SongYadong <song.yadong1@...>
Date:   2018-09-06T08:16:34Z

    Reduce unneeded operation in nextKeyValue process of parquet vectorized 
record reader

----

