GitHub user bersprockets opened a pull request:

    https://github.com/apache/spark/pull/22188

    [SPARK-25164][SQL] Avoid rebuilding column and path list for each column in 
parquet reader

    ## What changes were proposed in this pull request?
    
    VectorizedParquetRecordReader::initializeInternal rebuilds the column list 
and the path list once per column, so for each parquet file it indirectly 
iterates about 2 * colCount * colCount times over list elements.
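Concretely, for the 6000-column tables benchmarked below, that is about 
2 * 6000 * 6000 = 72 million list-element visits per file.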
    
    This inefficiency impacts jobs that read parquet-backed tables with many 
columns and many files. Jobs that read tables with few columns or few files are 
not impacted.
    
    This PR changes initializeInternal so that it builds each list only once.
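
    The sketch below illustrates the pattern rather than the actual diff. In 
parquet-mr, <code>MessageType.getPaths()</code> and 
<code>MessageType.getColumns()</code> rebuild their lists on every call, so 
hoisting the two calls out of the per-column loop removes the quadratic work. 
The class name and the <code>checkColumn</code> helper are hypothetical 
stand-ins for the per-column checks in <code>initializeInternal</code>.
    <pre>
    import java.util.List;
    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.schema.MessageType;

    class InitSketch {
      // Hypothetical stand-in for the per-column checks.
      static void checkColumn(String[] path, ColumnDescriptor desc) { }

      // Before: each getPaths()/getColumns() call rebuilds the whole list,
      // so the loop visits 2 * colCount * colCount list elements in total.
      static void perColumnRebuild(MessageType requestedSchema) {
        for (int i = 0; i < requestedSchema.getFieldCount(); i++) {
          checkColumn(requestedSchema.getPaths().get(i),
                      requestedSchema.getColumns().get(i));
        }
      }

      // After: build each list once, then index into it inside the loop.
      static void buildListsOnce(MessageType requestedSchema) {
        List&lt;String[]&gt; paths = requestedSchema.getPaths();
        List&lt;ColumnDescriptor&gt; columns = requestedSchema.getColumns();
        for (int i = 0; i < requestedSchema.getFieldCount(); i++) {
          checkColumn(paths.get(i), columns.get(i));
        }
      }
    }
    </pre>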
    
    I ran benchmarks on my laptop with 1 worker thread, running this query:
    <pre>
    sql("select * from parquet_backed_table where id1 = 1").collect
    </pre>
There is roughly one matching row for every 425 rows, and the matching 
rows are spread fairly evenly throughout the table (that is, every page for 
column <code>id1</code> has at least one matching row).
    
    6000 columns, 1 million rows, 67 files of ~32 MB each:
    
    master | branch | improvement
    -------|---------|-----------
    10.87 min | 6.09 min | 44%
    
    6000 columns, 1 million rows, 23 files of ~98 MB each:
    
    master | branch | improvement
    -------|---------|-----------
    7.39 min | 5.80 min | 21%
    
    600 columns, 10 million rows, 67 files of ~32 MB each:
    
    master | branch | improvement
    -------|---------|-----------
    1.95 min | 1.96 min | -0.5%
    
    60 columns, 100 million rows, 67 files of ~32 MB each:
    
    master | branch | improvement
    -------|---------|-----------
    0.55 min | 0.55 min | 0%
    
    ## How was this patch tested?
    
    - sql unit tests
    - pyspark-sql tests
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bersprockets/spark SPARK-25164

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22188
    
----
commit 697de21501acbda3dbcd8ccc13a35ad3723a652e
Author: Bruce Robbins <bersprockets@...>
Date:   2018-08-22T02:00:28Z

    Initial commit

----

