bersprockets opened a new pull request #23392: [SPARK-26450][SQL] Avoid 
rebuilding map of schema for every column in projection
URL: https://github.com/apache/spark/pull/23392
 
 
   ## What changes were proposed in this pull request?
   
   When creating some unsafe projections, Spark rebuilds the map of schema 
attributes once for each expression in the projection. Some file format readers 
create one unsafe projection per input file, others create one per task. 
ProjectExec also creates one unsafe projection per task. As a result, for wide 
queries on wide tables, Spark might build the map of schema attributes hundreds 
of thousands of times.
   
   This PR changes two functions to reuse the same AttributeSeq instance when 
creating BoundReference objects for each expression in the projection. This 
avoids the repeated rebuilding of the map of schema attributes.
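The essence of the change can be illustrated with a simplified sketch (this is not Spark code; the real change is in Spark's binding of expressions to an `AttributeSeq`, and the names below are illustrative only). The old path effectively rebuilds the name-to-ordinal lookup for the schema once per expression; the fixed path builds it once per projection and reuses it:

```python
def bind_slow(exprs, schema):
    """Old behavior (simplified): the attribute lookup map is rebuilt
    for every expression in the projection."""
    bound = []
    for name in exprs:
        # Rebuilt on every iteration -- O(schema size) work per expression.
        index_by_name = {attr: i for i, attr in enumerate(schema)}
        bound.append(index_by_name[name])
    return bound

def bind_fast(exprs, schema):
    """New behavior (simplified): the attribute lookup map is built once
    and shared across all expressions in the projection."""
    index_by_name = {attr: i for i, attr in enumerate(schema)}  # built once
    return [index_by_name[name] for name in exprs]
```

For a 6000-column schema projected in full, the first version does on the order of 6000 × 6000 map-building work, while the second does 6000 + 6000; repeated once per input file or per task, that difference accounts for the savings measured below.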
   
   ### Benchmarks
   
   The time saved by this PR depends on the size of the schema, the size of the projection, the number of input files (or file splits), the number of tasks, and the file format. I chose a couple of example cases.
   
   In the following tests, I ran the query
   ```sql
   select * from table where id1 = 1
   ```
   
   Matching rows are about 0.2% of the table.
   
   #### Orc table 6000 columns, 500K rows, 34 input files
   
   baseline | pr | improvement
   ----|----|----
   1.772306 min | 1.487267 min | 16.082943%
   
   #### Orc table 6000 columns, 500K rows, *17* input files
   
   baseline | pr | improvement
   ----|----|----
1.656400 min | 1.423550 min | 14.057595%
   
   #### Orc table 60 columns, 50M rows, 34 input files
   
   baseline | pr | improvement
   ----|----|----
   0.299878 min | 0.290339 min | 3.180926%
   
   #### Parquet table 6000 columns, 500K rows, 34 input files
   
   baseline | pr | improvement
   ----|----|----
   1.478306 min | 1.373728 min | 7.074165%
   
   Note: The Parquet reader does not create an unsafe projection. However, the filter in the query causes the planner to add a ProjectExec, which does create one unsafe projection per task. So the improvement here comes from ProjectExec rather than from Parquet itself.
   
   #### Parquet table 60 columns, 50M rows, 34 input files
   
   baseline | pr | improvement
   ----|----|----
   0.245006 min | 0.242200 min | 1.145099%
   
   #### CSV table 6000 columns, 500K rows, 34 input files
   
   baseline | pr | improvement
   ----|----|----
   2.390117 min | 2.182778 min | 8.674844%
   
   #### CSV table 60 columns, 50M rows, 34 input files
   
   baseline | pr | improvement
   ----|----|----
   1.520911 min | 1.510211 min | 0.703526%
   
   ## How was this patch tested?
   
   SQL unit tests
   Python core and SQL tests
   
   
