bersprockets opened a new pull request #23336: [SPARK-26378][SQL] Restore 
performance of queries against wide CSV tables
URL: https://github.com/apache/spark/pull/23336
 
 
   ## What changes were proposed in this pull request?
   
   After a recent change that made CSV parsing return partial results for bad
   CSV records, queries against wide CSV tables slowed considerably. That change
   caused every row to be recreated, even when the associated input record had
   no parsing issues and the user specified no corrupt record field in their
   schema.
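
   For reference, the corrupt record field mentioned above is an optional
   column in the user-supplied schema that captures malformed input. A minimal
   example of enabling it (the schema, column names, and file path here are
   illustrative):

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types.{StringType, StructType}

   val spark = SparkSession.builder().appName("csv-corrupt-record").getOrCreate()

   // User-supplied schema with an explicit corrupt record column. When this
   // column is present, parsed values must be shifted to make room for it.
   val schema = new StructType()
     .add("c0", StringType)
     .add("c1", StringType)
     .add("_corrupt_record", StringType)

   val df = spark.read
     .schema(schema)
     .option("mode", "PERMISSIVE") // keep malformed rows rather than dropping or failing
     .option("columnNameOfCorruptRecord", "_corrupt_record")
     .csv("/path/to/wide_table.csv") // illustrative path
   ```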
   
   In this PR, I propose that a row be recreated only if there is a parsing
   error or if columns need to be shifted due to the presence of a corrupt
   record field in the user-supplied schema. Otherwise, the row is used as-is.
   This restores performance for the non-error case only. A sketch of the
   decision follows.
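
   As a minimal, self-contained sketch of that decision (names and types are
   illustrative, not the actual UnivocityParser internals):

   ```scala
   // Illustrative model of the proposed fast/slow path, not Spark's real code.
   case class ParseResult(row: Array[Any], error: Option[Throwable])

   def finishRow(result: ParseResult, hasCorruptField: Boolean): Array[Any] =
     if (result.error.isEmpty && !hasCorruptField) {
       // Fast path: good record and no corrupt record column, so reuse the row as-is.
       result.row
     } else {
       // Slow path: recreate the row, here by appending the corrupt record value.
       result.row :+ result.error.map(_.getMessage).orNull
     }
   ```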
   
   ### Benchmarks:
   
   baseline = commit before partial results change
   PR = this PR
   master = master branch
   
   The wide table has 6000 columns and 165,000 records, and the narrow table 
has 12 columns and 82,500,000 records. Tests are run with a single executor.
   
   In the following, positive percentages are bad (slower), negative are good 
(faster).
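
   Each diff is computed relative to the baseline, i.e. diff = (time - baseline
   time) / baseline time * 100. For example, the first wide-row PR diff below
   is (1.990344 - 2.036489) / 2.036489, or about -2.27%.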
   
   #### Wide rows, all good records:
   
   baseline | pr | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.036489 min | 1.990344 min | 2.952561 min | -2.265882% | 44.982923%
   
   #### Wide rows, all bad records
   
   baseline | pr | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   1.660761 min | 3.016839 min | 3.011944 min | 81.653994% | 81.359283%
   
   Both my PR and the master branch are ~81% slower than the baseline when all
   records are bad but the user specified no corrupt record field in their
   schema. In fact, the master branch is reliably, if only slightly, faster
   than this PR here, since it does not call badRecord() in this case.
   
   #### Wide rows, corrupt record field, all good records
   
   baseline | pr | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.912467 min | 2.893039 min | 2.905344 min | -0.667056% | -0.244543%
   
   #### Wide rows, corrupt record field, all bad records
   
   baseline | pr | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.441417 min | 2.979544 min | 2.957439 min | 22.041620% | 21.136180%
   
   Both my PR and the master branch are ~21-22% slower than the baseline when
   all records are bad and the user specified a corrupt record field in their
   schema.
   
   #### Narrow rows, all good records
   
   baseline | pr | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.004539 min | 1.987183 min | 2.365122 min | -0.865813% | 17.988343%
   
   #### Narrow rows, corrupt record field, all good records
   
   baseline | pr | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.390589 min | 2.382100 min | 2.379733 min | -0.355096% | -0.454095%

   ## How was this patch tested?
   
   All SQL unit tests
   Python core and SQL tests
   
