MaxGekk opened a new pull request #26973: [SPARK-30323][SQL] Support filters pushdown in CSV datasource
URL: https://github.com/apache/spark/pull/26973
 
 
   ### What changes were proposed in this pull request?
   
   This PR adds support for pushed-down filters in the CSV datasource. The
filters are compiled to predicates and applied to values converted from parsed
CSV fields.
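
The idea can be illustrated with a minimal, self-contained sketch. This is not the code from the PR and does not use Spark's API: the `(column, operator, value)` filter tuples and the names `compile_filters` and `read_csv` are hypothetical, chosen only to show filters being compiled into a predicate once and then applied to each row after its fields are converted to typed values.

```python
import csv
import io
import operator
from typing import Any, Callable, Dict, Iterator, List, Tuple

# Hypothetical filter representation for illustration: (column, operator, value).
Filter = Tuple[str, str, Any]

# Comparison operators supported by this sketch.
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def compile_filters(filters: List[Filter]) -> Callable[[Dict[str, Any]], bool]:
    """Compile a list of filters into a single row predicate (logical AND)."""
    checks = [(column, OPS[op], value) for column, op, value in filters]

    def predicate(row: Dict[str, Any]) -> bool:
        return all(compare(row[column], value) for column, compare, value in checks)

    return predicate


def read_csv(
    text: str,
    schema: Dict[str, Callable[[str], Any]],
    filters: List[Filter],
) -> Iterator[Dict[str, Any]]:
    """Parse CSV text, convert each field per the schema, and keep matching rows."""
    predicate = compile_filters(filters)  # compiled once, reused for every row
    for raw in csv.DictReader(io.StringIO(text)):
        # Convert parsed string fields to typed values.
        row = {name: convert(raw[name]) for name, convert in schema.items()}
        # Apply the predicate to the converted values; rejected rows are dropped here.
        if predicate(row):
            yield row


if __name__ == "__main__":
    data = "id,name,price\n1,a,5.0\n2,b,15.5\n3,c,20.0\n"
    schema = {"id": int, "name": str, "price": float}
    for row in read_csv(data, schema, [("price", ">", 10.0), ("id", "<", 3)]):
        print(row)
```

In this sketch, rows that fail the predicate are discarded inside the reader, so nothing downstream has to process them.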
   
   ### Why are the changes needed?
   The changes improve performance on synthetic benchmarks by **more than 9
times** (on JDK 8 & 11):
   ```
   OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2
   Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
   Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   w/o filters                                       11889          11945          52          0.0      118893.1       1.0X
   pushdown disabled                                 11790          11860         115          0.0      117902.3       1.0X
   w/ filters                                         1240           1278          33          0.1       12400.8       9.6X
   ```
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   - Added new test suite `CSVFiltersSuite`
   - Added tests to `CSVSuite` and `UnivocityParserSuite`
