MaxGekk opened a new pull request #27366: [WIP][SPARK-30648][SQL] Support filters pushdown in JSON datasource
URL: https://github.com/apache/spark/pull/27366

### What changes were proposed in this pull request?
In this PR, I propose to support pushed-down filters in the JSON datasource. The reason for pushing a filter down to `JacksonParser` is to apply the filter as soon as all of its attributes become available, i.e. have been converted from JSON field values to the desired values according to the schema. This allows the parser to skip converting the remaining values if the filter returns `false`. This can improve performance when the pushed filters are highly selective and converting JSON string fields to the desired values is comparatively expensive (for example, conversion to `TIMESTAMP` values).

### Why are the changes needed?
The changes improve performance on synthetic benchmarks by up to **20 times** (on JDK 8):
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.2
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Filters pushdown:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                    13083          13150          61          0.0      130827.2       1.0X
pushdown disabled              13039          13115          78          0.0      130387.7       1.0X
w/ filters                       637            649          10          0.2        6369.7      20.5X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- Added a new end-to-end test in `JsonSuite`.
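
For illustration only, the early-filtering idea from the first section can be sketched in plain Python. This is not the code in this PR or the `JacksonParser` implementation: `parse_record`, `SCHEMA`, and the sample records are hypothetical names made up for the sketch, and it assumes a simple comparison filter for which a missing attribute means the row is rejected.

```python
import json
from datetime import datetime

# Hypothetical schema: field name -> converter from the raw JSON value.
SCHEMA = {
    "id": int,
    "ts": datetime.fromisoformat,  # stands in for an expensive conversion
    "payload": str,
}


def parse_record(line, schema, filter_fields, predicate):
    """Convert fields in document order and apply the predicate as soon as
    every field it references has been converted.

    Returns (row, conversions); row is None when the record is filtered out.
    """
    raw = json.loads(line)
    row = {}
    pending = set(filter_fields)  # filter attributes not yet converted
    conversions = 0
    for name, value in raw.items():
        convert = schema.get(name)
        if convert is None:
            continue  # field is not part of the schema
        row[name] = convert(value)
        conversions += 1
        pending.discard(name)
        # Evaluate once, right after the last filter attribute is converted.
        if name in filter_fields and not pending:
            if not predicate(row):
                return None, conversions  # skip the remaining conversions
    if pending:
        # Simplification: a missing filter attribute rejects the row.
        return None, conversions
    return row, conversions


lines = [
    '{"id": 1, "ts": "2020-01-27T10:00:00", "payload": "a"}',
    '{"id": 2, "ts": "2020-01-27T11:00:00", "payload": "b"}',
]
results = [
    parse_record(line, SCHEMA, {"id"}, lambda r: r["id"] == 1)
    for line in lines
]
```

In this sketch the second record is rejected after a single conversion, so its timestamp is never parsed. The saving depends on the filter attributes appearing early in each record and on the filter being selective, which matches the conditions described above.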
