umehrot2 opened a new pull request #2893: URL: https://github.com/apache/hudi/pull/2893
## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

This PR adds support for metadata-based listing for Hudi Spark DataSource and Spark SQL based queries. The detailed design for Spark integration (V2 implementation specifically) can be found at https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements#RFC15:HUDIFileListingImprovements-Spark.

Two parts of the V2 design have already been implemented:
- Custom FileIndex for Hudi: https://github.com/apache/hudi/pull/2651
- Registering Hudi tables as DataSource tables in Hive metastore so they are executed via Hudi DataSource instead of Hive InputFormat/Serde. In the process, it will also use the FileIndex implemented in Hudi DataSource: https://github.com/apache/hudi/pull/2283

In this PR we build on top of the FileIndex implementation to get the file listing from Hudi's metadata table if the feature is enabled, and otherwise fall back to distributed listing using Spark Context. The metadata table will be read just once, and it will reduce O(N) list calls to O(1) get calls for N partitions. We also refactor the Hudi metadata table contract to add a new API which can fetch file lists for multiple partitions (it opens the reader just once).

## Brief change log

## Verify this pull request
- Existing unit tests updated
- Internally on AWS EMR ran several performance tests via Spark DataSource and Spark SQL to observe improvements in query planning times

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
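
The listing strategy described in the pull request above (use the metadata table when the feature is enabled, otherwise fall back to listing each partition) can be illustrated with a minimal Java sketch. Every name here (`ListingSketch`, `CallCounter`, `listFromMetadata`, `listDirectly`, `listFiles`) is hypothetical and is not Hudi's actual API. An in-memory map stands in for both the storage layer and the metadata table, and the Spark-based parallelism of the real fallback path is omitted. The counters only show why a single batched lookup avoids one list call per partition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not Hudi's API.
final class ListingSketch {

    /** Counts simulated calls so the two strategies can be compared. */
    static final class CallCounter {
        int listCalls;      // one per simulated directory listing
        int metadataOpens;  // one per simulated open of the metadata reader
    }

    /** Fallback path: one list call per partition, so N calls for N partitions. */
    static Map<String, List<String>> listDirectly(
            Map<String, List<String>> storage, List<String> partitions, CallCounter counter) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String partition : partitions) {
            counter.listCalls++; // simulated file system list call
            result.put(partition, new ArrayList<>(storage.getOrDefault(partition, List.of())));
        }
        return result;
    }

    /** Metadata path: open the reader once, then do a keyed lookup per partition. */
    static Map<String, List<String>> listFromMetadata(
            Map<String, List<String>> metadataTable, List<String> partitions, CallCounter counter) {
        counter.metadataOpens++; // the reader is opened once for the whole batch
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String partition : partitions) {
            result.put(partition, new ArrayList<>(metadataTable.getOrDefault(partition, List.of())));
        }
        return result;
    }

    /** Uses the metadata table when enabled, otherwise falls back to direct listing. */
    static Map<String, List<String>> listFiles(
            boolean metadataEnabled, Map<String, List<String>> files,
            List<String> partitions, CallCounter counter) {
        return metadataEnabled
                ? listFromMetadata(files, partitions, counter)
                : listDirectly(files, partitions, counter);
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("2021/04/01", List.of("a.parquet", "b.parquet"));
        files.put("2021/04/02", List.of("c.parquet"));
        List<String> partitions = List.of("2021/04/01", "2021/04/02");

        CallCounter withMetadata = new CallCounter();
        listFiles(true, files, partitions, withMetadata);
        CallCounter withoutMetadata = new CallCounter();
        listFiles(false, files, partitions, withoutMetadata);

        System.out.println("metadata enabled:  opens=" + withMetadata.metadataOpens
                + ", list calls=" + withMetadata.listCalls);
        System.out.println("metadata disabled: opens=" + withoutMetadata.metadataOpens
                + ", list calls=" + withoutMetadata.listCalls);
    }
}
```

Both paths return the same partition-to-files mapping; only the number of simulated storage calls differs, which is the effect the pull request attributes to the multi-partition metadata API.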
