[GitHub] [hudi] umehrot2 opened a new pull request #2893: [HUDI-1371] Support metadata based listing for Spark DataSource and Spark SQL

GitBox Wed, 28 Apr 2021 15:18:16 -0700


umehrot2 opened a new pull request #2893:
URL: https://github.com/apache/hudi/pull/2893



   ## What is the purpose of the pull request
   
   This PR adds support for metadata-based listing for Hudi Spark DataSource 
and Spark SQL queries. The detailed design for the Spark integration 
(specifically the V2 implementation) is described in RFC-15: 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements#RFC15:HUDIFileListingImprovements-Spark
   Two parts of the V2 design have already been implemented:
   - Custom FileIndex for Hudi: https://github.com/apache/hudi/pull/2651
   - Registering Hudi tables as DataSource tables in the Hive metastore, so 
that queries on them are executed via the Hudi DataSource instead of the Hive 
InputFormat/SerDe. Such queries therefore also use the FileIndex implemented 
in the Hudi DataSource: 
https://github.com/apache/hudi/pull/2283
   
   In this PR we build on top of the FileIndex implementation to get the file 
listing from Hudi's metadata table if the feature is enabled, and otherwise 
fall back to distributed listing using the Spark context. The metadata table 
is read just once, which reduces O(N) list calls to O(1) get calls for N 
partitions. We also refactor the Hudi metadata table contract to add a new API 
that fetches file lists for multiple partitions while opening the reader just 
once. 
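
   The listing strategy described above can be sketched roughly as follows. 
This is a minimal, language-agnostic illustration written in Python, not the 
code in this PR: `InMemoryMetadataTable`, `get_all_files_in_partitions`, 
`list_files`, and `list_partition` are hypothetical names, the partition and 
file names are made up, and a plain loop stands in for the distributed listing 
that Spark would perform.

```python
from typing import Callable, Dict, Iterable, List, Optional


class InMemoryMetadataTable:
    """Hypothetical stand-in for the metadata table: partition path -> file names."""

    def __init__(self, files_by_partition: Dict[str, List[str]]) -> None:
        self._files_by_partition = files_by_partition
        self.reader_opens = 0  # counts how often the (simulated) reader is opened

    def get_all_files_in_partitions(self, partitions: Iterable[str]) -> Dict[str, List[str]]:
        # Open the reader once for the whole batch, then do one lookup per partition.
        self.reader_opens += 1
        return {p: list(self._files_by_partition.get(p, [])) for p in partitions}


def list_files(
    partitions: List[str],
    metadata_table: Optional[InMemoryMetadataTable],
    list_partition: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Use the metadata table when it is available; otherwise list each partition."""
    if metadata_table is not None:
        return metadata_table.get_all_files_in_partitions(partitions)
    # Fallback: one list call per partition. A plain loop stands in for the
    # distributed listing here (assumption for illustration only).
    return {p: list_partition(p) for p in partitions}


# Made-up example data: two partitions with a few files each.
storage = {"2021/04/27": ["a.parquet", "b.parquet"], "2021/04/28": ["c.parquet"]}
list_calls: List[str] = []


def list_partition(partition: str) -> List[str]:
    list_calls.append(partition)  # record every simulated file-system list call
    return list(storage[partition])


table = InMemoryMetadataTable(storage)
partitions = sorted(storage)
via_metadata = list_files(partitions, table, list_partition)
via_fallback = list_files(partitions, None, list_partition)
```

   In this sketch both paths return the same listing, but the metadata path 
opens the reader once for all partitions, while the fallback issues one list 
call per partition.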
   
   ## Brief change log
   
   ## Verify this pull request
   
   - Updated existing unit tests
   - Ran several performance tests internally on AWS EMR via Spark DataSource 
and Spark SQL to observe improvements in query planning times
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


