boneanxs opened a new pull request, #6046: URL: https://github.com/apache/hudi/pull/6046
## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

Enable the row writer for clustering to improve performance.

## Brief change log

1. Integrate clustering with the datasource read and write APIs. This:
   - lets clustering use the Dataset API;
   - unifies the read and write paths, so any improvement to the read/write logic also benefits clustering.
2. Use `hoodie.datasource.read.paths` to pass the input paths for each clustering operation.
3. Introduce `HoodieInternalWriteStatusCoordinator` to persist the `InternalWriteStatus` of a clustering action, since it cannot be obtained when writing through the Spark datasource.
4. Add new configuration options to control this behavior.

## Verify this pull request

Manual test: a table with 21 columns and 710,716 rows; raw data size 929 GB (in Spark memory), 38.3 GB after compression. Executor memory 50 GB, 20 instances, with global_sort enabled.

- Clustering without the row writer: 32 min 12 s
- Clustering with the row writer: 9 min 51 s

Also updated existing tests (`TestHoodieSparkMergeOnReadTableClustering` and `testLayoutOptimizationFunctional`) to cover this feature.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
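To illustrate the idea of routing a clustering operation through the Spark datasource, here is a minimal sketch. It is not the PR's actual implementation: the file list, sort column, and table path are illustrative placeholders, and the Hudi write options a real job needs (table name, record key, etc.) are omitted for brevity. Only `hoodie.datasource.read.paths`, which the PR uses to scope the read to one clustering operation's files, is taken from the description above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clustering-row-writer-sketch").getOrCreate()

// Files belonging to a single clustering operation (illustrative values).
val inputFiles: Seq[String] = Seq(
  "s3://bucket/tbl/p1/file1.parquet",
  "s3://bucket/tbl/p1/file2.parquet"
)

// `hoodie.datasource.read.paths` limits the datasource read to exactly these
// files instead of scanning the whole table, so each clustering operation
// reads only its own input file slices.
val rows = spark.read
  .format("hudi")
  .option("hoodie.datasource.read.paths", inputFiles.mkString(","))
  .load()

// Sort and rewrite as rows through the datasource write path; the sort column
// and target path are assumptions, and required Hudi write options are omitted.
rows.sort("key")
  .write
  .format("hudi")
  .save("s3://bucket/tbl")
```

Going through the datasource on both sides is what lets clustering reuse the row-based read/write logic, rather than converting records through the RDD-based write path.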
