boneanxs opened a new pull request, #6046: URL: https://github.com/apache/hudi/pull/6046
## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

Enable the row writer for clustering to improve performance.

## Brief change log

1. Integrate clustering with the datasource read and write APIs. This:
   - lets clustering use the Dataset API;
   - unifies the read and write paths, so any improvement to the read/write logic also benefits clustering.
2. Use `hoodie.datasource.read.paths` to pass the input paths for each clustering operation.
3. Introduce `HoodieInternalWriteStatusCoordinator` to persist the `InternalWriteStatus` of a clustering action, since it cannot be obtained when writing through the Spark datasource.
4. Add new configuration options to control this behavior.

## Verify this pull request

Manual test: a table with 21 columns and 710,716 rows; raw data size 929 GB (in Spark memory), 38.3 GB after compression. Executor memory 50 GB, 20 instances, with global_sort enabled.

- Clustering without the row writer: 32 min 12 s
- Clustering with the row writer: 9 min 51 s

Also updated existing tests (`TestHoodieSparkMergeOnReadTableClustering` and `testLayoutOptimizationFunctional`) to cover this feature.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
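To illustrate the idea of routing a clustering operation through the Spark datasource, here is a minimal sketch. It is not the PR's actual implementation: the file list, sort column, and table path are illustrative placeholders, and the Hudi write options a real job needs (table name, record key, etc.) are omitted for brevity. Only `hoodie.datasource.read.paths`, which the PR uses to scope the read to one clustering operation's files, is taken from the description above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clustering-row-writer-sketch").getOrCreate()

// Files belonging to a single clustering operation (illustrative values).
val inputFiles: Seq[String] = Seq(
  "s3://bucket/tbl/p1/file1.parquet",
  "s3://bucket/tbl/p1/file2.parquet"
)

// `hoodie.datasource.read.paths` limits the datasource read to exactly these
// files instead of scanning the whole table, so each clustering operation
// reads only its own input file slices.
val rows = spark.read
  .format("hudi")
  .option("hoodie.datasource.read.paths", inputFiles.mkString(","))
  .load()

// Sort and rewrite as rows through the datasource write path; the sort column
// and target path are assumptions, and required Hudi write options are omitted.
rows.sort("key")
  .write
  .format("hudi")
  .save("s3://bucket/tbl")
```

Going through the datasource on both sides is what lets clustering reuse the row-based read/write logic, rather than converting records through the RDD-based write path.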
