nsivabalan opened a new pull request #1834: URL: https://github.com/apache/hudi/pull/1834
## What is the purpose of the pull request

Adds support for `bulk_insert_dataset`, which offers better performance than the existing `bulk_insert`.

## Brief change log

- Added support for `bulk_insert_dataset`, which offers better performance than the existing `bulk_insert`.
- This patch introduces a new datasource, `org.apache.hudi.internal`, along with supporting classes such as DefaultSource, DataSourceWriter, DataWriterFactory, and DataWriter.
- This patch also introduces HoodieRowCreateHandle, HoodieInternalRowFileWriterFactory, HoodieInternalRowFileWriter, etc., to assist in writing InternalRows to parquet.
- This patch changes KeyGenerator to ensure getRecordKey and getPartitionPath are supported with Row for `bulk_insert_dataset`. New APIs are added to KeyGenerator, with default implementations provided so there is no breaking change. All KeyGenerator implementations have been updated accordingly.
- Added HoodieDatasetBulkInsertHelper to assist in prepping the dataset before calling into the datasource write.

## Verify this pull request

This change added tests and can be verified as follows:

- Added tests in HoodieSparkSqlWriterSuite to cover the happy path.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
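The KeyGenerator change described above follows a common backward-compatible pattern: new Row-based methods are added with default implementations that delegate to the existing record-based path, so subclasses written before the change keep compiling and working. The sketch below illustrates that pattern only; all names here (`RowLike`, `SimpleKeyGenerator`, the map-based record) are illustrative stand-ins, not Hudi's actual API, which operates on Avro GenericRecord and Spark's Row/InternalRow.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a Spark Row (hypothetical; not Hudi's API).
interface RowLike {
    Map<String, String> asMap();
}

abstract class KeyGenerator {
    // Existing record-based contract (unchanged by the patch).
    public abstract String getRecordKey(Map<String, String> record);
    public abstract String getPartitionPath(Map<String, String> record);

    // New Row-based APIs. Default implementations delegate to the existing
    // record-based path, so pre-existing subclasses are not broken.
    public String getRecordKey(RowLike row) {
        return getRecordKey(row.asMap());
    }

    public String getPartitionPath(RowLike row) {
        return getPartitionPath(row.asMap());
    }
}

// A concrete generator reading one key field and one partition field.
class SimpleKeyGenerator extends KeyGenerator {
    private final String keyField;
    private final String partitionField;

    SimpleKeyGenerator(String keyField, String partitionField) {
        this.keyField = keyField;
        this.partitionField = partitionField;
    }

    @Override
    public String getRecordKey(Map<String, String> record) {
        return record.get(keyField);
    }

    @Override
    public String getPartitionPath(Map<String, String> record) {
        return record.get(partitionField);
    }
}

public class KeyGenSketch {
    public static void main(String[] args) {
        SimpleKeyGenerator gen = new SimpleKeyGenerator("id", "region");
        Map<String, String> rec = new HashMap<>();
        rec.put("id", "42");
        rec.put("region", "us-west");

        // The record-based path and the new Row-based path agree.
        System.out.println(gen.getRecordKey(rec));           // 42
        System.out.println(gen.getPartitionPath(() -> rec)); // us-west
    }
}
```

A generator that can extract keys directly from a Row avoids converting every Row to an Avro record first, which is part of why the Row-native path performs better for bulk inserts.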