nsivabalan opened a new pull request #1834:
URL: https://github.com/apache/hudi/pull/1834


   ## What is the purpose of the pull request
   
   Adds support for a "bulk_insert_dataset" operation, which performs better 
than the existing "bulk_insert". 
   
   ## Brief change log
   
   - Added support for "bulk_insert_dataset", which performs better than the 
existing "bulk_insert". 
   - This patch introduces a new datasource called "org.apache.hudi.internal" 
and its supporting classes, such as DefaultSource, DataSourceWriter, 
DataWriterFactory, and DataWriter.
   - This patch also introduces HoodieRowCreateHandle, 
HoodieInternalRowFileWriterFactory, HoodieInternalRowFileWriter, etc to assist 
in writing InternalRows to parquet. 
   - This patch changes KeyGenerator so that getRecordKey and 
getPartitionPath are supported with Row for "bulk_insert_dataset". New APIs are 
added to KeyGenerator, each with a default implementation so that the change 
is not breaking. All KeyGenerator implementations have been updated 
accordingly. 
   - Added HoodieDatasetBulkInsertHelper to assist in prepping the dataset 
before calling into datasource write. 
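
The KeyGenerator item above describes adding Row-based APIs with default implementations so that existing key generators keep compiling. A minimal sketch of that pattern follows. It is not the code from this patch: `SimpleRow` is a hypothetical stand-in for Spark's `Row`, the class names are invented for illustration, and having the default throw `UnsupportedOperationException` is an assumption about what a default implementation might do.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyGeneratorSketch {

    /** Hypothetical stand-in for a Spark Row: field names plus positional values. */
    static final class SimpleRow {
        private final List<String> fieldNames;
        private final List<Object> values;

        SimpleRow(List<String> fieldNames, List<Object> values) {
            if (fieldNames.size() != values.size()) {
                throw new IllegalArgumentException("field names and values differ in length");
            }
            this.fieldNames = fieldNames;
            this.values = values;
        }

        Object getAs(String fieldName) {
            int idx = fieldNames.indexOf(fieldName);
            if (idx < 0) {
                throw new IllegalArgumentException("Unknown field: " + fieldName);
            }
            return values.get(idx);
        }
    }

    /** Hypothetical base class; the Map-based methods model the pre-existing API. */
    abstract static class BaseKeyGenerator {
        abstract String getRecordKey(Map<String, Object> record);

        abstract String getPartitionPath(Map<String, Object> record);

        // New row-based APIs. They are not abstract, so subclasses written
        // before this change still compile (assumed default: fail loudly).
        String getRecordKey(SimpleRow row) {
            throw new UnsupportedOperationException(
                getClass().getSimpleName() + " does not support row-based record keys");
        }

        String getPartitionPath(SimpleRow row) {
            throw new UnsupportedOperationException(
                getClass().getSimpleName() + " does not support row-based partition paths");
        }
    }

    /** A key generator written against the old API only. */
    static class LegacyKeyGenerator extends BaseKeyGenerator {
        protected final String keyField;
        protected final String partitionField;

        LegacyKeyGenerator(String keyField, String partitionField) {
            this.keyField = keyField;
            this.partitionField = partitionField;
        }

        @Override
        String getRecordKey(Map<String, Object> record) {
            return String.valueOf(record.get(keyField));
        }

        @Override
        String getPartitionPath(Map<String, Object> record) {
            return String.valueOf(record.get(partitionField));
        }
    }

    /** The same generator after being updated to support the row-based path. */
    static class RowAwareKeyGenerator extends LegacyKeyGenerator {
        RowAwareKeyGenerator(String keyField, String partitionField) {
            super(keyField, partitionField);
        }

        @Override
        String getRecordKey(SimpleRow row) {
            return String.valueOf(row.getAs(keyField));
        }

        @Override
        String getPartitionPath(SimpleRow row) {
            return String.valueOf(row.getAs(partitionField));
        }
    }

    public static void main(String[] args) {
        SimpleRow row = new SimpleRow(
            Arrays.asList("id", "region"), Arrays.<Object>asList("k1", "us"));
        Map<String, Object> record = new HashMap<>();
        record.put("id", "k1");
        record.put("region", "us");

        BaseKeyGenerator updated = new RowAwareKeyGenerator("id", "region");
        System.out.println(updated.getRecordKey(record));
        System.out.println(updated.getRecordKey(row));
        System.out.println(updated.getPartitionPath(row));

        BaseKeyGenerator legacy = new LegacyKeyGenerator("id", "region");
        try {
            legacy.getRecordKey(row);
        } catch (UnsupportedOperationException e) {
            System.out.println("legacy generator: " + e.getMessage());
        }
    }
}
```

In this sketch a generator that has not been updated still works on the old path and fails only if it is used on the row-based path, which is one way to keep such an API addition non-breaking.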
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
     - *Added tests in HoodieSparkSqlWriterSuite to test happy path*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



