mpouttu opened a new pull request #2907:
URL: https://github.com/apache/hudi/pull/2907


   ## What is the purpose of the pull request
   
   Improve performance and resilience for large upserts.
   
   ## Brief change log
   
   A collect call causes resource issues with very large upserts and is used 
only to report error messages that are already in the Spark task logs. I 
replaced it with an .isEmpty() call and amended the error message to direct the 
user to the task logs. I also added a log statement that clearly indicates when 
BULK_INSERT is being used.
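
The difference between the two approaches can be sketched in plain Python. This is only an illustration, not the code in this PR and not the Spark API: `failed_records`, `report_with_collect`, and `report_with_is_empty` are hypothetical names, and a generator stands in for the distributed error dataset. The point is that collecting materializes every error record, while an emptiness check needs at most one.

```python
from typing import Iterable, Iterator, List


def failed_records(total: int, pulled: List[int]) -> Iterator[str]:
    """Hypothetical stand-in for a lazily evaluated dataset of write errors.

    Each index is appended to `pulled` as it is produced, so the caller can
    see how many elements were actually materialized.
    """
    for i in range(total):
        pulled.append(i)
        yield f"error writing record {i}"


def report_with_collect(errors: Iterable[str]) -> str:
    # Collect-style: materialize every error just to build a message.
    all_errors = list(errors)
    if not all_errors:
        return "ok"
    return f"{len(all_errors)} errors, first: {all_errors[0]}"


def report_with_is_empty(errors: Iterable[str]) -> str:
    # isEmpty-style: pull at most one element to decide whether anything failed.
    sentinel = object()
    if next(iter(errors), sentinel) is sentinel:
        return "ok"
    return "write failed; see the Spark task logs for details"


pulled_collect: List[int] = []
msg_collect = report_with_collect(failed_records(1000, pulled_collect))

pulled_check: List[int] = []
msg_check = report_with_is_empty(failed_records(1000, pulled_check))

print(len(pulled_collect), len(pulled_check))
```

In this sketch the collect-style report touches all 1000 simulated errors, while the emptiness check touches one. The trade-off is the one described above: the message no longer carries the individual errors, so it has to point the user to the task logs.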
   
   ## Verify this pull request
   
   This issue occurs only with very large upserts of thin rows (600 million 
changed rows with 12 fields) on a resource-constrained cluster, so it cannot 
be reproduced in unit tests or integration tests.
   
     - Pull and build the commit
     - Manually verify the fix on EMR
   
   ## Committer checklist
   
    - [x] Has a corresponding JIRA in PR title & commit
    
    - [x] Commit message is descriptive of the change
    
    - [x] CI is green
   
    - [x] Necessary doc changes done or have another open PR
          
    - [x] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



