mpouttu opened a new pull request #2907:
URL: https://github.com/apache/hudi/pull/2907
## What is the purpose of the pull request
Improve performance and resilience for large upserts
## Brief change log
A `collect()` call causes resource issues with very large upserts, and it is
used only to report error messages that are already in the Spark task logs. I
replaced it with an `.isEmpty()` call and amended the error message to direct
the user to the task logs. I also added a log statement that clearly indicates
when BULK_INSERT is being used.
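
To illustrate why the swap helps, here is a minimal plain-Java sketch. It is not the Hudi or Spark code touched by this PR: `ErrorCheckSketch`, `WriteResult`, `simulate`, `collectErrors` and `hasErrors` are hypothetical names, and a lazy `java.util.stream.Stream` stands in for the distributed dataset. The assumption is only that an emptiness check can stop at the first failed record, whereas a collect has to hold every error message in memory at once.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
import java.util.stream.Stream;

public class ErrorCheckSketch {

    /** Hypothetical stand-in for a per-record write result; error == null means success. */
    static final class WriteResult {
        final long key;
        final String error;

        WriteResult(long key, String error) {
            this.key = key;
            this.error = error;
        }

        boolean hasError() {
            return error != null;
        }
    }

    /**
     * Lazily produces n results in which every failEvery-th record fails.
     * The counter records how many results were actually generated.
     */
    static Stream<WriteResult> simulate(long n, long failEvery, AtomicLong visited) {
        return LongStream.range(0, n).mapToObj(i -> {
            visited.incrementAndGet();
            String error = ((i + 1) % failEvery == 0) ? "failed to write key " + i : null;
            return new WriteResult(i, error);
        });
    }

    /** Old shape: pull every error message into one in-memory list. */
    static List<String> collectErrors(Stream<WriteResult> results) {
        return results.filter(WriteResult::hasError)
                      .map(r -> r.error)
                      .collect(Collectors.toList());
    }

    /** New shape: only ask whether any error exists; evaluation stops at the first match. */
    static boolean hasErrors(Stream<WriteResult> results) {
        return results.anyMatch(WriteResult::hasError);
    }

    public static void main(String[] args) {
        AtomicLong visitedByCollect = new AtomicLong();
        List<String> errors = collectErrors(simulate(1_000, 10, visitedByCollect));
        System.out.println("collect: " + errors.size() + " messages held, "
                + visitedByCollect.get() + " results visited");

        AtomicLong visitedByCheck = new AtomicLong();
        boolean failed = hasErrors(simulate(1_000, 10, visitedByCheck));
        System.out.println("check: failed=" + failed + ", "
                + visitedByCheck.get() + " results visited");
        if (failed) {
            // Mirrors the amended message: point the user at the logs instead of listing errors.
            System.out.println("Write had errors; see the task logs for details.");
        }
    }
}
```

The trade-off is the same as in this change: the caller no longer gets the individual error messages back, so the message it prints has to point at the task logs instead.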
## Verify this pull request
This issue occurs only with very large upserts of thin rows (600 million
changed rows with 12 fields) on a resource-constrained cluster, so it cannot
be reproduced in unit tests or integration tests.
- Pull and build the commit
- Manually verify the fix on EMR
## Committer checklist
- [x] Has a corresponding JIRA in PR title & commit
- [x] Commit message is descriptive of the change
- [x] CI is green
- [x] Necessary doc changes done or have another open PR
- [x] For large changes, please consider breaking it into sub-tasks under
  an umbrella JIRA.