alexeykudinkin opened a new pull request, #7395:
URL: https://github.com/apache/hudi/pull/7395
### Change Logs
This PR cleans up the Spark SQL MERGE INTO implementation, fixing some of its
performance traps:
- `SqlTypedRecord` relies on a cache keyed by Avro `Schema`s, so every lookup
incurs a `Schema.equals` call, which has non-trivial overhead. Worse, this
lookup is performed for _every_ field rather than once per record, which
exacerbates the problem for wider tables.
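To illustrate the trap (a self-contained sketch, not Hudi's actual code: `FakeSchema`, `equalsCallsForOneRecord`, and the field counts are all hypothetical stand-ins for an Avro `Schema`-keyed cache), a hash map keyed by a schema-like object with a deep `equals` pays that full comparison on every lookup, and a per-field lookup multiplies the cost by the column count:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a cache keyed by a schema-like object whose
// equals() walks every field. With a weak hashCode(), each lookup falls
// back to the deep equals(), and doing the lookup once per *field*
// (rather than once per record) multiplies the cost for wide tables.
public class SchemaKeyedCacheTrap {
    static final class FakeSchema {
        final String[] fields;
        static int equalsCalls = 0;
        FakeSchema(String[] fields) { this.fields = fields; }
        @Override public boolean equals(Object o) {
            equalsCalls++;                      // count how often the deep compare runs
            return o instanceof FakeSchema
                && Arrays.equals(fields, ((FakeSchema) o).fields); // O(#fields)
        }
        @Override public int hashCode() { return fields.length; }  // weak hash
    }

    // Simulate evaluating one record: a per-field cache lookup triggers
    // one deep equals() per field.
    static int equalsCallsForOneRecord(int numFields) {
        String[] names = new String[numFields];
        for (int i = 0; i < numFields; i++) names[i] = "col" + i;

        Map<FakeSchema, String> cache = new HashMap<>();
        cache.put(new FakeSchema(names), "evaluator");

        FakeSchema.equalsCalls = 0;
        for (int f = 0; f < numFields; f++) {
            cache.get(new FakeSchema(names)); // per-field lookup => deep equals each time
        }
        return FakeSchema.equalsCalls;
    }

    public static void main(String[] args) {
        // 100-column record => 100 deep equals() calls, each itself O(100)
        System.out.println(equalsCallsForOneRecord(100));
    }
}
```

A per-record lookup would pay the deep comparison once instead of once per column.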
Changelog:
- Cleaned up `SqlTypedRecord`
- Reworked `ExpressionCodeGen` to operate directly on Catalyst's
`InternalRow`
- Fixed expression-evaluator caches to include the target schema in the
key, ensuring there are no cache collisions when multiple statements run
within the same session
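The cache-key fix can be sketched as follows (a hypothetical sketch, not Hudi's actual classes: `EvaluatorCache`, `Key`, and the string-typed schema are illustrative assumptions). Keying only on the expression text collides when the same statement text targets two different schemas in one session; a composite key disambiguates:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: an evaluator cache whose key combines the
// expression text with the target schema, so identical statement text run
// against different target schemas in the same session cannot collide.
public class EvaluatorCache {
    // Composite cache key: expression + target schema (both plain strings here).
    static final class Key {
        final String expr;
        final String targetSchema;
        Key(String expr, String targetSchema) {
            this.expr = expr;
            this.targetSchema = targetSchema;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return expr.equals(k.expr) && targetSchema.equals(k.targetSchema);
        }
        @Override public int hashCode() { return Objects.hash(expr, targetSchema); }
    }

    private final Map<Key, String> cache = new HashMap<>();

    // Returns a cached "evaluator" (a placeholder string) for the pair,
    // compiling one only on the first request.
    public String getOrCompile(String expr, String targetSchema) {
        return cache.computeIfAbsent(new Key(expr, targetSchema),
            k -> "evaluator(" + k.expr + ", " + k.targetSchema + ")");
    }
}
```

With this key, `getOrCompile("s.id = t.id", "schemaA")` and `getOrCompile("s.id = t.id", "schemaB")` produce two distinct cache entries instead of the second statement silently reusing the first one's evaluator.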
### Impact
Should considerably improve performance for users of MERGE INTO.
### Risk level (write none, low, medium or high below)
Low
### Documentation Update
N/A
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.