alexeykudinkin opened a new pull request, #7395:
URL: https://github.com/apache/hudi/pull/7395
### Change Logs
This PR cleans up the Spark SQL MERGE INTO implementation, fixing some of its
performance traps:
- `SqlTypedRecord` relies on a cache keyed by Avro `Schema`s, so every lookup
incurs a `Schema.equals` call, which has non-trivial overhead. Worse, this
lookup is performed for _every_ field rather than once per record, which
exacerbates the problem for wider tables.
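To illustrate the trap (a self-contained sketch, not Hudi's actual code: `FakeSchema`, `equalsCallsForOneRecord`, and the field counts are all hypothetical stand-ins for an Avro `Schema`-keyed cache), a hash map keyed by a schema-like object with a deep `equals` pays that full comparison on every lookup, and a per-field lookup multiplies the cost by the column count:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a cache keyed by a schema-like object whose
// equals() walks every field. With a weak hashCode(), each lookup falls
// back to the deep equals(), and doing the lookup once per *field*
// (rather than once per record) multiplies the cost for wide tables.
public class SchemaKeyedCacheTrap {
    static final class FakeSchema {
        final String[] fields;
        static int equalsCalls = 0;
        FakeSchema(String[] fields) { this.fields = fields; }
        @Override public boolean equals(Object o) {
            equalsCalls++;                      // count how often the deep compare runs
            return o instanceof FakeSchema
                && Arrays.equals(fields, ((FakeSchema) o).fields); // O(#fields)
        }
        @Override public int hashCode() { return fields.length; }  // weak hash
    }

    // Simulate evaluating one record: a per-field cache lookup triggers
    // one deep equals() per field.
    static int equalsCallsForOneRecord(int numFields) {
        String[] names = new String[numFields];
        for (int i = 0; i < numFields; i++) names[i] = "col" + i;

        Map<FakeSchema, String> cache = new HashMap<>();
        cache.put(new FakeSchema(names), "evaluator");

        FakeSchema.equalsCalls = 0;
        for (int f = 0; f < numFields; f++) {
            cache.get(new FakeSchema(names)); // per-field lookup => deep equals each time
        }
        return FakeSchema.equalsCalls;
    }

    public static void main(String[] args) {
        // 100-column record => 100 deep equals() calls, each itself O(100)
        System.out.println(equalsCallsForOneRecord(100));
    }
}
```

A per-record lookup would pay the deep comparison once instead of once per column.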
Changelog:
- Cleaned up `SqlTypedRecord`
- Reworked `ExpressionCodeGen` to operate directly on Catalyst's
`InternalRow`
- Fixed expression-evaluator caches to include the target schema in the
key, ensuring there are no cache collisions when multiple statements run
within the same session
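The cache-key fix can be sketched as follows (a hypothetical sketch, not Hudi's actual classes: `EvaluatorCache`, `Key`, and the string-typed schema are illustrative assumptions). Keying only on the expression text collides when the same statement text targets two different schemas in one session; a composite key disambiguates:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: an evaluator cache whose key combines the
// expression text with the target schema, so identical statement text run
// against different target schemas in the same session cannot collide.
public class EvaluatorCache {
    // Composite cache key: expression + target schema (both plain strings here).
    static final class Key {
        final String expr;
        final String targetSchema;
        Key(String expr, String targetSchema) {
            this.expr = expr;
            this.targetSchema = targetSchema;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return expr.equals(k.expr) && targetSchema.equals(k.targetSchema);
        }
        @Override public int hashCode() { return Objects.hash(expr, targetSchema); }
    }

    private final Map<Key, String> cache = new HashMap<>();

    // Returns a cached "evaluator" (a placeholder string) for the pair,
    // compiling one only on the first request.
    public String getOrCompile(String expr, String targetSchema) {
        return cache.computeIfAbsent(new Key(expr, targetSchema),
            k -> "evaluator(" + k.expr + ", " + k.targetSchema + ")");
    }
}
```

With this key, `getOrCompile("s.id = t.id", "schemaA")` and `getOrCompile("s.id = t.id", "schemaB")` produce two distinct cache entries instead of the second statement silently reusing the first one's evaluator.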
### Impact
Should considerably improve performance for users of MERGE INTO.
### Risk level (write none, low, medium or high below)
Low
### Documentation Update
N/A
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.