TheR1sing3un opened a new pull request, #12949: URL: https://github.com/apache/hudi/pull/12949
## Problems

For engine-specific record merger modes (e.g. the Spark record merger), we spend unnecessary CPU time per record on operations such as schema comparison. Even with a global Avro-schema -> Spark-schema cache, the `cache.get()` operation ultimately calls Avro's `Schema::equals`, which walks every column to compare. In the append scenario our cache hits every time, yet each record still pays a full Avro schema comparison, wasting a lot of CPU time.

We tested with a real production table:

1. 1000 columns
2. 64 buckets
3. 50,000,000 records per partition
4. full schema written

> for log write:

<img width="1261" alt="image" src="https://github.com/user-attachments/assets/fc38f1c2-0d72-41d0-8288-8d8f5e16fc8d" />

Over 80% of the CPU time in the log write path is spent comparing Avro schemas.

> for snapshot read:

<img width="1234" alt="image" src="https://github.com/user-attachments/assets/47d51b62-3745-4815-9b2f-b484f762cf75" />

57% of the CPU time in the snapshot read is spent comparing Avro schemas.

> for compaction:

<img width="1217" alt="image" src="https://github.com/user-attachments/assets/bda73bdd-81d0-4367-a8b8-720a9d57fb46" />

Over 80% of the CPU time in compaction is spent comparing Avro schemas.

## Solution

Introduce JVM-level caching for Avro schemas to reduce the cost of schema comparison: on the key code paths where a schema may be created repeatedly, a cache returns a canonical reference to the schema. This guarantees that only one instance of a given schema is used during the JVM's lifetime, so the important IO paths only need to check whether two schemas are the same reference, with no call to `Schema::equals` at all.

This improves write/read/merge performance for tables with many columns, and the more columns, the larger the improvement.
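The interning approach above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the class name, method names, and the use of the schema's canonical string as the cache key are all hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of a JVM-wide schema interner: one canonical instance
// per logical schema, keyed by its canonical string form. Once every code
// path obtains schemas through intern(), hot paths can compare them with
// == instead of Avro's field-by-field Schema::equals.
public final class SchemaInterner<V> {
    private final ConcurrentMap<String, V> cache = new ConcurrentHashMap<>();

    /** Returns the canonical instance for this key, invoking the parser at most once. */
    public V intern(String canonicalForm, Function<String, V> parser) {
        return cache.computeIfAbsent(canonicalForm, parser);
    }

    public static void main(String[] args) {
        SchemaInterner<Object> interner = new SchemaInterner<>();
        // Two lookups with the same canonical form yield the same reference,
        // so schema equality degrades to a pointer comparison on the hot path.
        Object a = interner.intern("schema-v1", k -> new Object());
        Object b = interner.intern("schema-v1", k -> new Object());
        System.out.println(a == b); // true
    }
}
```

The string-keyed lookup still pays one `String` comparison per `intern()` call, but that cost is incurred once per schema-producing site rather than once per record, and every downstream per-record comparison becomes a reference check.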
## Benchmark after optimization

| code version | log write | full schema snapshot read | compaction |
| --- | --- | --- | --- |
| before optimization | 1367 | 914 | 554 |
| after optimization | 5870 | 2658 | 2388 |

<img width="1915" alt="image" src="https://github.com/user-attachments/assets/6ddecc58-9e0f-455e-9246-bbf4269f5b1d" />

### Change Logs

1. Introduce JVM-level caching for Avro schemas to reduce the cost of schema comparison.

### Impact

Improves write/read/merge performance for tables with many columns; the more columns, the larger the improvement.

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
