geserdugarov opened a new pull request, #12550: URL: https://github.com/apache/hudi/pull/12550
### Change Logs

Currently, for Flink stream write, we first convert Flink `RowData` to `HoodieRecord`, and the `HoodieRecord` is then serialized and deserialized using Kryo. These SerDe costs are high, which leads to slower `DataStream` processing.

Using Flink's internal serialization instead of Kryo could significantly decrease the processing time of each record in the stream. For instance, if we switch to `Tuple2<Tuple2<BinaryStringData, BinaryStringData>, RowData>`, where the first two strings are the record key and the partition path, then SerDe costs drop significantly. The comparison results are as follows:

| | Current (Kryo) | POC version | Optimization |
| --------------------------- | ------------------ | -------------- | -------------- |
| CPU samples, serialize | 33 900 | 10 100 | **70.2%** |
| CPU samples, deserialize | 69 400 | 10 500 | **84.5%** |
| **Data passed, GB** | **19.4** | **13.1** | **32.5%** |
| **Total time, s** | **247** | **208** | **15.6%** |

This PR proposes to start preparing a more detailed design for this kind of optimization of Flink processing, for which processing time is crucial.

### Impact

Improved Flink stream processing using Hudi.

### Risk level (write none, low medium or high below)

Not available at this phase.

### Documentation Update

Not available at this phase.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
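To make the source of the savings concrete, here is a minimal, self-contained Java sketch (not actual Hudi or Flink code; the `Record` class and both serializer methods are hypothetical stand-ins). It contrasts generic reflection-based serialization, which writes class metadata per record as Kryo's fallback path does, with a schema-aware binary encoding of a `(recordKey, partitionPath, payload)` triple, which is the kind of fixed-layout encoding Flink's internal tuple serializers use:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class SerDeSketch {

    // Hypothetical stand-in for HoodieRecord: key, partition path, payload.
    static class Record implements Serializable {
        final String key;
        final String partition;
        final byte[] payload;

        Record(String key, String partition, byte[] payload) {
            this.key = key;
            this.partition = partition;
            this.payload = payload;
        }
    }

    // Generic serialization: writes class descriptors and field names
    // alongside the data, analogous to reflection-based Kryo SerDe.
    static byte[] genericSerialize(Serializable record) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(record);
        }
        return bos.toByteArray();
    }

    // Schema-aware serialization: the layout is fixed and known up front,
    // so only length-prefixed field bytes are written, with no metadata.
    static byte[] typedSerialize(String key, String partition, byte[] payload)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(key);
        out.writeUTF(partition);
        out.writeInt(payload.length);
        out.write(payload);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "some-row-data".getBytes(StandardCharsets.UTF_8);
        Record r = new Record("key-001", "2024/12/30", payload);

        int genericSize = genericSerialize(r).length;
        int typedSize = typedSerialize(r.key, r.partition, r.payload).length;

        // The typed encoding is much smaller per record, since it carries
        // no class metadata; the gap compounds over millions of records.
        System.out.println("generic=" + genericSize + " typed=" + typedSize);
    }
}
```

The same reasoning applies to the proposed `Tuple2` layout: Flink can generate a fixed-layout serializer for tuples of known types, avoiding the per-record metadata and reflection overhead of the Kryo fallback.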
