geserdugarov opened a new pull request, #12550: URL: https://github.com/apache/hudi/pull/12550
### Change Logs

Currently, for Flink stream write, we first convert Flink `RowData` to `HoodieRecord`, and the `HoodieRecord` is then serialized and deserialized using Kryo. These SerDe costs are high, which leads to slower `DataStream` processing.

Using Flink's internal serialization instead of Kryo could significantly decrease the processing time of each record in the stream. For instance, if we switch to `Tuple2<Tuple2<BinaryStringData, BinaryStringData>, RowData>`, where the first two strings are the record key and the partition path, then SerDe costs drop significantly. The comparison results are as follows:

| | Current (Kryo) | POC version | Optimization |
| --------------------------- | ------------------ | -------------- | -------------- |
| CPU samples, serialize | 33 900 | 10 100 | **70.2%** |
| CPU samples, deserialize | 69 400 | 10 500 | **84.5%** |
| **Data passed, GB** | **19.4** | **13.1** | **32.5%** |
| **Total time, s** | **247** | **208** | **15.6%** |

This PR proposes to start preparing a more detailed design for this kind of optimization of Flink processing, for which processing time is crucial.

### Impact

Improved Flink stream processing using Hudi.

### Risk level (write none, low medium or high below)

Not available at this phase.

### Documentation Update

Not available at this phase.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
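To make the source of the savings concrete, here is a minimal, self-contained Java sketch (not actual Hudi or Flink code; the `Record` class and both serializer methods are hypothetical stand-ins). It contrasts generic reflection-based serialization, which writes class metadata per record as Kryo's fallback path does, with a schema-aware binary encoding of a `(recordKey, partitionPath, payload)` triple, which is the kind of fixed-layout encoding Flink's internal tuple serializers use:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class SerDeSketch {

    // Hypothetical stand-in for HoodieRecord: key, partition path, payload.
    static class Record implements Serializable {
        final String key;
        final String partition;
        final byte[] payload;

        Record(String key, String partition, byte[] payload) {
            this.key = key;
            this.partition = partition;
            this.payload = payload;
        }
    }

    // Generic serialization: writes class descriptors and field names
    // alongside the data, analogous to reflection-based Kryo SerDe.
    static byte[] genericSerialize(Serializable record) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(record);
        }
        return bos.toByteArray();
    }

    // Schema-aware serialization: the layout is fixed and known up front,
    // so only length-prefixed field bytes are written, with no metadata.
    static byte[] typedSerialize(String key, String partition, byte[] payload)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(key);
        out.writeUTF(partition);
        out.writeInt(payload.length);
        out.write(payload);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "some-row-data".getBytes(StandardCharsets.UTF_8);
        Record r = new Record("key-001", "2024/12/30", payload);

        int genericSize = genericSerialize(r).length;
        int typedSize = typedSerialize(r.key, r.partition, r.payload).length;

        // The typed encoding is much smaller per record, since it carries
        // no class metadata; the gap compounds over millions of records.
        System.out.println("generic=" + genericSize + " typed=" + typedSize);
    }
}
```

The same reasoning applies to the proposed `Tuple2` layout: Flink can generate a fixed-layout serializer for tuples of known types, avoiding the per-record metadata and reflection overhead of the Kryo fallback.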
