geserdugarov commented on PR #12795: URL: https://github.com/apache/hudi/pull/12795#issuecomment-2655363635
Hi, all! I want to discuss again: what result do we want in the end? Do we want to focus on abstraction and reusability, or on Flink performance?

A rough description of what happens in a Flink write is the following:

1. `DataStream<RowData>` is converted into something with the necessary Hudi metadata.
2. Deciding where to route records (preparing file names, etc.) to rebalance the stream between writers.
3. Actual writing into the file system.
4. Table services. (We don't care about this step, because there is no longer a stream of records here.)

Between those steps we serialize and deserialize each record, and this accounts for almost half of the total cost. I've checked a switch to a Flink `Tuple(metadata, RowData)` instead of `HoodieRecord`, and got an amazing performance increase. But when I tried `Tuple(Tuple(metadata), RowData)`, I lost about 5% of performance. So, **even such a small change, one nested `Tuple`, does matter**.

#### Option 1, reusability

In this case, we need to implement the proposed changes and convert `RowData` into the new `HoodieFlinkRecord` at _Operator 1_. But serde costs will be almost the same.

#### Option 2, performance

https://github.com/apache/hudi/pull/12796 is ready for review. It implements the switch to `HoodieFlinkInternalRow` instead of `HoodieRecord` at _Operator 1_ and _Operator 2_. `HoodieFlinkInternalRow` doesn't extend `HoodieRecord`, and contains only the necessary data. `HoodieFlinkInternalRowSerializer` is implemented for maximum performance. But at _Operator 3_, the conversion into Avro is still made.

#### Option 3, combination of both

I think the most promising roadmap is to use `HoodieFlinkInternalRow` in _Operator 1_ and _Operator 2_, and to switch to the new `HoodieFlinkRecord` proposed here.

There are two main steps in making a huge Flink performance breakthrough with Hudi:

1. Optimize serde up to the writers.
2. Optimize data structures in the writers.

Step 1 is already implemented and is waiting for review. Step 2 is proposed here, and would take some time.
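To make the serde point concrete, here is a minimal plain-Java sketch of a flat record with a hand-written serializer: routing metadata and the row payload sit at the same level, with no nested wrapper to traverse on each hop between operators. Everything in it is hypothetical and for illustration only: `FlatRow`, its fields (`recordKey`, `partitionPath`, `fileId`, `payload`) and `FlatRowSerde` are assumed names, not the actual `HoodieFlinkInternalRow` or `HoodieFlinkInternalRowSerializer` from https://github.com/apache/hudi/pull/12796, and it does not use Flink's `TypeSerializer` API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical sketch, not Hudi code: a flat record and its hand-written serde. */
class FlatRowSerde {

    /** Assumed layout: three routing metadata fields plus an opaque row payload. */
    static final class FlatRow {
        final String recordKey;
        final String partitionPath;
        final String fileId;
        final byte[] payload; // stands in for the serialized row data

        FlatRow(String recordKey, String partitionPath, String fileId, byte[] payload) {
            this.recordKey = recordKey;
            this.partitionPath = partitionPath;
            this.fileId = fileId;
            this.payload = payload;
        }
    }

    /** Writes all fields in one flat pass; no nested container is serialized. */
    static byte[] serialize(FlatRow row) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(row.recordKey);
            out.writeUTF(row.partitionPath);
            out.writeUTF(row.fileId);
            out.writeInt(row.payload.length);
            out.write(row.payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the fields back in the same order they were written. */
    static FlatRow deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String recordKey = in.readUTF();
            String partitionPath = in.readUTF();
            String fileId = in.readUTF();
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return new FlatRow(recordKey, partitionPath, fileId, payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FlatRow in = new FlatRow("key-1", "2025/02/13", "file-0", new byte[] {1, 2, 3});
        FlatRow out = deserialize(serialize(in));
        System.out.println(out.recordKey + " " + out.partitionPath + " "
            + out.fileId + " " + Arrays.toString(out.payload));
    }
}
```

This sketch only shows the shape of a flat, purpose-built serde. It says nothing about the actual numbers, which depend on Flink's serializer stack and the real record layout.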

**I ask the community to unblock work on step 1**, because I see a misunderstanding about what https://github.com/apache/hudi/pull/12796 is.

@danny0405, @yuzhaojing, @zhangyue19921010, @voonhous, @Alowator, @wombatu-kun, what do you think about it?
