Re: [PR] [HUDI-8920] Optimized SerDe costs of Flink write (excluding bulk insert and append mode) [hudi]

via GitHub Sun, 16 Feb 2025 22:30:56 -0800


geserdugarov commented on code in PR #12796:
URL: https://github.com/apache/hudi/pull/12796#discussion_r1957589955



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java:
##########
@@ -81,8 +79,12 @@ public class BucketAssignFunction<K, I, O extends 
HoodieRecord<?>>
    *   <li>If it does, tag the record with the location</li>
    *   <li>If it does not, use the {@link BucketAssigner} to generate a new 
bucket ID</li>
    * </ul>
+   *
+   * ValueState here is structured as Tuple2(partition, fileId).
+   * We use Flink Tuple2 because this state will be serialized/deserialized.
+   * Otherwise, for any chosen data structure we should implement custom 
serializer.
    */
-  private ValueState<HoodieRecordGlobalLocation> indexState;
+  private ValueState<Tuple2<StringData, StringData>> indexState;

Review Comment:
   My bad, this place is actually doesn't require optimization, because at this 
place serde will be executed only during checkpoints, which is negligible in 
comparison with serde costs of each stream record.
   
   I've rollbacked previous implementation of `indexState`, which contains 
`HoodieRecordGlobalLocation`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-8920] Optimized SerDe costs of Flink write (excluding bulk insert and append mode) [hudi]

Reply via email to