danny0405 commented on code in PR #13865: URL: https://github.com/apache/hudi/pull/13865#discussion_r2331767014
########## rfc/rfc-101/rfc-101.md: ########## @@ -0,0 +1,71 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-101: Updates to the HoodieRecordMerger API + +## Proposers + +- @the-other-tim-brown + +## Approvers + - @<approver1 github username> + - @<approver2 github username> + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-9798 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract +The HoodieRecordMerger interface allows users to implement custom merging logic for records. The interface does not provide clear instructions on how to handle deletions and can currently lead to differences in behavior based on execution engine. +This RFC proposes updates to the interface to clarify the expectations so that out of order updates and deletes are handled properly across implementations. + +## Background +Currently, the HoodieRecordMerger interface implementations within the Hudi codebase are not consistent when it comes to how deletions are handled and can lead to differences in the results when merging records as part of the writer or reader path. Also, it is unclear from the current documentation of the interface for how the implementation should handle deleted records. + +As part of the migration to the new HoodieFileGroupReader, the merging logic has been refactored for the common use cases by creating a BufferedRecord, which wraps the data and key information about it such as the ordering values and record key, and RecordContext which provides a common interface for inspecting the underlying data. The new APIs can be used to simplify the implementation of the HoodieRecordMerger and also help users implement custom mergers that are engine agnostic. + +## Implementation +The HoodieRecordMerger interface will be updated to no longer rely on the HoodieRecord class, and instead use the `BufferedRecord` and `RecordContext` classes. +The `BufferedRecord` class contains the data as well as the record key, ordering value, schema identifier, and the `HoodieOperation`. + +The methods in the interface will be updated as follows: +```java +<T> BufferedRecord<T> partialMerge(BufferedRecord<T> older, BufferedRecord<T> newer, Schema readerSchema, RecordContext<T> recordContext, TypedProperties props) throws IOException +<T> BufferedRecord merge(BufferedRecord<T> older, BufferedRecord<T> newer, RecordContext<T> recordContext, TypedProperties props) throws IOException; Review Comment: If we have a abstraction here like `MergeConfig` to include all the info we need for merging, that would be great, but still we need to cover the customized props scenarios. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
