nsivabalan commented on code in PR #13865:
URL: https://github.com/apache/hudi/pull/13865#discussion_r2344578337


##########
rfc/rfc-101/rfc-101.md:
##########
@@ -0,0 +1,84 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-101: Updates to the HoodieRecordMerger API
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+ - @<approver1 github username>
+ - @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-9798
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+The HoodieRecordMerger interface allows users to implement custom merging 
logic for records. The interface does not provide clear instructions on how to 
handle deletions and can currently lead to differences in behavior based on 
execution engine.
+This RFC proposes updates to the interface to clarify the expectations so that 
out of order updates and deletes are handled properly across implementations.
+
+## Background
+Currently, the HoodieRecordMerger interface implementations within the Hudi 
codebase are not consistent when it comes to how deletions are handled and can 
lead to differences in the results when merging records as part of the writer 
or reader path. Also, it is unclear from the current documentation of the 
interface for how the implementation should handle deleted records.
+
+As part of the migration to the new HoodieFileGroupReader, the merging logic 
has been refactored for the common use cases by creating a BufferedRecord, 
which wraps the data and key information about it such as the ordering values 
and record key, and RecordContext which provides a common interface for 
inspecting the underlying data. The new APIs can be used to simplify the 
implementation of the HoodieRecordMerger and also help users implement custom 
mergers that are engine agnostic.

Review Comment:
   an we add links to HoodieRecordMerger, HoodieFileGroupReader, 
BufferedRecord. 



##########
rfc/rfc-101/rfc-101.md:
##########
@@ -0,0 +1,84 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-101: Updates to the HoodieRecordMerger API
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+ - @<approver1 github username>
+ - @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-9798
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+The HoodieRecordMerger interface allows users to implement custom merging 
logic for records. The interface does not provide clear instructions on how to 
handle deletions and can currently lead to differences in behavior based on 
execution engine.
+This RFC proposes updates to the interface to clarify the expectations so that 
out of order updates and deletes are handled properly across implementations.
+
+## Background
+Currently, the HoodieRecordMerger interface implementations within the Hudi 
codebase are not consistent when it comes to how deletions are handled and can 
lead to differences in the results when merging records as part of the writer 
or reader path. Also, it is unclear from the current documentation of the 
interface for how the implementation should handle deleted records.
+
+As part of the migration to the new HoodieFileGroupReader, the merging logic 
has been refactored for the common use cases by creating a BufferedRecord, 
which wraps the data and key information about it such as the ordering values 
and record key, and RecordContext which provides a common interface for 
inspecting the underlying data. The new APIs can be used to simplify the 
implementation of the HoodieRecordMerger and also help users implement custom 
mergers that are engine agnostic.
+
+## Implementation
+The HoodieRecordMerger interface will be updated to no longer rely on the 
HoodieRecord class, and instead use the `BufferedRecord` and `RecordContext` 
classes. 
+The `BufferedRecord` class contains the data as well as the record key, 
ordering value, schema identifier, and the `HoodieOperation`.
+
+The methods in the interface will be updated as follows:
+```java
+<T> BufferedRecord<T> partialMerge(BufferedRecord<T> older, BufferedRecord<T> 
newer, Schema readerSchema, RecordContext<T> recordContext, MergerConfig 
config) throws IOException
+<T> BufferedRecord merge(BufferedRecord<T> older, BufferedRecord<T> newer, 
RecordContext<T> recordContext, MergerConfig config) throws IOException;
+
+---
+
+/**
+ * MergerConfig is a configuration class used to pass all the required 
properties for merging records. 
+ * The record key, partition, and ordering field configurations along with 
options required by the 
+ * DeleteContext will be present in this object. Any custom options provided 
to the reader or writer 
+ * will also be passed to this object to ensure custom mergers and payloads 
have access to the properties 
+ * set by the user.
+ */
+public class MergerConfig extends HoodieConfig {
+   ...
+}
+```
+For both the `merge` and `partialMerge` methods, the implementation will mark 
the resulting `BufferedRecord` as a delete operation if the resulting record 
should result in a deletion of the row matching that record key. The 
implementation should also return the data in the `BufferedRecord` as non-null 
whenever possible to allow for the logic in future merge operations to 
reference the previous value of the data, even if that value results in a 
deletion. 

Review Comment:
   lets call out, that in addition to `data`, we also expect `orderingValue` to 
be set in `BufferedRecord` even for a deleted entry (which is one of the core 
fixes as outlined above) 



##########
rfc/rfc-101/rfc-101.md:
##########
@@ -0,0 +1,84 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-101: Updates to the HoodieRecordMerger API
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+ - @<approver1 github username>
+ - @<approver2 github username>
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-9798
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+The HoodieRecordMerger interface allows users to implement custom merging 
logic for records. The interface does not provide clear instructions on how to 
handle deletions and can currently lead to differences in behavior based on 
execution engine.
+This RFC proposes updates to the interface to clarify the expectations so that 
out of order updates and deletes are handled properly across implementations.
+
+## Background
+Currently, the HoodieRecordMerger interface implementations within the Hudi 
codebase are not consistent when it comes to how deletions are handled and can 
lead to differences in the results when merging records as part of the writer 
or reader path. Also, it is unclear from the current documentation of the 
interface for how the implementation should handle deleted records.
+
+As part of the migration to the new HoodieFileGroupReader, the merging logic 
has been refactored for the common use cases by creating a BufferedRecord, 
which wraps the data and key information about it such as the ordering values 
and record key, and RecordContext which provides a common interface for 
inspecting the underlying data. The new APIs can be used to simplify the 
implementation of the HoodieRecordMerger and also help users implement custom 
mergers that are engine agnostic.
+
+## Implementation
+The HoodieRecordMerger interface will be updated to no longer rely on the 
HoodieRecord class, and instead use the `BufferedRecord` and `RecordContext` 
classes. 
+The `BufferedRecord` class contains the data as well as the record key, 
ordering value, schema identifier, and the `HoodieOperation`.
+

Review Comment:
   can we list the older interface here as well just for reference so that its 
easier or reader to understand the changes. for eg, is it just `HoodieRecord` 
is being migrated to `BufferedRecord` or is there more to the changes. 
   
   For eg, I see we have removed `oldSchema` from the args. which was not 
apparent to be at first. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to