nsivabalan commented on code in PR #13865: URL: https://github.com/apache/hudi/pull/13865#discussion_r2344578337
########## rfc/rfc-101/rfc-101.md: ########## @@ -0,0 +1,84 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-101: Updates to the HoodieRecordMerger API + +## Proposers + +- @the-other-tim-brown + +## Approvers + - @<approver1 github username> + - @<approver2 github username> + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-9798 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract +The HoodieRecordMerger interface allows users to implement custom merging logic for records. The interface does not provide clear instructions on how to handle deletions and can currently lead to differences in behavior based on execution engine. +This RFC proposes updates to the interface to clarify the expectations so that out of order updates and deletes are handled properly across implementations. + +## Background +Currently, the HoodieRecordMerger interface implementations within the Hudi codebase are not consistent when it comes to how deletions are handled and can lead to differences in the results when merging records as part of the writer or reader path. Also, it is unclear from the current documentation of the interface for how the implementation should handle deleted records. + +As part of the migration to the new HoodieFileGroupReader, the merging logic has been refactored for the common use cases by creating a BufferedRecord, which wraps the data and key information about it such as the ordering values and record key, and RecordContext which provides a common interface for inspecting the underlying data. The new APIs can be used to simplify the implementation of the HoodieRecordMerger and also help users implement custom mergers that are engine agnostic. Review Comment: an we add links to HoodieRecordMerger, HoodieFileGroupReader, BufferedRecord. ########## rfc/rfc-101/rfc-101.md: ########## @@ -0,0 +1,84 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-101: Updates to the HoodieRecordMerger API + +## Proposers + +- @the-other-tim-brown + +## Approvers + - @<approver1 github username> + - @<approver2 github username> + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-9798 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract +The HoodieRecordMerger interface allows users to implement custom merging logic for records. The interface does not provide clear instructions on how to handle deletions and can currently lead to differences in behavior based on execution engine. +This RFC proposes updates to the interface to clarify the expectations so that out of order updates and deletes are handled properly across implementations. + +## Background +Currently, the HoodieRecordMerger interface implementations within the Hudi codebase are not consistent when it comes to how deletions are handled and can lead to differences in the results when merging records as part of the writer or reader path. Also, it is unclear from the current documentation of the interface for how the implementation should handle deleted records. + +As part of the migration to the new HoodieFileGroupReader, the merging logic has been refactored for the common use cases by creating a BufferedRecord, which wraps the data and key information about it such as the ordering values and record key, and RecordContext which provides a common interface for inspecting the underlying data. The new APIs can be used to simplify the implementation of the HoodieRecordMerger and also help users implement custom mergers that are engine agnostic. + +## Implementation +The HoodieRecordMerger interface will be updated to no longer rely on the HoodieRecord class, and instead use the `BufferedRecord` and `RecordContext` classes. +The `BufferedRecord` class contains the data as well as the record key, ordering value, schema identifier, and the `HoodieOperation`. + +The methods in the interface will be updated as follows: +```java +<T> BufferedRecord<T> partialMerge(BufferedRecord<T> older, BufferedRecord<T> newer, Schema readerSchema, RecordContext<T> recordContext, MergerConfig config) throws IOException +<T> BufferedRecord merge(BufferedRecord<T> older, BufferedRecord<T> newer, RecordContext<T> recordContext, MergerConfig config) throws IOException; + +--- + +/** + * MergerConfig is a configuration class used to pass all the required properties for merging records. + * The record key, partition, and ordering field configurations along with options required by the + * DeleteContext will be present in this object. Any custom options provided to the reader or writer + * will also be passed to this object to ensure custom mergers and payloads have access to the properties + * set by the user. + */ +public class MergerConfig extends HoodieConfig { + ... +} +``` +For both the `merge` and `partialMerge` methods, the implementation will mark the resulting `BufferedRecord` as a delete operation if the resulting record should result in a deletion of the row matching that record key. The implementation should also return the data in the `BufferedRecord` as non-null whenever possible to allow for the logic in future merge operations to reference the previous value of the data, even if that value results in a deletion. Review Comment: lets call out, that in addition to `data`, we also expect `orderingValue` to be set in `BufferedRecord` even for a deleted entry (which is one of the core fixes as outlined above) ########## rfc/rfc-101/rfc-101.md: ########## @@ -0,0 +1,84 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-101: Updates to the HoodieRecordMerger API + +## Proposers + +- @the-other-tim-brown + +## Approvers + - @<approver1 github username> + - @<approver2 github username> + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-9798 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract +The HoodieRecordMerger interface allows users to implement custom merging logic for records. The interface does not provide clear instructions on how to handle deletions and can currently lead to differences in behavior based on execution engine. +This RFC proposes updates to the interface to clarify the expectations so that out of order updates and deletes are handled properly across implementations. + +## Background +Currently, the HoodieRecordMerger interface implementations within the Hudi codebase are not consistent when it comes to how deletions are handled and can lead to differences in the results when merging records as part of the writer or reader path. Also, it is unclear from the current documentation of the interface for how the implementation should handle deleted records. + +As part of the migration to the new HoodieFileGroupReader, the merging logic has been refactored for the common use cases by creating a BufferedRecord, which wraps the data and key information about it such as the ordering values and record key, and RecordContext which provides a common interface for inspecting the underlying data. The new APIs can be used to simplify the implementation of the HoodieRecordMerger and also help users implement custom mergers that are engine agnostic. + +## Implementation +The HoodieRecordMerger interface will be updated to no longer rely on the HoodieRecord class, and instead use the `BufferedRecord` and `RecordContext` classes. +The `BufferedRecord` class contains the data as well as the record key, ordering value, schema identifier, and the `HoodieOperation`. + Review Comment: can we list the older interface here as well just for reference so that its easier or reader to understand the changes. for eg, is it just `HoodieRecord` is being migrated to `BufferedRecord` or is there more to the changes. For eg, I see we have removed `oldSchema` from the args. which was not apparent to be at first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
