Hi Hudi developers,

Recently I took the initiative to finish the remaining tasks in RFC-46 <https://github.com/apache/hudi/blob/master/rfc/rfc-46/rfc-46.md>. This email solicits feedback, preferences, and suggestions on API designs for record-level operations (insert, delete, update, etc.) across the different record payload types in Hudi, e.g., InternalRow for Spark and ArrayWritable for Hive, which we believe will bring higher query performance and development efficiency.
A brief context: Hudi strives to support more query engines, which brings various record payload types, e.g., InternalRow for Spark and ArrayWritable for Hive. The current payload API, HoodieRecordPayload, is designed mainly for Avro, so serialization/deserialization is required for other payload types. To support all these payloads more efficiently, Hudi needs a new API that unifies record-level operations, i.e., insert, delete, and update, and works smoothly with the existing HoodieRecordPayload interface. Please refer to RFC-46 <https://github.com/apache/hudi/blob/master/rfc/rfc-46/rfc-46.md> for more details if interested.

Tentative Designs

My high-level considerations: an operation (insert, update, delete) normally consists of three phases in Hudi: a before-merge phase, a merge phase, and an optional after-merge phase, which correspond to the preCombine, combineAndGetUpdateValue, and getInsertValue functions in the HoodieRecordPayload interface. I aim to use fewer functions, i.e., one or two, to handle all operations in the HoodieRecordMerger interface. Since the logic in the pre-merge phase and the merge phase may not be the same, the function needs a phase parameter to differentiate them. I provide two tentative designs here to achieve the goal and welcome comments.

// Design Attempt 1:
// A single function `merge` handles all operations.
// Each operation consists of three phases: precombine, combine, custom.
// Precombine normally does dedup, combine does the merge, and custom
// injects some ad hoc logic before the record is written to disk or
// returned to the client.
enum Phase {
  PRECOMBINE, COMBINE, CUSTOM
}

public interface Merger<T> {
  // Handles all operations, like insert, delete, update, including
  // custom operations.
  Option<Pair<HoodieRecord<T>, Schema>> merge(
      HoodieRecord<T> oldRecord,
      Schema oldSchema,
      HoodieRecord<T> newRecord,
      Schema newSchema,
      TypedProperties props,
      Phase phase);
}
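To make Design Attempt 1 concrete, here is a minimal, self-contained sketch of how an engine-specific merger might implement the single `merge` function with a phase switch. The `Rec` record, the latest-wins ordering policy, and `Design1Sketch` itself are stand-ins of my own for illustration; the real interface operates on HoodieRecord<T>, Schema, and TypedProperties, which are not reproduced here.

```java
import java.util.Optional;

public class Design1Sketch {
    enum Phase { PRECOMBINE, COMBINE, CUSTOM }

    // Stand-in for HoodieRecord<T>: a record key plus an ordering value.
    record Rec(String key, long orderingValue) {}

    // A latest-wins merger: in PRECOMBINE (dedup) and COMBINE (merge),
    // the record with the higher ordering value survives; CUSTOM passes
    // the new record through untouched.
    static Optional<Rec> merge(Rec oldRec, Rec newRec, Phase phase) {
        switch (phase) {
            case PRECOMBINE:
            case COMBINE:
                return Optional.of(
                    newRec.orderingValue() >= oldRec.orderingValue() ? newRec : oldRec);
            case CUSTOM:
            default:
                return Optional.of(newRec);
        }
    }

    public static void main(String[] args) {
        Rec older = new Rec("k1", 1L);
        Rec newer = new Rec("k1", 2L);
        // COMBINE keeps the record with the higher ordering value.
        System.out.println(merge(older, newer, Phase.COMBINE).get().orderingValue()); // prints 2
    }
}
```

One consequence of this design is that every caller funnels through one entry point, so a `Merger` implementation must branch on the phase internally, as the switch above does.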
// Design Alternative 2:
// Two functions, `merge` and `custom`, are provided.
// `merge` handles the precombine and combine logic, and `custom` handles
// the ad hoc logic, mainly for insert.
// We separate them since `custom` handles a single record and
// intuitively is not a merge operation.
enum Operation {
  PRECOMBINE, COMBINE
}

public interface Merger<T> {
  // Handles general operations, like insert, delete, update.
  Option<Pair<HoodieRecord<T>, Schema>> merge(
      HoodieRecord<T> oldRecord,
      Schema oldSchema,
      HoodieRecord<T> newRecord,
      Schema newSchema,
      TypedProperties props,
      Operation op);

  // Imposes the custom operation before writing to disk or returning
  // to the client.
  Option<Pair<HoodieRecord<T>, Schema>> custom(
      HoodieRecord<T> record,
      Schema schema,
      TypedProperties props);
}
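Here is a matching sketch of Design Alternative 2, again with stand-in types of my own (a `Rec` record with a tombstone flag, a latest-wins policy, and a placeholder key-normalizing `custom`), not the real Hudi classes. It also illustrates how a delete can fall out of `merge` naturally: returning an empty Option means the record is removed.

```java
import java.util.Locale;
import java.util.Optional;

public class Design2Sketch {
    enum Operation { PRECOMBINE, COMBINE }

    // Stand-in record: key, ordering value, and a delete flag (tombstone).
    record Rec(String key, long orderingValue, boolean deleted) {}

    // merge: latest-wins; if the winning record is a tombstone, return
    // Optional.empty(), i.e., the record is deleted.
    static Optional<Rec> merge(Rec oldRec, Rec newRec, Operation op) {
        Rec winner = newRec.orderingValue() >= oldRec.orderingValue() ? newRec : oldRec;
        return winner.deleted() ? Optional.empty() : Optional.of(winner);
    }

    // custom: the single-record hook applied before the record is written
    // to disk or returned to the client; normalizing the key is just a
    // placeholder for arbitrary ad hoc logic.
    static Optional<Rec> custom(Rec rec) {
        return Optional.of(
            new Rec(rec.key().toLowerCase(Locale.ROOT), rec.orderingValue(), rec.deleted()));
    }

    public static void main(String[] args) {
        Rec live = new Rec("K1", 1L, false);
        Rec tombstone = new Rec("K1", 2L, true);
        // The newer tombstone wins, so the merged result is empty (deleted).
        System.out.println(merge(live, tombstone, Operation.COMBINE).isPresent()); // prints false
    }
}
```

Splitting `custom` out keeps `merge` a true two-record operation, at the cost of a second method every implementation must provide, even if only to pass the record through.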