[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

GitBox Thu, 04 Aug 2022 11:58:46 -0700


alexeykudinkin commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r938154747



##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java:
##########
@@ -30,9 +34,19 @@
  * It can implement the merging logic of HoodieRecord of different engines
  * and avoid the performance consumption caused by the 
serialization/deserialization of Avro payload.
  */
-public interface HoodieMerge extends Serializable {
-  
-  HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
+@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
+public interface HoodieRecordMerger extends Serializable {
+
+  /**
+   * This method converges combineAndGetUpdateValue and precombine from 
HoodiePayload.
+   * It'd be associative operation: f(a, f(b, c)) = f(f(a, b), c) (which we 
can translate as having 3 versions A, B, C
+   * of the single record, both orders of operations applications have to 
yield the same result)
+   */
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema 
schema, Properties props) throws IOException;
 
-  Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, 
HoodieRecord newer, Schema schema, Properties props) throws IOException;
+  /**
+   * The record type handled by the current merger.
+   * SPARK, AVRO, FLINK
+   */
+  HoodieRecordType getRecordType();

Review Comment:
   @wzx140 we should actually do it in a similar fashion to how `KeyGenerator` 
interface is currently implemented:
   
   - You have generic API (`KeyGenerator`) accepting Avro (this is a bare 
minimum to be implemented, is used as a fallback)
   - You have engine specific API (`SparkKeyGeneratorInterface`) which provides 
for engine-specific APIs that you should implement natively for better 
performance.
   
   So when Acme implements their own `RecordMerger` they can choose to 
implement either 
   
   1. A bare-minimum API (Avro), that would allow it to work across the engines 
but will lack performance
   2. Or fully (Avro, Spark, etc) which will be performant across engines



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

Reply via email to