guanziyue edited a comment on issue #3078: URL: https://github.com/apache/hudi/issues/3078#issuecomment-866499977
Hi @tandonraghav, I did some similar work before; I hope my experience helps.

First, as nanash mentioned, `preCombine` may be called in two cases: deduplication during ingestion, and compaction.

In compaction, we first read the log file, use the schema stored in the log block to construct `GenericRecord`s, convert those generic records into payloads, and put them into a map. When we find a duplicate key (yes, the records were ingested in different commits), we call `preCombine` to merge the two records with the same key. This process is similar to a hash join in Spark. At the end, we have a map of payloads in which every key is unique. After that, we read records from the parquet file, use the schema the user provided in the config to construct `IndexedRecord`s, and call `combineAndGetUpdateValue` to merge the payloads in the map with the data from parquet.

As you mentioned, the schema may not be available inside `preCombine`. Could you hold a reference to the schema of the `GenericRecord` when the payload is constructed, as an attribute of the `MongoHudiCDCPayload` class? Then you can use that schema in the `preCombine` method. You may find that `Schema` in Avro 1.8.2 is not serializable, so marking this attribute as `transient` may be a good idea. However, this can cause the schema to be lost during ingestion, since payloads are shuffled there. In that case you can recreate the schema from the `properties` argument of `preCombine` during ingestion; those props are actually the Hoodie write config. Note that you may not always be able to get the schema from the config, so trying this only when the cached schema is null may be a good idea.
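To make the idea concrete, here is a minimal sketch of the pattern in plain Java, without Hudi or Avro dependencies: cache the schema on the payload, mark it `transient` so serialization does not break, and fall back to the schema carried in the write-config properties when the cached copy was lost in a shuffle. The class and field names, and the property key `hoodie.avro.schema`, are illustrative assumptions here, not a definitive Hudi API.

```java
import java.util.Properties;

// Sketch of a payload that caches its writer schema but can recover it
// from the write config when the cached copy is gone.
class SchemaAwarePayload {
    // Avro 1.8.2's Schema is not serializable, so the cached copy is
    // transient; it will be null after the payload is shuffled.
    private transient String schema;

    SchemaAwarePayload(String schemaFromRecord) {
        this.schema = schemaFromRecord;
    }

    // Called from a preCombine-style merge: ensure a schema is available.
    String resolveSchema(Properties props) {
        if (schema == null) {
            // Recreate the schema from the write config carried in props
            // (assumed property key; check your actual write config).
            schema = props.getProperty("hoodie.avro.schema");
        }
        return schema;
    }
}

public class Demo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hoodie.avro.schema", "{\"type\":\"record\"}");

        // Case 1: the cached schema survived (e.g. the compaction path).
        SchemaAwarePayload kept = new SchemaAwarePayload("{\"type\":\"record\"}");
        System.out.println(kept.resolveSchema(props));

        // Case 2: the transient field was lost after a shuffle (ingestion
        // path); the schema is recreated from the config.
        SchemaAwarePayload lost = new SchemaAwarePayload(null);
        System.out.println(lost.resolveSchema(props));
    }
}
```

In a real payload you would parse the JSON back into an Avro `Schema` with `new Schema.Parser().parse(...)` at the point of use; the null check above is the "try this when schema is null" guard mentioned above.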