guanziyue commented on issue #3078:
URL: https://github.com/apache/hudi/issues/3078#issuecomment-866499977


   Hi @tandonraghav,
   I did some similar work before; I hope my experience can help you.
   First, as nanash mentioned before, the precombine method may be called in two cases: the first is dedup during ingestion, the second is during compaction.
   In the compaction process, we first read the log file using the schema stored in the log block to construct a GenericRecord, then turn each GenericRecord into a payload and put the payloads into a map. When we find a duplicate key (yes, the records were ingested in different commits), we call precombine to combine all records with the same key. This process is similar to a hash join in Spark. Finally, we get a map of payloads in which every key is unique. After that, we read records from the parquet file, use the schema the user provided in the config to construct an IndexedRecord, and call combineAndGetUpdateValue to merge the payloads in the map with the data from parquet.
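   A minimal sketch of the log-merge step described above, with a simplified stand-in `Payload` type instead of Hudi's real `HoodieRecordPayload` interface (all names and the "keep the higher ordering value" rule are illustrative assumptions, roughly mirroring `OverwriteWithLatestAvroPayload` semantics):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a Hudi payload: a record key, an ordering
// (precombine) field, and a value. A real payload would implement
// org.apache.hudi.common.model.HoodieRecordPayload instead.
public class CompactionMergeSketch {
    record Payload(String key, long ordering, String value) {
        // preCombine keeps the record with the higher ordering value.
        Payload preCombine(Payload other) {
            return this.ordering >= other.ordering ? this : other;
        }
    }

    // Build the map of unique-keyed payloads from log records,
    // calling preCombine whenever a duplicate key shows up.
    static Map<String, Payload> mergeLogRecords(List<Payload> logRecords) {
        Map<String, Payload> merged = new HashMap<>();
        for (Payload p : logRecords) {
            // Map.merge applies preCombine(existing, incoming) on key clash.
            merged.merge(p.key(), p, Payload::preCombine);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Payload> log = List.of(
            new Payload("k1", 1, "old"),
            new Payload("k1", 2, "new"),   // duplicate key from a later commit
            new Payload("k2", 5, "only"));
        Map<String, Payload> merged = mergeLogRecords(log);
        System.out.println(merged.get("k1").value()); // prints "new"
        System.out.println(merged.size());            // prints 2
    }
}
```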
   As you mentioned, the schema may not be available in precombine. Could you hold a reference to the GenericRecord's schema, captured when the payload is constructed, as an attribute of your MongoHudiCDCPayload class? Then you can use that schema inside the precombine method.
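   A sketch of that suggestion, using plain JDK types (the class layout and field names are illustrative; the real MongoHudiCDCPayload would extend a Hudi payload base class, and since Avro's `Schema` is not `Serializable`, the usual pattern is to keep `genericRecord.getSchema().toString()` and re-parse it with `new Schema.Parser().parse(...)` when needed):

```java
// Sketch of the suggested fix: capture the record's schema at construction
// time so it is still on hand later inside preCombine.
public class SchemaAwarePayload {
    private final byte[] recordBytes;  // serialized record, as Hudi payloads keep it
    private final String schemaJson;   // schema JSON captured at construction time

    public SchemaAwarePayload(byte[] recordBytes, String schemaJson) {
        this.recordBytes = recordBytes;
        this.schemaJson = schemaJson;
    }

    public String getSchemaJson() {
        return schemaJson;
    }

    // preCombine now has the schema available; a real implementation would
    // parse schemaJson, deserialize recordBytes for both payloads, and compare
    // their ordering fields. Returning `this` is only a placeholder rule.
    public SchemaAwarePayload preCombine(SchemaAwarePayload other) {
        return this;
    }
}
```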


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
