[GitHub] [hudi] danny0405 commented on pull request #5436: [RFC-51] [HUDI-3478] Change Data Capture RFC

GitBox Wed, 08 Jun 2022 23:28:21 -0700


danny0405 commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1150723301


   > For 1 - we don't know during querying whether this file slice was produced
   > by create handle or merge handle. So for a create handle with N records, we
   > write a CDC log block with N entries, each with `{op=I, before=null}`.
   
   In my mind, there is no need for the reader to distinguish whether the
   parquet file comes from an INSERT (create handle) or an UPSERT (merge handle),
   because computing the diff on the fly is not heavy work, as you mentioned
   previously.
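
To illustrate what "computing the diff on the fly" could look like, here is a minimal sketch, not Hudi code. It assumes each file slice can be read as a mapping from record key to record; the function name `diff_file_slices` and the `U`/`D` op codes are hypothetical (only `op=I` with `before=null` appears in the discussion above).

```python
from typing import Dict, List, Optional

# Hypothetical in-memory view of a file slice: record key -> record fields.
Record = Dict[str, object]
Slice = Dict[str, Record]


def diff_file_slices(before: Slice, after: Slice) -> List[dict]:
    """Derive CDC entries by comparing two versions of a file slice.

    An empty `before` models a slice produced by a create handle, so every
    record comes out as {op=I, before=None}. The "U" and "D" op codes are
    assumptions made for this sketch.
    """
    entries: List[dict] = []
    # Sort keys so the output order is deterministic.
    for key in sorted(before.keys() | after.keys()):
        old: Optional[Record] = before.get(key)
        new: Optional[Record] = after.get(key)
        if old is None:
            entries.append({"op": "I", "key": key, "before": None, "after": new})
        elif new is None:
            entries.append({"op": "D", "key": key, "before": old, "after": None})
        elif old != new:
            entries.append({"op": "U", "key": key, "before": old, "after": new})
        # Unchanged records produce no CDC entry.
    return entries
```

With this shape the reader never needs to know which handle wrote the base file: a slice with no previous version simply diffs against an empty mapping.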
   
   But I agree that, to keep the design concise, a general CDC log would
   simplify the file system layout interface for downstream readers.
   To simplify further, we could generate just a MARKER file for this case
   instead of a redundant log file.
   
   
   Another big point of confusion for me is the MOR table, which has a
   different layout depending on the index.
   For example, with the Spark `BloomFilter` index, updates are always written
   through a merge handle, and only inserts are written to log files.
   
   But for the HBase index and the Flink index, we always write log files. Do
   we have a design for this layout? When and where should the CDC logs be
   generated, if we want them?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
