danny0405 commented on PR #5436:
URL: https://github.com/apache/hudi/pull/5436#issuecomment-1150723301
> For 1 - we don't know during querying whether this file slice was produced
> by create handle or merge handle. So for a create handle with N records, we
> write a CDC log block with N entries, each with `{op=I, before=null}`.

In my mind, there is no need for the reader to distinguish whether the
parquet file comes from an INSERT (create handle) or an UPSERT (merge handle),
because computing the diff on the fly is not heavy work like you previously
mentioned.
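
To make "computing the diff on the fly" concrete, here is a rough sketch in plain Python. It is not Hudi code: `diff_file_slices` and the dict-based record layout are names I made up for illustration, and it assumes both versions of the file slice fit in memory, keyed by record key.

```python
from typing import Dict, Iterator, Optional

Record = Dict[str, object]


def diff_file_slices(
    before: Optional[Dict[str, Record]],
    after: Dict[str, Record],
) -> Iterator[dict]:
    """Yield CDC entries by comparing two snapshots keyed by record key.

    `before` is None when there is no previous base file, i.e. the slice
    was produced by a create handle (hypothetical convention for this sketch).
    """
    previous = before or {}
    # Inserts and updates: walk the new snapshot.
    for key, new in after.items():
        if key not in previous:
            yield {"op": "I", "key": key, "before": None, "after": new}
        elif previous[key] != new:
            yield {"op": "U", "key": key, "before": previous[key], "after": new}
    # Deletes: keys that existed before but are gone now.
    for key, old in previous.items():
        if key not in after:
            yield {"op": "D", "key": key, "before": old, "after": None}
```

With `before=None`, every record comes out as `{op=I, before=null}`, which is the create-handle case from the quote; the reader does not need to know which handle wrote the file.
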
But I agree that, to keep the design concise, a general CDC log would simplify
the file system layout interface for downstream readers.
To simplify further, we may just generate a MARKER file for it instead of a
redundant log file.
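
A rough sketch of how a reader might handle such a marker, again in plain Python rather than Hudi code. The marker name `INSERT_ONLY_MARKER`, the function `read_cdc`, and the way the file listing is passed in are all assumptions for illustration; nothing here is an existing layout.

```python
from typing import Dict, Iterable, Iterator, Set

# Hypothetical marker file name; the actual name and location are undecided.
INSERT_ONLY_MARKER = "insert-only.marker"


def read_cdc(
    files: Set[str],
    base_records: Dict[str, dict],
    cdc_log_entries: Iterable[dict],
) -> Iterator[dict]:
    """Return CDC entries for one file slice.

    If the marker is present, every record in the base file is treated as
    an insert and no CDC log is read. Otherwise the CDC log is returned as-is.
    """
    if INSERT_ONLY_MARKER in files:
        for key, record in base_records.items():
            yield {"op": "I", "key": key, "before": None, "after": record}
    else:
        yield from cdc_log_entries
```

The trade-off is that the reader needs one extra branch, but the writer avoids storing N redundant `{op=I, before=null}` entries.
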
Another big point of confusion for me is the MOR table: it has a different
layout depending on the index.
For example, with the Spark `BloomFilter` index, updates are always written
through a merge handle, and only inserts are written to log files.
But for the HBase index and the Flink index, we always write log files. Do we
have a design for this layout? When and where should we generate the CDC logs
if we want them?