Hi Daniel,

Thanks for the detailed write-up.

I can’t add much to the discussion, other than noting that we also recently
ran into a related oddity: we don’t need to define a precombine field when
writing data to a CoW table (using Flink), but then trying to use Spark to
drop partitions failed, because Spark assumes a default precombine field name
(“ts”), and if that field doesn’t exist in the schema the Spark job fails.
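
Roughly the shape of what we hit (table name, path, and columns below are
made-up placeholders, from memory):

-- Flink SQL: create and write a CoW table without any precombine field
CREATE TABLE events (
  id STRING,
  payload STRING,
  dt STRING
) PARTITIONED BY (dt)
WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/events',
  'table.type' = 'COPY_ON_WRITE',
  'hoodie.datasource.write.recordkey.field' = 'id'
  -- no precombine field configured; the Flink write is fine with this
);

-- Spark SQL: dropping a partition on the same table then fails, because
-- Spark assumes the default precombine field name "ts", which isn't in
-- the schema
ALTER TABLE events DROP PARTITION (dt = '2023-03-01');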

— Ken


> On Mar 31, 2023, at 1:20 PM, Daniel Kaźmirski <d.kazmir...@gmail.com> wrote:
> 
> Hi all,
> 
> I would like to bring up the topic of how precombine field is used and
> what's the purpose of it. I would also like to know what are the plans for
> it in the future.
> 
> At first glance the precombine field looks like it's only used to deduplicate
> records in the incoming batch.
> But when digging deeper it looks like it is (or can be) also used to:
> 1. combine records not before the write but at write time, to decide whether
> to update an existing record (e.g. with DefaultHoodieRecordPayload; see the
> sketch after this list)
> 2. combine records on read for MoR tables, to merge log and base files
> correctly
> 3. satisfy Spark SQL UPDATE, which requires a precombine field even though
> the user can't introduce duplicates with this statement anyway.
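> 
> A minimal Spark SQL sketch of [1] (table and column names are made up, and
> the payload class config is how I understand it's wired up, so treat it as
> an assumption):
> 
> CREATE TABLE orders (
>   id INT,
>   amount DOUBLE,
>   ts BIGINT
> ) USING hudi
> TBLPROPERTIES (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   -- assumed: DefaultHoodieRecordPayload compares the incoming and stored
>   -- values of ts at write time and keeps the record with the larger one
>   'hoodie.datasource.write.payload.class' =
>     'org.apache.hudi.common.model.DefaultHoodieRecordPayload'
> );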
> 
> Regarding [3], there's an inconsistency, as the precombine field is not
> required in MERGE INTO UPDATE. Underneath, the UPSERT is switched to an
> INSERT in upsert mode to update existing records.
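> 
> To make [3] concrete (hypothetical table created without a precombine
> field), the first statement is rejected while the second works:
> 
> -- rejected: Spark SQL UPDATE requires a precombine field on the table
> UPDATE users SET name = 'bob' WHERE id = 1;
> 
> -- works: updates the matching record without needing a precombine field
> MERGE INTO users t
> USING (SELECT 1 AS id, 'bob' AS name) s
> ON t.id = s.id
> WHEN MATCHED THEN UPDATE SET name = s.name;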
> 
> I know that Hudi does a lot of work to ensure PK uniqueness across/within
> partitions, and that there is a need to deduplicate records before a write,
> or to deduplicate existing data if duplicates were introduced, e.g. when
> using non-strict insert mode.
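> 
> E.g. the kind of flow I mean (illustrative names; hoodie.sql.insert.mode is
> how I understand non-strict inserts are enabled, so treat that as an
> assumption):
> 
> SET hoodie.sql.insert.mode = non-strict;
> 
> -- both rows are accepted, so the table now holds two rows with id = 1
> INSERT INTO users VALUES (1, 'alice');
> INSERT INTO users VALUES (1, 'bob');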
> 
> What should then happen in a situation where the user does not want to, or
> cannot, provide a precombine field? Then it's on the user not to introduce
> duplicates, but it makes Hudi more generic and easier to use for "SQL"
> people.
> 
> Writing with no precombine field is already possible for CoW, but then
> UPSERT and SQL UPDATE are not supported (though users can still update
> records using INSERT in non-strict mode or MERGE INTO UPDATE).
> There's also a difference between CoW and MoR: for MoR the precombine field
> is a hard requirement, while for CoW it is optional.
> (UPDATEs with no precombine field are also possible in Flink for both CoW
> and MoR, but not in Spark.)
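> 
> For example (made-up table names), the MoR table below needs preCombineField
> set, while the CoW table can omit it entirely:
> 
> -- MoR: the precombine field is currently a hard requirement
> CREATE TABLE users_mor (id INT, name STRING, ts BIGINT)
> USING hudi
> TBLPROPERTIES (type = 'mor', primaryKey = 'id', preCombineField = 'ts');
> 
> -- CoW: the same table can be created with no precombine field at all
> CREATE TABLE users_cow (id INT, name STRING)
> USING hudi
> TBLPROPERTIES (type = 'cow', primaryKey = 'id');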
> 
> Would it make sense to take inspiration from some DBMS systems (e.g.
> Synapse) and allow updates and upserts when no precombine field is specified?
> Scenario:
> Say that duplicates were introduced with INSERT in non-strict mode and no
> precombine field is specified; then we have two options:
> option 1) on UPDATE/UPSERT Hudi should deduplicate the existing records; as
> there's no precombine field, it's expected that we don't know which records
> will be removed and which will be effectively updated and preserved in the
> table. (This can also be achieved by always providing the same value in the
> precombine field for all records.)
> option 2) on UPDATE/UPSERT Hudi should deduplicate the existing records; as
> there's no precombine field, the record with the latest _hoodie_commit_time
> is preserved and updated, and the other records with the same PK are removed.
> 
> In both cases, deduplication on UPDATE/UPSERT becomes a hard rule,
> whether we use a precombine field or not.
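> 
> In other words (continuing the non-strict INSERT example from above, with
> two rows for id = 1 already in the table; the comments describe the proposed
> outcomes, not current behaviour):
> 
> UPDATE users SET name = 'carol' WHERE id = 1;
> -- option 1: an arbitrary one of the duplicate rows survives and is updated
> -- option 2: the row with the latest _hoodie_commit_time survives and is
> --           updated; the other rows with the same PK are removed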
> 
> Then, regarding MoR and merging records on read (I found this in the Hudi
> format spec): can it be done using only _hoodie_commit_time in the absence
> of a precombine field?
> If so, could the precombine field become completely optional for both MoR
> and CoW?
> 
> I'm of course looking at it more from the user perspective; it would be
> nice to know what is and what is not possible from the design and developer
> perspective.
> 
> Best Regards,
> Daniel Kaźmirski

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch


