Hi Daniel,

Thanks for the detailed write-up.

I can’t add much to the discussion, other than noting that we also recently ran into a related oddity: we don’t need to define a precombine field when writing data to a COW table (using Flink), but then trying to use Spark to drop partitions failed, because there’s a default precombine field name (set to “ts”), and if that field doesn’t exist then the Spark job fails.
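For reference, roughly what we were doing, as a minimal sketch (the table name, path, and schema below are made up, not our actual setup):

    -- Flink SQL: create and write a COW table, no precombine field defined
    CREATE TABLE hudi_cow (
      id   INT,
      name STRING,
      dt   STRING,
      PRIMARY KEY (id) NOT ENFORCED
    ) PARTITIONED BY (dt) WITH (
      'connector'  = 'hudi',
      'path'       = 'file:///tmp/hudi_cow',
      'table.type' = 'COPY_ON_WRITE'
    );

    INSERT INTO hudi_cow VALUES (1, 'a', '2023-03-31');

    -- Spark SQL (with the same table visible in Spark's catalog): this is
    -- where we hit the failure, since the default precombine field name
    -- ("ts") doesn't exist in the schema
    ALTER TABLE hudi_cow DROP PARTITION (dt = '2023-03-31');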
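And for context on the UPDATE vs MERGE INTO inconsistency mentioned below, I believe the two statements in question look roughly like this in Spark SQL (same made-up table, untested):

    -- Needs a precombine field configured, per point [3] below:
    UPDATE hudi_cow SET name = 'b' WHERE id = 1;

    -- Doesn't need one, even though it also just updates existing rows:
    MERGE INTO hudi_cow t
    USING (SELECT 1 AS id, 'b' AS name) s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.name = s.name;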
— Ken

> On Mar 31, 2023, at 1:20 PM, Daniel Kaźmirski <d.kazmir...@gmail.com> wrote:
> 
> Hi all,
> 
> I would like to bring up the topic of how the precombine field is used and what the purpose of it is. I would also like to know what the plans for it are in the future.
> 
> At first glance the precombine field looks like it's only used to deduplicate records in the incoming batch.
> But when digging deeper it looks like it can also be used to:
> 1. combine records not before write but on write, to decide whether to update an existing record (e.g. with DefaultHoodieRecordPayload)
> 2. combine records on read for a MoR table, to merge log and base files correctly.
> 3. a precombine field is required for Spark SQL UPDATE, even though the user can't introduce duplicates with this SQL statement anyway.
> 
> Regarding [3] there's an inconsistency, as the precombine field is not required in MERGE INTO UPDATE. Underneath, UPSERT is switched to INSERT in upsert mode to update existing records.
> 
> I know that Hudi does a lot of work to ensure PK uniqueness across/within partitions, and there is a need to deduplicate records before write, or to deduplicate existing data if duplicates were introduced, e.g. when using non-strict insert mode.
> 
> What should then happen in a situation where the user does not want to, or cannot, provide a precombine field? Then it's on the user not to introduce duplicates, but it makes Hudi more generic and easier to use for "SQL" people.
> 
> Having no precombine field is already possible for CoW, but then UPSERT and SQL UPDATE are not supported (though users can update records using INSERT in non-strict mode or MERGE INTO UPDATE).
> There's also a difference between CoW and MoR: for MoR the precombine field is a hard requirement, but it is optional for CoW.
> (UPDATEs with no precombine are also possible in Flink for both CoW and MoR, but not in Spark.)
> 
> Would it make sense to take inspiration from some DBMS systems then (e.g. Synapse) to allow updates and upserts when no precombine field is specified?
> Scenario:
> Say that duplicates were introduced with INSERT in non-strict mode and no precombine field is specified; then we have two options:
> Option 1) on UPDATE/UPSERT Hudi should deduplicate the existing records; as there's no precombine field, it's expected that we don't know which records will be removed and which will be effectively updated and preserved in the table. (This can also be achieved by always providing the same value in the precombine field for all records.)
> Option 2) on UPDATE/UPSERT Hudi should deduplicate the existing records; as there's no precombine field, the record with the latest _hoodie_commit_time is preserved and updated, and other records with the same PK are removed.
> 
> In both cases, deduplication on UPDATE/UPSERT becomes a hard rule whether we use a precombine field or not.
> 
> Then regarding MoR and merging records on read (I found this in the Hudi format spec): can it be done by only using _hoodie_commit_time in the absence of a precombine field?
> If so, could the precombine field become completely optional for both MoR and CoW?
> 
> I'm of course looking at it more from the user perspective; it would be nice to know what is and what is not possible from the design and developer perspective.
> 
> Best Regards,
> Daniel Kaźmirski

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch