prashantwason opened a new issue, #18976: URL: https://github.com/apache/hudi/issues/18976
## Problem statement

Every Hudi write produces commit metadata that records per-file and per-partition write statistics: `numInserts`, `numUpdates`, `numWrites`, `numDeletes`, and related counters. These stats are the primary source of truth that operators, pipelines, and reconciliation tooling use to answer the question: *"How many records did my write actually produce?"*

However, when **deduplication** (`hoodie.combine.before.insert`) or **precombine** (during upsert) is enabled, multiple input records that share the same record key are collapsed into a single output record before anything is written. The commit metadata reports only the **final written count**; it does not report how many input records were collapsed along the way, or *why* the count shrank.

This creates an **observability gap**: a discrepancy between input record count and written record count cannot be attributed to a cause.

### Concrete example

Suppose an input RDD/Dataset contains 5 records that all share the same record key:

```
key=A, ts=1
key=A, ts=2
key=A, ts=3
key=A, ts=4
key=A, ts=5
```

With dedup/precombine enabled, Hudi keeps one record (say `ts=5`) and writes it. The commit metadata reports:

```
numInserts = 1
```

From this number alone, an operator **cannot tell the difference** between two very different scenarios:

1. **Expected behavior:** 4 records were legitimate duplicates, correctly collapsed by precombine. Data is fully intact. :white_check_mark:
2. **A bug / data loss:** records were silently dropped somewhere in the pipeline (a partitioning bug, a faulty merge, an index issue, etc.), and the "4 missing" records were *not* actually duplicates. :x:

Both scenarios look identical in commit metadata: `5 in -> 1 out`. There is no field that says "4 of these were dropped as duplicates."

### Why this matters

- **Data integrity / auditing:** Pipelines that reconcile source-vs-sink counts hit a dead end. A drop from 5 to 1 is unexplained, so it can neither be safely signed off as correct nor flagged as a real loss.
- **Debugging:** When a genuine data-loss bug occurs, there is no metadata signal distinguishing it from normal dedup behavior, making root-cause analysis much harder.
- **Trust:** Without dedup attribution, every count discrepancy requires manual, expensive investigation.

### Scope

This applies to **both** write paths:

- **Insert dedup**: duplicates dropped before insert when combine-before-insert is on.
- **Upsert precombine**: multiple incoming records for the same key combined down to one (and combined against the existing record on disk).

## Proposed solution

Extend Hudi commit metadata (`HoodieWriteStat` and the aggregated commit-level stats) with additional counters that make dedup/precombine explicit, for example:

- `numDuplicates` / `numRecordsDeduplicated`: input records dropped because they shared a key with another input record.
- `numPrecombined`: records eliminated by the precombine step specifically.

With these stats, the invariant becomes verifiable:

```
numInputRecords == numWrites + numDeletes + numDuplicates (+ numErrors)
```

When this equation balances, a count drop is provably explained by deduplication. When it does **not** balance, the gap points at a real bug, turning a silent ambiguity into an actionable signal.
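To illustrate how a reconciliation job might use the proposed counters, here is a minimal Python sketch that models the five-record example from this issue. It is not Hudi code: `precombine` and `balances` are hypothetical helpers, the snake_case counter names only mirror the proposed `numDuplicates`-style fields, and "keep the largest `ts`" is assumed as the precombine rule.

```python
def precombine(records):
    """Collapse (key, ts) pairs to one record per key, keeping the largest ts.

    Returns the surviving records and the number of input records that were
    dropped because another input record shared their key (a stand-in for
    the proposed numDuplicates counter).
    """
    latest = {}
    num_duplicates = 0
    for key, ts in records:
        if key in latest:
            # Two records share this key, so exactly one of them is dropped.
            num_duplicates += 1
            latest[key] = max(latest[key], ts)
        else:
            latest[key] = ts
    return latest, num_duplicates


def balances(num_input_records, num_writes, num_deletes, num_duplicates,
             num_errors=0):
    """Check the invariant proposed in this issue."""
    return num_input_records == (
        num_writes + num_deletes + num_duplicates + num_errors
    )


# The example from the issue: five input records, all with key A.
records = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("A", 5)]
written, num_duplicates = precombine(records)

# Scenario 1: the drop from 5 to 1 is fully explained by deduplication.
explained = balances(len(records), len(written), 0, num_duplicates)

# Scenario 2 (simulated): only 2 drops are attributed to dedup, so the
# remaining gap would point at records lost somewhere else.
unexplained = not balances(len(records), len(written), 0, 2)
```

In this sketch the duplicate count is accumulated at the point where records are dropped, not derived as input minus output. Deriving it from the difference would make the invariant balance trivially and hide exactly the losses it is meant to expose.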
