debugmiller opened a new issue, #1372:
URL: https://github.com/apache/iceberg-go/issues/1372

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug 🐞
   
   When using v3 tables with merge-on-read mode, old DVs are not cleaned up.
   
   Iceberg allows at most one deletion vector (DV) per data file per snapshot. A second `Delete()` touching a data file that already has a DV writes a brand-new Puffin/DV file instead of merging into the existing one, and never removes the old one. The commit succeeds, but the table is now invalid: any subsequent scan fails with `can't index multiple deletion vectors for <path>`.
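
To make the invariant concrete, here is a minimal sketch of a read-side index that accepts at most one DV per referenced data file. It is not the library's `buildDVIndex`; the `deleteFile` type, the `indexDVs` function, and the file names are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteFile is a hypothetical stand-in for a manifest entry describing a DV.
type deleteFile struct {
	Path           string // location of the Puffin file holding the DV
	ReferencedFile string // data file the DV applies to
}

var errMultipleDVs = errors.New("can't index multiple deletion vectors")

// indexDVs maps each data file to its single DV and fails as soon as a
// second DV references a data file that is already indexed.
func indexDVs(dvs []deleteFile) (map[string]deleteFile, error) {
	idx := make(map[string]deleteFile, len(dvs))
	for _, dv := range dvs {
		if _, dup := idx[dv.ReferencedFile]; dup {
			return nil, fmt.Errorf("%w for %s", errMultipleDVs, dv.ReferencedFile)
		}
		idx[dv.ReferencedFile] = dv
	}
	return idx, nil
}

func main() {
	// Two live DVs for the same data file, as left behind by two deletes.
	dvs := []deleteFile{
		{Path: "dv-1.puffin", ReferencedFile: "data-1.parquet"},
		{Path: "dv-2.puffin", ReferencedFile: "data-1.parquet"},
	}
	_, err := indexDVs(dvs)
	fmt.Println(err) // can't index multiple deletion vectors for data-1.parquet
}
```

A check of this shape only detects the problem at read time; it does nothing to stop the writer from committing the second DV.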
   
   ### Minimal reproducer
   
   ```go
   func TestDoubleDeleteRegistersTwoDVsForSameFile(t *testing.T) {
        ctx := context.Background()
        // format-version=3, write.delete.mode=merge-on-read
        tbl := newV3MergeOnReadTestTable(t)
   
        arrowSchema := arrow.NewSchema([]arrow.Field{
                {Name: "id", Type: arrow.PrimitiveTypes.Int64, Nullable: false},
                {Name: "data", Type: arrow.BinaryTypes.String, Nullable: true},
        }, nil)
        data, _ := array.TableFromJSON(memory.DefaultAllocator, arrowSchema, []string{
                `[{"id":1,"data":"a"},{"id":2,"data":"b"},{"id":3,"data":"c"},{"id":4,"data":"d"},{"id":5,"data":"e"}]`,
        })
        tbl, _ = tbl.Append(ctx, array.NewTableReader(data, -1), nil)
   
        // First delete writes DV #1 against the single data file.
        tbl, _ = tbl.Delete(ctx, iceberg.EqualTo(iceberg.Reference("id"), int64(2)), nil)
   
        // Second delete SHOULD merge into DV #1. Instead it writes DV #2
        // against the same data file, and DV #1 is never removed.
        tbl, err := tbl.Delete(ctx, iceberg.EqualTo(iceberg.Reference("id"), int64(4)), nil)
        require.NoError(t, err) // commit "succeeds"; the corruption is silent here
   
        // Scanning the now-invalid table blows up:
        _, itr, _ := tbl.Scan().ToArrowRecords(ctx)
        for _, err := range itr {
                // fails: "can't index multiple deletion vectors for ..."
                require.NoError(t, err)
        }
   }
   ```
   
   `newV3MergeOnReadTestTable` is a one-line variant of [`newMergeOnReadTestTable`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/mor_delete_pruning_test.go#L75-L96), with `PropertyFormatVersion: "3"` instead of `"2"`.
   
   ### Expected vs. actual
   
   - DVs referencing the data file after 2 deletes: expected **1** (merged bitmap); actual **2** (both live)
   - Scan after 2 deletes: expected: succeeds, 3 rows remain; actual: errors with `can't index multiple deletion vectors for <path>`
   
   ### Root cause
   
   - [`Transaction.classifyFilesForFilteredDeletions`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/transaction.go#L1543-L1663) only walks data-file manifest entries ([line 1606](https://github.com/debugmiller/iceberg-go/blob/da94843/table/transaction.go#L1606) skips anything that isn't `EntryContentData`), so it never looks up a data file's existing DV.
   - [`positionDeleteRecordsToDataFilesDV`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/arrow_utils.go#L1845-L1898) always builds a fresh [`dv.NewDVWriter()`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/arrow_utils.go#L1847) with no prior bitmap.
   - The new DV is appended via [`updater.appendDeleteFile(f)`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/transaction.go#L1907) without ever calling [`updater.removeDeletionVector(...)`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/snapshot_producers.go#L608-L616) on the superseded one, even though that method already exists and is used correctly by the compaction/`ReplaceFiles` path.
   - The read side already guards against the invalid state: [`buildDVIndex`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/scanner.go#L561-L576) rejects a second DV for the same referenced file, so the failure surfaces at scan time, not commit time.
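
Taken together, the points above suggest the write path should union the existing DV's positions with the newly deleted ones, write a single merged DV, and remove the superseded file in the same commit. The sketch below illustrates that merge step only, under stated assumptions: `dvState` and `mergeDV` are hypothetical names, a sorted slice stands in for the real bitmap, and the positions assume the reproducer's five rows land in one data file in insertion order. It is not a patch against iceberg-go.

```go
package main

import (
	"fmt"
	"sort"
)

// dvState is a hypothetical, simplified view of one data file's delete state.
type dvState struct {
	Path      string   // Puffin file currently holding the DV ("" if none)
	Positions []uint64 // deleted row positions, sorted ascending
}

// mergeDV unions the existing positions with the newly deleted ones.
// It returns the merged state to be written at newPath, plus the path of
// the superseded DV that the commit should remove ("" if there was none).
func mergeDV(existing dvState, newPositions []uint64, newPath string) (dvState, string) {
	seen := make(map[uint64]struct{}, len(existing.Positions)+len(newPositions))
	for _, p := range existing.Positions {
		seen[p] = struct{}{}
	}
	for _, p := range newPositions {
		seen[p] = struct{}{}
	}
	all := make([]uint64, 0, len(seen))
	for p := range seen {
		all = append(all, p)
	}
	sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })

	return dvState{Path: newPath, Positions: all}, existing.Path
}

func main() {
	// Assumed layout: id=2 sits at row position 1, id=4 at row position 3.
	first, _ := mergeDV(dvState{}, []uint64{1}, "dv-1.puffin")
	second, superseded := mergeDV(first, []uint64{3}, "dv-2.puffin")

	fmt.Println(second.Positions, superseded) // [1 3] dv-1.puffin
	fmt.Println(5 - len(second.Positions))    // 3 rows remain
}
```

In the real code path, the returned superseded path would correspond to the file passed to `updater.removeDeletionVector(...)`, and the merged positions to the bitmap handed to the DV writer.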

