debugmiller opened a new issue, #1372:
URL: https://github.com/apache/iceberg-go/issues/1372
### Apache Iceberg version
main (development)
### Please describe the bug 🐞
When using v3 tables with merge-on-read mode, old DVs are not cleaned up.
Iceberg allows at most one deletion vector (DV) per data file per
snapshot. A second `Delete()` touching a data file that already has a DV writes
a brand-new Puffin/DV file instead of merging into the existing one, and never
removes the old one. The commit succeeds, but the table is now invalid: any
subsequent scan fails with `can't index multiple deletion vectors for <path>`.
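
The invariant above could also be checked at commit time rather than only at scan time. The following is a minimal, self-contained sketch of such a guard; `dvRef` and `checkOneDVPerDataFile` are hypothetical names invented for illustration and are not part of iceberg-go.

```go
package main

import "fmt"

// dvRef is a hypothetical, simplified view of a DV manifest entry:
// the Puffin file holding the DV and the data file it applies to.
type dvRef struct {
	puffinPath     string
	referencedFile string
}

// checkOneDVPerDataFile returns an error as soon as two DVs reference
// the same data file, which is the state the double delete produces.
func checkOneDVPerDataFile(dvs []dvRef) error {
	owner := make(map[string]string, len(dvs))
	for _, dv := range dvs {
		if prev, ok := owner[dv.referencedFile]; ok {
			return fmt.Errorf("data file %s has multiple deletion vectors: %s and %s",
				dv.referencedFile, prev, dv.puffinPath)
		}
		owner[dv.referencedFile] = dv.puffinPath
	}
	return nil
}
```

Running a check like this over the live DV entries of the snapshot being produced would turn the silent corruption into a failed commit.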
### Minimal reproducer
```go
func TestDoubleDeleteRegistersTwoDVsForSameFile(t *testing.T) {
	ctx := context.Background()
	tbl := newV3MergeOnReadTestTable(t) // format-version=3, write.delete.mode=merge-on-read

	arrowSchema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64, Nullable: false},
		{Name: "data", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)
	data, _ := array.TableFromJSON(memory.DefaultAllocator, arrowSchema,
		[]string{
			`[{"id":1,"data":"a"},{"id":2,"data":"b"},{"id":3,"data":"c"},{"id":4,"data":"d"},{"id":5,"data":"e"}]`,
		})
	tbl, _ = tbl.Append(ctx, array.NewTableReader(data, -1), nil)

	// First delete writes DV #1 against the single data file.
	tbl, _ = tbl.Delete(ctx, iceberg.EqualTo(iceberg.Reference("id"), int64(2)), nil)

	// Second delete SHOULD merge into DV #1. Instead it writes DV #2
	// against the same data file, and DV #1 is never removed.
	tbl, err := tbl.Delete(ctx, iceberg.EqualTo(iceberg.Reference("id"), int64(4)), nil)
	require.NoError(t, err) // commit "succeeds"; the corruption is silent here

	// Scanning the now-invalid table blows up:
	_, itr, _ := tbl.Scan().ToArrowRecords(ctx)
	for _, err := range itr {
		require.NoError(t, err) // fails: "can't index multiple deletion vectors for ..."
	}
}
```
`newV3MergeOnReadTestTable` is a one-line variant of
[`newMergeOnReadTestTable`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/mor_delete_pruning_test.go#L75-L96),
with `PropertyFormatVersion: "3"` instead of `"2"`.
### Expected vs. actual
- DVs referencing the data file after 2 deletes — expected **1** (merged
bitmap); actual **2** (both live)
- Scan after 2 deletes — expected: succeeds, 3 rows remain; actual: errors
with `can't index multiple deletion vectors for <path>`
### Root cause
- [`Transaction.classifyFilesForFilteredDeletions`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/transaction.go#L1543-L1663)
  only walks data-file manifest entries ([line 1606](https://github.com/debugmiller/iceberg-go/blob/da94843/table/transaction.go#L1606)
  skips anything that isn't `EntryContentData`), so it never looks up a data
  file's existing DV.
- [`positionDeleteRecordsToDataFilesDV`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/arrow_utils.go#L1845-L1898)
  always builds a fresh
  [`dv.NewDVWriter()`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/arrow_utils.go#L1847)
  with no prior bitmap.
- The new DV is appended via
  [`updater.appendDeleteFile(f)`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/transaction.go#L1907)
  without ever calling
  [`updater.removeDeletionVector(...)`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/snapshot_producers.go#L608-L616)
  on the superseded one, even though that method already exists and is used
  correctly by the compaction/`ReplaceFiles` path.
- The read side already guards against the invalid state:
  [`buildDVIndex`](https://github.com/debugmiller/iceberg-go/blob/da94843/table/scanner.go#L561-L576)
  rejects a second DV for the same referenced file, so the failure surfaces at
  scan time, not commit time.
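
Taken together, the points above suggest the shape of a fix: look up the live DV for each touched data file, merge its positions with the newly deleted ones, append the merged DV, and remove the superseded one. The following is a rough, self-contained sketch of that planning step, not the iceberg-go implementation. `deletionVector`, `dvPlan`, and `planDVWrite` are hypothetical names, and a sorted slice stands in for the roaring bitmap a real DV stores.

```go
package main

import (
	"fmt"
	"sort"
)

// deletionVector is a hypothetical stand-in for a DV manifest entry.
type deletionVector struct {
	puffinPath     string
	referencedFile string
	positions      []int64 // deleted row positions, sorted ascending
}

// dvPlan describes what one delete commit should do for a single data file.
type dvPlan struct {
	add    deletionVector  // merged DV to append
	remove *deletionVector // superseded DV to drop; nil if none existed
}

// planDVWrite merges newPositions into the DV already live for dataFile
// (if any) and marks that old DV for removal, so exactly one DV
// references dataFile after the commit.
func planDVWrite(live map[string]deletionVector, dataFile, newPuffin string, newPositions []int64) (dvPlan, error) {
	if len(newPositions) == 0 {
		return dvPlan{}, fmt.Errorf("no positions to delete for %s", dataFile)
	}

	var plan dvPlan
	seen := make(map[int64]struct{})
	if old, ok := live[dataFile]; ok {
		for _, p := range old.positions {
			seen[p] = struct{}{}
		}
		prev := old
		plan.remove = &prev
	}
	for _, p := range newPositions {
		seen[p] = struct{}{}
	}

	merged := make([]int64, 0, len(seen))
	for p := range seen {
		merged = append(merged, p)
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i] < merged[j] })

	plan.add = deletionVector{puffinPath: newPuffin, referencedFile: dataFile, positions: merged}
	return plan, nil
}
```

In terms of the reproducer, and assuming the five rows sit in insertion order in one data file, the first delete (`id = 2`) would plan a DV with position 1 and nothing to remove. The second delete (`id = 4`) would plan a single DV with positions 1 and 3 and mark the first DV for removal.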