C-Loftus opened a new issue, #1077:
URL: https://github.com/apache/iceberg-go/issues/1077
### Question
Thank you for the great work on iceberg-go.
I am relatively new to Iceberg and am trying to make DELETE operations
cause as few rewrites as possible to the underlying Parquet files. I have two
questions:
- Is there currently any way to optimize that behavior other than using
`WriteModeMergeOnRead`?
- Will the Puffin deletion vector metadata in V3, currently in progress,
reduce rewrites further, or is it only intended to speed up scans?
## Context
I was hoping to mark specific rows and have the query engine skip them
when querying. Is it the case that iceberg-go will instead rewrite the
affected rows when a scan reads them? I am confused about why
`WriteModeMergeOnRead` would be useful other than to defer the write until
later. Doesn't it still result in the same number of rewrites to the
underlying Parquet files?
I have a simple abbreviated example here (without error handling). It does
work, but it is unclear to me whether there is a better way.
```go
cat, err := hadoop.NewCatalog("local-catalog", "/tmp/iceberg-warehouse", nil)
tableIdent := catalog.ToIdentifier("default", "triples")
tbl, err := cat.LoadTable(ctx, tableIdent)

predicate := iceberg.EqualTo(
	iceberg.Reference("subject"),
	"foo",
)
newTable, err := tbl.Delete(ctx, predicate, nil)

// VERIFY: scan to ensure no rows remain
// Reload table to ensure latest snapshot
tbl, err = cat.LoadTable(ctx, tableIdent)

// NOTE: this seems to implicitly cause a rewrite
scan := tbl.Scan(
	table.WithRowFilter(
		iceberg.EqualTo(
			iceberg.Reference("subject"),
			"foo",
		),
	),
	table.WithCaseSensitive(true),
)
_, records, err := scan.ToArrowRecords(ctx)
// check the len of records here to ensure 0
```
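To make the behavior I was hoping for concrete, here is a toy model in plain Go. It does not use any iceberg-go API and is not how the library is implemented; `DataFile`, `PositionDeletes`, `MarkDeleted`, `Scan`, and `RewriteWithout` are hypothetical names I made up for illustration. It contrasts marking row positions as deleted, which leaves the data file untouched and filters at scan time, with rewriting the file without the matching rows.

```go
package main

import "fmt"

// DataFile stands in for an immutable Parquet data file (hypothetical type).
type DataFile struct {
	Path string
	Rows []string // one "subject" value per row
}

// PositionDeletes records deleted row positions for a single data file.
// It plays the role of a positional delete file in this toy model.
type PositionDeletes map[int]struct{}

// MarkDeleted returns the positions of rows matching pred.
// The data file itself is not modified.
func MarkDeleted(f DataFile, pred func(string) bool) PositionDeletes {
	dels := PositionDeletes{}
	for pos, subject := range f.Rows {
		if pred(subject) {
			dels[pos] = struct{}{}
		}
	}
	return dels
}

// Scan returns the live rows, skipping marked positions in memory.
// Nothing is written back to storage.
func Scan(f DataFile, dels PositionDeletes) []string {
	live := make([]string, 0, len(f.Rows))
	for pos, subject := range f.Rows {
		if _, deleted := dels[pos]; !deleted {
			live = append(live, subject)
		}
	}
	return live
}

// RewriteWithout models the alternative: produce a new data file that
// omits the matching rows, which costs a full rewrite of the file.
func RewriteWithout(f DataFile, pred func(string) bool, newPath string) DataFile {
	out := DataFile{Path: newPath}
	for _, subject := range f.Rows {
		if !pred(subject) {
			out.Rows = append(out.Rows, subject)
		}
	}
	return out
}

func main() {
	file := DataFile{Path: "data-0.parquet", Rows: []string{"foo", "bar", "foo", "baz"}}
	isFoo := func(s string) bool { return s == "foo" }

	dels := MarkDeleted(file, isFoo)
	fmt.Println(Scan(file, dels)) // prints [bar baz]
	fmt.Println(len(file.Rows))   // prints 4: the original file is unchanged

	rewritten := RewriteWithout(file, isFoo, "data-1.parquet")
	fmt.Println(rewritten.Rows) // prints [bar baz]
}
```

What I am unsure about is whether scanning a merge-on-read table behaves like `Scan` above (filter only, no write), or whether it triggers something like `RewriteWithout`.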
---
_Note that I was going to ask this on Slack, but I do not have an Apache
email address, so it seems I cannot join that community._
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]