C-Loftus opened a new issue, #1077:
URL: https://github.com/apache/iceberg-go/issues/1077

   ### Question
   
   Thank you for the great work on iceberg-go.
   
   I am relatively new to Iceberg and am trying to make DELETE operations 
cause as few rewrites as possible to the underlying Parquet files. I have two 
questions:
   
   - Is there currently any way to optimize that behavior other than using 
`WriteModeMergeOnRead`? 
   - Will the Puffin deletion vector metadata in V3, currently in progress, 
improve this further and reduce rewrites, or is it just intended to speed up scans?
   
   ## Context
   
   I was hoping to somehow mark specific rows and have the query engine 
skip them when querying. Is it the case that iceberg-go will try to rewrite 
the affected rows when a scan passes over them? I am somewhat confused about 
why `WriteModeMergeOnRead` would be useful other than to defer the write 
until later. Doesn't it still result in the same number of rewrites to the 
underlying Parquet files?
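   
   To make concrete what I was hoping for, here is a toy sketch in plain Go. 
It does not use the iceberg-go API: `dataFile`, `deleteFile`, and the three 
functions are hypothetical names made up for illustration. It assumes, as I 
understand the Iceberg spec, that merge-on-read records deleted row positions 
in a separate delete file that readers apply at scan time, while copy-on-write 
produces a new data file without the deleted rows.
   
   ```go
   package main

   import "fmt"

   // dataFile stands in for an immutable Parquet file (hypothetical type).
   // Only the "subject" column is modeled, for brevity.
   type dataFile struct {
   	rows []string
   }

   // deleteFile stands in for a positional delete file (hypothetical type):
   // it lists row positions in one data file that readers must skip.
   type deleteFile struct {
   	deleted map[int]bool
   }

   // copyOnWriteDelete builds a brand-new data file without the matching rows.
   func copyOnWriteDelete(f dataFile, subject string) dataFile {
   	kept := make([]string, 0, len(f.rows))
   	for _, s := range f.rows {
   		if s != subject {
   			kept = append(kept, s)
   		}
   	}
   	return dataFile{rows: kept}
   }

   // mergeOnReadDelete leaves the data file untouched and only records
   // the positions of the matching rows.
   func mergeOnReadDelete(f dataFile, subject string) deleteFile {
   	d := deleteFile{deleted: map[int]bool{}}
   	for i, s := range f.rows {
   		if s == subject {
   			d.deleted[i] = true
   		}
   	}
   	return d
   }

   // scan applies the delete file at read time, skipping marked positions.
   func scan(f dataFile, d deleteFile) []string {
   	var out []string
   	for i, s := range f.rows {
   		if !d.deleted[i] {
   			out = append(out, s)
   		}
   	}
   	return out
   }

   func main() {
   	f := dataFile{rows: []string{"foo", "bar", "foo", "baz"}}
   	d := mergeOnReadDelete(f, "foo")
   	// The original file still holds all 4 rows; the scan hides the deleted ones.
   	fmt.Println(len(f.rows), scan(f, d)) // prints: 4 [bar baz]
   }
   ```
   
   In this toy model the data file is written only once, and the cost of the 
delete moves to every scan until something compacts the files. Whether 
iceberg-go behaves this way is exactly what I am asking.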
   
   I have a simple abbreviated example here (without error handling). It does 
work, but it is unclear to me whether there is a better way. 
   ```go
   cat, err := hadoop.NewCatalog("local-catalog", "/tmp/iceberg-warehouse", nil)
   
   tableIdent := catalog.ToIdentifier("default", "triples")
   
   tbl, err := cat.LoadTable(ctx, tableIdent)
   
   predicate := iceberg.EqualTo(
        iceberg.Reference("subject"),
        "foo",
   )
   
   newTable, err := tbl.Delete(ctx, predicate, nil)
   
   // VERIFY: scan to ensure no rows remain
   // Reload table to ensure latest snapshot
   tbl, err = cat.LoadTable(ctx, tableIdent)
   // NOTE: this seems to implicitly cause a rewrite
   scan := tbl.Scan(
        table.WithRowFilter(
                iceberg.EqualTo(
                        iceberg.Reference("subject"),
                        "foo",
                ),
        ),
        table.WithCaseSensitive(true),
   )
   _, records, err := scan.ToArrowRecords(ctx)
   // check the len of records here to ensure 0
   ```
   
   --- 
   
   _Note that I was going to ask this on Slack but I do not have an apache 
email so it seems I cannot join that community_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

