tanmayrauth commented on issue #1207:
URL: https://github.com/apache/iceberg-go/issues/1207#issuecomment-4715150326

   Thanks for the detailed write-up, and for offering to bring your downstream 
CDC experience upstream,  validating the generated delete files against Spark 
and Trino is exactly the kind of grounding we want  behind an equality-delete 
API.
   
   I'm broadly supportive of a thin convenience wrapper here. A few things 
worth settling before we lock the shape:
   
   On naming: I'd avoid DeleteEquality sitting right next to Delete(filter, …). 
The two do very different things (predicate-based copy-on-write/merge-on-read 
vs. commit-by-key delete files), and the similar  names invite confusion. 
Something like CommitEqualityDeletes or DeleteByEqualityRecords reads more 
honestly. It's really a row-delta operation, not a filter delete.
    
   On scope: this should stay pure sugar. The two-step flow you quoted is the 
whole thing, so the wrapper should add no new behavior beyond composing 
WriteEqualityDeletes +  NewRowDelta().AddDeletes().Commit(). We already 
validate the v2 requirement in both WriteEqualityDeletes and RowDelta.Commit, 
and AddDeletes already checks content type and equality field ID, so the  
convenience method inherits all of that for free and shouldn't introduce a 
second validation path.  
   
   The part I'd actually like designed rather than deferred is conflict 
detection. In Java the equality-delete path is deliberately not a one-liner, 
precisely because correct CDC needs the RowDelta to carry  validation — 
validateFromSnapshot(...) plus validateNoConflictingDataFiles() / 
validateNoConflictingDeleteFiles() — so a delete committed against a stale 
snapshot doesn't silently miss rows a concurrent append added. An equality 
delete applies to all data files with a lower sequence number, so under 
concurrent writers a naive AddDeletes().Commit() can produce a correct-looking 
snapshot that misses data.  For a streaming engine running continuously, that's 
the failure mode that bites in production, not the v2 check. So the question 
I'd want answered in the proposal: does the convenience API expose (or  default 
to) the validation/conflict-detection knobs, or does it commit optimistically? 
I'd rather expose snapshotProps and a conflict-detection option from day one 
than ship som
 ething that's only safe for single-writer workloads and have people reach for 
it in exactly the concurrent CDC scenario where it isn't.
   
    
   
   @zeroshade @laskoviymishka  curious whether you'd want the 
conflict-detection validation handled inside this convenience API or left to 
the caller via the lower-level RowDelta flow, and whether our current RowDelta 
validators already cover the stale-snapshot case or that's a gap to close 
first. Want to make sure we align on that before @gerinsp starts coding.
   
    If we scope the PR to the wrapper, explicit conflict-detection handling 
(even if it means a small RowDelta addition), and your test matrix plus a 
concurrent-writer test, I think this is very mergeable.  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to