tanmayrauth commented on issue #1207:
URL: https://github.com/apache/iceberg-go/issues/1207#issuecomment-4715150326
Thanks for the detailed write-up, and for offering to bring your downstream
CDC experience upstream, validating the generated delete files against Spark
and Trino is exactly the kind of grounding we want behind an equality-delete
API.
I'm broadly supportive of a thin convenience wrapper here. A few things
worth settling before we lock the shape:
On naming: I'd avoid DeleteEquality sitting right next to Delete(filter, …).
The two do very different things (predicate-based copy-on-write/merge-on-read
vs. commit-by-key delete files), and the similar names invite confusion.
Something like CommitEqualityDeletes or DeleteByEqualityRecords reads more
honestly. It's really a row-delta operation, not a filter delete.
On scope: this should stay pure sugar. The two-step flow you quoted is the
whole thing, so the wrapper should add no new behavior beyond composing
WriteEqualityDeletes + NewRowDelta().AddDeletes().Commit(). We already
validate the v2 requirement in both WriteEqualityDeletes and RowDelta.Commit,
and AddDeletes already checks content type and equality field ID, so the
convenience method inherits all of that for free and shouldn't introduce a
second validation path.
The part I'd actually like designed rather than deferred is conflict
detection. In Java the equality-delete path is deliberately not a one-liner,
precisely because correct CDC needs the RowDelta to carry validation —
validateFromSnapshot(...) plus validateNoConflictingDataFiles() /
validateNoConflictingDeleteFiles() — so a delete committed against a stale
snapshot doesn't silently miss rows a concurrent append added. An equality
delete applies to all data files with a lower sequence number, so under
concurrent writers a naive AddDeletes().Commit() can produce a correct-looking
snapshot that misses data. For a streaming engine running continuously, that's
the failure mode that bites in production, not the v2 check. So the question
I'd want answered in the proposal: does the convenience API expose (or default
to) the validation/conflict-detection knobs, or does it commit optimistically?
I'd rather expose snapshotProps and a conflict-detection option from day one
than ship som
ething that's only safe for single-writer workloads and have people reach for
it in exactly the concurrent CDC scenario where it isn't.
@zeroshade @laskoviymishka curious whether you'd want the
conflict-detection validation handled inside this convenience API or left to
the caller via the lower-level RowDelta flow, and whether our current RowDelta
validators already cover the stale-snapshot case or that's a gap to close
first. Want to make sure we align on that before @gerinsp starts coding.
If we scope the PR to the wrapper, explicit conflict-detection handling
(even if it means a small RowDelta addition), and your test matrix plus a
concurrent-writer test, I think this is very mergeable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]