moomindani commented on issue #1078: URL: https://github.com/apache/iceberg-python/issues/1078#issuecomment-4675503127
Regarding the v3 side of this issue: the v3 spec requires position deletes to be written as deletion vectors in Puffin files, so the merge-on-read DELETE path needs a DV writer rather than position delete files. As @Fokko mentioned earlier in this thread, jumping directly to deletion vectors is an option; for v3 tables it is the required one, while the v2 position-delete-file discussion above stays independent of this.

I have been working on building blocks for the v3 part (continuing #2822, coordinated with its original authors), and here is how I am currently thinking about it:

1. #3474 (under review) adds a low-level PuffinWriter that serializes deletion-vector-v1 blobs, and #3476 adds an integration test verifying PyIceberg can read Spark-written DVs.
2. Next, extend PuffinWriter to write one DV blob per referenced data file (mirroring Java's BaseDVFileWriter) and expose blob offsets/lengths, which are needed for DataFile's content_offset / content_size_in_bytes.
3. Then wire this into the merge-on-read branch of Transaction.delete() for v3 tables: compute matched row positions per data file, union with an existing DV when present (at most one DV per data file), write one Puffin file per commit, and add the DV entries through a delete snapshot. Spark interop tests (Spark reading PyIceberg-written DVs) would come with this step.

On the default question @kevinjqliu raised above: in this plan `write.delete.mode=merge-on-read` would only take effect on v3 tables, and copy-on-write stays the default for everything, so behavior does not change unless a user opts in.

This is the approach I currently have in mind, but I am open to alternatives, a different split of the work, or aligning with anything already planned by the maintainers. Any feedback before I move forward with the follow-up PRs is very welcome.

-- This is an automated message from the Apache Git Service.
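To make steps 2 and 3 of the plan above more concrete, here is a rough, library-independent sketch of the per-commit bookkeeping: union newly matched positions with any existing DV for the same data file, emit one blob per data file, and record each blob's offset and length. Everything in it is an assumption for illustration: `DVBlobPlan`, `encode_placeholder`, and `plan_dv_blobs` are hypothetical names and not PyIceberg APIs, plain Python sets stand in for the bitmap, the byte encoding is a placeholder rather than the real deletion-vector-v1 format, and the starting offset of 4 assumes blobs begin right after the 4-byte Puffin magic.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class DVBlobPlan:
    """Hypothetical record of where one data file's DV blob lands in the Puffin file."""

    data_file: str
    positions: tuple[int, ...]
    content_offset: int
    content_size_in_bytes: int


def encode_placeholder(positions: list[int]) -> bytes:
    # Placeholder only: packs each position as a little-endian uint64.
    # A real deletion-vector-v1 blob uses a bitmap encoding instead.
    return struct.pack(f"<{len(positions)}Q", *positions)


def plan_dv_blobs(matched, existing, start_offset=4):
    """Merge positions per data file and lay the blobs out back to back.

    matched:  {data_file_path: iterable of row positions hit by this DELETE}
    existing: {data_file_path: iterable of positions in the current DV, if any}
    start_offset: assumed offset of the first blob (after the Puffin magic).
    """
    plans = []
    payload = bytearray()
    for data_file in sorted(matched):
        if not matched[data_file]:
            continue  # nothing newly deleted, so the current DV stays as it is
        # At most one DV per data file: the new DV replaces the old one,
        # so it must contain the union of old and new positions.
        merged = sorted(set(matched[data_file]) | set(existing.get(data_file, ())))
        blob = encode_placeholder(merged)
        plans.append(
            DVBlobPlan(
                data_file=data_file,
                positions=tuple(merged),
                content_offset=start_offset + len(payload),
                content_size_in_bytes=len(blob),
            )
        )
        payload += blob
    return plans, bytes(payload)


if __name__ == "__main__":
    plans, payload = plan_dv_blobs(
        matched={"b.parquet": [5, 1], "a.parquet": [3]},
        existing={"b.parquet": {1, 9}},
    )
    for plan in plans:
        print(plan)
```

The point of the sketch is only that the offset and length of every blob have to be known at write time, because each delete-file entry for a data file points into the shared Puffin file. A data file that already has a DV but no newly matched rows is left untouched here; whether that matches the eventual implementation is an open design question.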
