moomindani commented on issue #1078:
URL: 
https://github.com/apache/iceberg-python/issues/1078#issuecomment-4675503127

   Regarding the v3 side of this issue: the v3 spec requires position deletes 
to be written as deletion vectors in Puffin files, so the merge-on-read DELETE 
path needs a DV writer rather than position delete files. As @Fokko mentioned 
earlier in this thread, jumping directly to deletion vectors is an option. For 
v3 tables it is the required one, while the v2 position-delete-file discussion 
above stays independent of this.
   
   I have been working on building blocks for the v3 part (continuing #2822, 
coordinated with its original authors), and here is how I am currently thinking 
about it:
   
   1. #3474 (under review) adds a low-level PuffinWriter that serializes 
deletion-vector-v1 blobs, and #3476 adds an integration test verifying 
PyIceberg can read Spark-written DVs.
   2. Next, extend PuffinWriter to write one DV blob per referenced data file 
(mirroring Java's BaseDVFileWriter) and expose blob offsets/lengths, which are 
needed for DataFile's content_offset / content_size_in_bytes.
   3. Then wire this into the merge-on-read branch of Transaction.delete() for 
v3 tables: compute matched row positions per data file, union with an existing 
DV when present (at most one DV per data file), write one Puffin file per 
commit, and add the DV entries through a delete snapshot. Spark interop tests 
(Spark reading PyIceberg-written DVs) would come with this step.
   
   On the default question @kevinjqliu raised above: in this plan 
`write.delete.mode=merge-on-read` would only take effect on v3 tables, and 
copy-on-write stays the default for all format versions, so behavior does not 
change unless a user opts in.
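
As a rough illustration of that rule, the mode selection could look like the sketch below. `effective_delete_mode` is a hypothetical helper, not an existing PyIceberg function; the only names taken from this thread are the `write.delete.mode` property and its two values.

```python
from typing import Mapping


def effective_delete_mode(format_version: int, properties: Mapping[str, str]) -> str:
    """Return the delete mode that would actually be used under this plan.

    Hypothetical helper: merge-on-read is honored only when the user opts in
    AND the table is v3 or later; everything else falls back to copy-on-write.
    """
    requested = properties.get("write.delete.mode", "copy-on-write")
    if requested == "merge-on-read" and format_version >= 3:
        return "merge-on-read"
    return "copy-on-write"


v3_opted_in = effective_delete_mode(3, {"write.delete.mode": "merge-on-read"})
v2_opted_in = effective_delete_mode(2, {"write.delete.mode": "merge-on-read"})
v3_default = effective_delete_mode(3, {})
```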
   
   This is the approach I currently have in mind, but I am open to 
alternatives, a different split of the work, or aligning with anything already 
planned by the maintainers. Any feedback before I move forward with the 
follow-up PRs is very welcome.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

