qzyu999 commented on PR #3131:
URL: https://github.com/apache/iceberg-python/pull/3131#issuecomment-4146501702

   Hi @kevinjqliu, apologies for the delay, thank you so much for taking the 
time to review the PR again, I understand that you are quite busy. I've 
addressed all your points in the latest set of tests within 
33aaef0817b1366d80bcd8194a0c2ca5ba6f46f2. I've thoroughly expanded the tests to 
integrate those requirements across a broad set of tests.
   
   There are two minor issues I noticed however:
   - Requirement: If the difference is due to prior soft-deletes, confirm those 
delete files account for it
     - This would require however that the `_RewriteFiles` be scoped to handle 
Delete Manifests, but currently it's only set to handle Data Files. Handling 
Delete Manifests would make it so that we could potentially do `REPLACE` 
operations on deleted files. For example the purpose of this PR is to allow for 
compaction of data files, but we could in theory also compact delete files for 
the use case that someone has run many delete operations on many small files.
     - I think this is definitely something to work on, but perhaps not in this 
PR. The Java code seems to handle this well. I am thinking that after we merge 
this `REPLACE`, we can next work on the data compaction issue. Then after that 
we can come back to work on _RewriteFiles for Delete Manifests and work on 
metadata compaction afterwards.
   - Another more minor issue I noticed is that from running the tests and 
doing `fast_append()` on files that are `DataFileContent.POSITION_DELETES`, 
they're not yet being labeled properly as `ManifestContent.DELETES`. IIUC this 
is due to the fact that `_SnapshotProducer._manifests()` (which `fast_append` 
relies on under the hood) currently defaults to creating standard 
`ManifestContent.DATA` writers. It doesn't yet inspect the incoming file's 
content type to route `POSITION_DELETES` into a dedicated 
`ManifestContent.DELETES` writer. I worked around this in my test by scanning 
the manifest entries directly rather than relying on the manifest's label, but 
I just wanted to flag it for the roadmap for when we build out full 
Merge-on-Read write support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to