qzyu999 commented on PR #3131:
URL: https://github.com/apache/iceberg-python/pull/3131#issuecomment-4146501702
Hi @kevinjqliu, apologies for the delay, thank you so much for taking the
time to review the PR again, I understand that you are quite busy. I've
addressed all your points in the latest set of tests within
33aaef0817b1366d80bcd8194a0c2ca5ba6f46f2. I've thoroughly expanded the tests to
integrate those requirements across a broad set of tests.
There are two minor issues I noticed however:
- Requirement: If the difference is due to prior soft-deletes, confirm those
delete files account for it
- This would require however that the `_RewriteFiles` be scoped to handle
Delete Manifests, but currently it's only set to handle Data Files. Handling
Delete Manifests would make it so that we could potentially do `REPLACE`
operations on deleted files. For example the purpose of this PR is to allow for
compaction of data files, but we could in theory also compact delete files for
the use case that someone has run many delete operations on many small files.
- I think this is definitely something to work on, but perhaps not in this
PR. The Java code seems to handle this well. I am thinking that after we merge
this `REPLACE`, we can next work on the data compaction issue. Then after that
we can come back to work on _RewriteFiles for Delete Manifests and work on
metadata compaction afterwards.
- Another more minor issue I noticed is that from running the tests and
doing `fast_append()` on files that are `DataFileContent.POSITION_DELETES`,
they're not yet being labeled properly as `ManifestContent.DELETES`. IIUC this
is due to the fact that `_SnapshotProducer._manifests()` (which `fast_append`
relies on under the hood) currently defaults to creating standard
`ManifestContent.DATA` writers. It doesn't yet inspect the incoming file's
content type to route `POSITION_DELETES` into a dedicated
`ManifestContent.DELETES` writer. I worked around this in my test by scanning
the manifest entries directly rather than relying on the manifest's label, but
I just wanted to flag it for the roadmap for when we build out full
Merge-on-Read write support.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]