vustef commented on issue #1636: URL: https://github.com/apache/iceberg-rust/issues/1636#issuecomment-3467155465
In our fork, we've figured out what shape this changelog could take to satisfy our needs. Here's a rough outline.

We need to return two streams (with the possibility of combining them into a single stream): one for appends (aka inserts) and one for deletes. These streams represent the changes between snapshots `from_snapshot_id` and `to_snapshot_id`; below, let's call them `from_snapshot_id=A` and `to_snapshot_id=B`.

The insert stream would contain `(_file, _pos, user_cols...)`, where `user_cols` are the values of the actual columns for a particular row. Note that adding the `_file` and `_pos` metadata columns is an orthogonal feature, tracked in https://github.com/apache/iceberg-rust/issues/1766 and https://github.com/apache/iceberg-rust/issues/1765.

The delete stream should only contain `(file_path, pos)`, i.e. the kind of entries that already appear in positional delete files. However, this stream should also contain entries for deleted data files, and it should resolve equality deletes and translate them into the same format.

To produce inserts, we'd simply scan all the data files that appear between snapshots A and B and apply positional and equality deletes, just like in regular `FileScanTask`s. Data files that are added >A but deleted <=B would be skipped.

To produce deletes, we'd only consider deletes that refer to data files <=A.

Regarding the mode, the changes would be almost `net` changes. The deviation can happen in deletes, e.g. when a data file is deleted after positional deletes had already been applied to it, so the same position can show up twice. Detecting this seems computationally expensive, and there's no need to deduplicate deletes in the scan itself; we can always do that by post-processing the stream.

Finally, I'm not sure what the difference is between this and the incremental scan: https://github.com/apache/iceberg-rust/issues/1469.
Initially that was only an incremental append-only scan, but with the introduction of deletes, the difference between an incremental scan and a changelog scan is blurry to me.
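
To make the file-selection rules above concrete, here is a minimal sketch in Rust. It is not the iceberg-rust API: `DataFile`, `DeleteRow`, and the three functions are hypothetical names, and it assumes a linear snapshot history where each snapshot can be compared by a simple ordinal (`added_at`, `removed_at`). It only covers which data files feed the insert stream and how a removed data file could be expanded into `(file_path, pos)` entries; resolving positional and equality delete files is left out.

```rust
/// Hypothetical, simplified view of a data file; not an iceberg-rust type.
#[derive(Debug, Clone)]
struct DataFile {
    path: String,
    record_count: u64,
    /// Ordinal of the snapshot that added the file (assumed linear history).
    added_at: u64,
    /// Ordinal of the snapshot that removed the file, if any.
    removed_at: Option<u64>,
}

/// One entry of the delete stream: `(file_path, pos)`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DeleteRow {
    file_path: String,
    pos: u64,
}

/// Insert stream: scan files added in (A, B], skipping those already removed by B.
fn scanned_for_inserts(f: &DataFile, a: u64, b: u64) -> bool {
    let added_in_range = f.added_at > a && f.added_at <= b;
    let removed_by_b = f.removed_at.map_or(false, |r| r <= b);
    added_in_range && !removed_by_b
}

/// Delete stream: only data files that already existed at A are relevant.
fn existed_at_a(f: &DataFile, a: u64) -> bool {
    f.added_at <= a
}

/// Expand a data file removed in (A, B] into one delete entry per row position.
/// Positions already covered by earlier positional deletes are NOT filtered out,
/// so the result is "almost net" and may need deduplication downstream.
fn deletes_from_removed_file(f: &DataFile, a: u64, b: u64) -> Vec<DeleteRow> {
    let removed_in_range = f.removed_at.map_or(false, |r| r > a && r <= b);
    if !(existed_at_a(f, a) && removed_in_range) {
        return Vec::new();
    }
    (0..f.record_count)
        .map(|pos| DeleteRow { file_path: f.path.clone(), pos })
        .collect()
}
```

With A=10 and B=20, a file added at 15 and still live is scanned for inserts, a file added at 12 and removed at 18 contributes to neither stream, and a file added at 5 and removed at 14 contributes one delete entry per row.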
