vustef commented on issue #1636:
URL: https://github.com/apache/iceberg-rust/issues/1636#issuecomment-3467155465

   In our fork, we've figured out what shape this changelog could look like to 
satisfy our needs. Here's a rough outline.
   
   We need to return two streams (with a possibility of combining them in a 
single stream). One stream is for appends (aka inserts), and the other one is 
for deletes. These streams represent the changes between snapshots 
`from_snapshot_id` and `to_snapshot_id`; from here on, let's call them 
A (`from_snapshot_id`) and B (`to_snapshot_id`).
   
   The insert stream would contain `(_file, _pos, user_cols...)` (where 
`user_cols` are the values of the actual columns for a particular row). Note that 
adding `_file` and `_pos` metadata columns is an orthogonal feature, tracked 
with https://github.com/apache/iceberg-rust/issues/1766 and 
https://github.com/apache/iceberg-rust/issues/1765.
   
   The deletes stream should contain only `(file_path, pos)` entries, i.e. the 
kind of entries that already appear in positional delete files. However, this 
stream should also contain entries for rows in deleted data files, and it 
should resolve and translate equality deletes into the same format.
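
   As a minimal sketch of the "entries from deleted data files" part: assuming 
the record count of a removed data file is known, the file can be expanded 
into one `(file_path, pos)` entry per row. `DeleteEntry` and 
`expand_removed_file` are hypothetical names for illustration, not existing 
iceberg-rust APIs.

```rust
/// One row of the hypothetical deletes stream: a row position inside a data file.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DeleteEntry {
    pub file_path: String,
    pub pos: u64,
}

/// Expands a removed data file into one delete entry per row position,
/// so it has the same shape as entries read from positional delete files.
/// `record_count` is assumed to be known for the removed file.
pub fn expand_removed_file(file_path: &str, record_count: u64) -> Vec<DeleteEntry> {
    (0..record_count)
        .map(|pos| DeleteEntry {
            file_path: file_path.to_string(),
            pos,
        })
        .collect()
}
```

   A real implementation would presumably stream these positions rather than 
materialize them in a `Vec`; the vector only keeps the sketch short.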
   
   To produce inserts, we'd simply scan all the data files that appear between 
snapshots A and B, and apply positional and equality deletes, just like in 
regular `FileScanTask`s. Data files that were added after A but deleted at or 
before B would be skipped.
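
   The file selection for inserts could be sketched roughly as below. This 
assumes each data file carries a hypothetical ordinal for when it was added 
and (optionally) removed, comparable with ordinals for A and B; 
`DataFileLifetime` and `files_for_inserts` are illustrative names only.

```rust
/// Hypothetical summary of a data file's lifetime. `added_at` and `removed_at`
/// are assumed to be ordinals comparable with the ordinals of snapshots A and B.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataFileLifetime {
    pub path: String,
    pub added_at: u64,
    pub removed_at: Option<u64>,
}

/// Selects the data files to scan for the insert stream:
/// added after A and at or before B, and not removed again at or before B.
pub fn files_for_inserts(files: &[DataFileLifetime], a: u64, b: u64) -> Vec<&DataFileLifetime> {
    files
        .iter()
        .filter(|f| {
            let added_in_range = f.added_at > a && f.added_at <= b;
            let removed_by_b = f.removed_at.map_or(false, |r| r <= b);
            added_in_range && !removed_by_b
        })
        .collect()
}
```

   The selected files would still need positional and equality deletes applied 
while scanning; this sketch covers only which files are considered.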
   
   To produce deletes, we'd only consider deletes that refer to data files 
added at or before A.
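
   One way to read this rule: deletes against files added after A are already 
applied while producing inserts, so only deletes against files that existed 
as of A need to be emitted. A rough sketch, assuming the set of file paths 
present as of A is available (`PositionalDelete` and `deletes_to_emit` are 
hypothetical names):

```rust
use std::collections::HashSet;

/// Hypothetical positional delete: a row position inside a data file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PositionalDelete {
    pub file_path: String,
    pub pos: u64,
}

/// Keeps only the deletes that refer to data files present as of snapshot A.
/// Deletes against newer files are dropped here, on the assumption that they
/// are applied when scanning those files for the insert stream.
pub fn deletes_to_emit(
    deletes: Vec<PositionalDelete>,
    files_present_at_a: &HashSet<String>,
) -> Vec<PositionalDelete> {
    deletes
        .into_iter()
        .filter(|d| files_present_at_a.contains(&d.file_path))
        .collect()
}
```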
   
   Regarding the mode, the changes would be almost `net` changes. The deviation 
from that can happen in deletes, e.g. when a data file that already had 
positional deletes applied is itself deleted, so the same position may be 
reported more than once. Detecting this seems computationally expensive, and 
there's no need to deduplicate deletes in the scan itself: that can always be 
done by post-processing the stream.
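
   Such post-processing could be as simple as the sketch below, which assumes 
delete entries are plain `(file_path, pos)` tuples and that the set of seen 
entries fits in memory; `dedup_deletes` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Drops repeated `(file_path, pos)` entries, keeping the first occurrence
/// of each and preserving the original order of the stream.
pub fn dedup_deletes(entries: impl IntoIterator<Item = (String, u64)>) -> Vec<(String, u64)> {
    let mut seen = HashSet::new();
    entries
        .into_iter()
        // `insert` returns false when the entry was already present.
        .filter(|entry| seen.insert(entry.clone()))
        .collect()
}
```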
   
   Finally, I'm not sure what the difference is between this and the 
incremental scan: https://github.com/apache/iceberg-rust/issues/1469. Initially 
that was only an append-only incremental scan, but with the introduction of 
deletes, the difference between an incremental scan and a changelog scan is 
blurry to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

