First, thank you all for your responses to my question.

For Peter's question, I believe that (b) is the correct behavior. It is
also the current behavior when using copy-on-write (deletes and updates
are still supported, but without using delete files). A changelog scan is
an incremental scan over multiple snapshots, and it should emit the
changes for each snapshot in the requested range.

Spark provides additional functionality on top of the changelog scan to
produce net changes for the requested range. See
https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view.
The create_changelog_view procedure uses a changelog scan (it reads the
changelog table, i.e., <table>.changes) to get a DataFrame, which is saved
as a temporary Spark view that can then be queried. If net_changes is
true, only the net changes are produced for this temporary view. This
functionality uses ChangelogIterator.removeNetCarryovers (which is in
Spark).
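To make the difference between the per-snapshot changelog and the net changes concrete, here is a minimal Python sketch of Peter's S1..S3 example. It is only a toy model, not the actual ChangelogIterator.removeNetCarryovers code: the record layout and the net_changes helper are hypothetical names introduced for illustration, and the cancellation rule (an INSERTED row followed by a DELETED of the same row nets out to nothing) is my simplified reading of what "net changes" means here.

```python
# Toy model of a changelog over snapshots S1..S3 (not Iceberg code).
# Each record is (snapshot, pk, value, change_type).
CHANGELOG = [
    ("S1", "PK1", "V1", "INSERTED"),
    ("S2", "PK1", "V1", "DELETED"),
    ("S2", "PK1", "V1b", "INSERTED"),
    ("S3", "PK1", "V1b", "DELETED"),
    ("S3", "PK1", "V1c", "INSERTED"),
]


def full_history(records):
    """Option (b): emit every change of every snapshot in the range."""
    return [(pk, value, change) for _, pk, value, change in records]


def net_changes(records):
    """Option (a): cancel an INSERTED row against a later DELETED of the same row."""
    pending = []  # surviving (pk, value, change) records, in emission order
    for _, pk, value, change in records:
        if change == "DELETED" and (pk, value, "INSERTED") in pending:
            # The row was inserted and deleted inside the range: no net change.
            pending.remove((pk, value, "INSERTED"))
        else:
            pending.append((pk, value, change))
    return pending


if __name__ == "__main__":
    for row in full_history(CHANGELOG):
        print("full:", row)
    for row in net_changes(CHANGELOG):
        print("net: ", row)
```

In this model, full_history corresponds to what the changelog scan itself emits, and net_changes corresponds to the extra squashing step that Spark layers on top when net_changes is true.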

On Thu, Aug 22, 2024 at 7:51 AM Steven Wu <stevenz...@gmail.com> wrote:

> Peter, good question. In this case, (b) is the complete change history.
> (a) is the squashed version.
>
> I would probably check how other changelog systems deal with this scenario.
>
> On Thu, Aug 22, 2024 at 3:49 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Technically different, but somewhat similar question:
>>
>> What is the expected behaviour when the `IncrementalScan` is created for
>> not a single snapshot, but for multiple snapshots?
>> S1 added PK1-V1
>> S2 updated PK1-V1 to PK1-V1b (removed PK1-V1 and added PK1-V1b)
>> S3 updated PK1-V1b to PK1-V1c (removed PK1-V1b and added PK1-V1c)
>>
>> Let's say we have
>> *IncrementalScan.fromSnapshotInclusive(S1).toSnapshot(S3)*.
>> Or we need to return:
>> (a)
>> - PK1,V1c,INSERTED
>>
>> Or is it ok, to return:
>> (b)
>> - PK1,V1,INSERTED
>> - PK1,V1,DELETED
>> - PK1,V1b,INSERTED
>> - PK1,V1b,DELETED
>> - PK1,V1c,INSERTED
>>
>> I think the (a) is the correct behaviour.
>>
>> Thanks,
>> Peter
>>
>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, Aug 21, 2024, 22:27):
>>
>>> Agree with everyone that option (a) is the correct behavior.
>>>
>>> On Wed, Aug 21, 2024 at 11:57 AM Steve Zhang
>>> <hongyue_zh...@apple.com.invalid> wrote:
>>>
>>>> I agree that option (a) is what user expects for row level changes.
>>>>
>>>> I feel the added deletes in given snapshots provides a PK of DELETED
>>>> entry, existing deletes are used to read together with data files to find
>>>> DELETED value (V1b) and result of columns.
>>>>
>>>> Thanks,
>>>> Steve Zhang
>>>>
>>>>
>>>>
>>>> On Aug 20, 2024, at 6:06 PM, Wing Yew Poon <wyp...@cloudera.com.INVALID>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a PR open to add changelog support for the case where delete
>>>> files are present (https://github.com/apache/iceberg/pull/10935). I
>>>> have a question about what the changelog should emit in the following
>>>> scenario:
>>>>
>>>> The table has a schema with a primary key/identifier column PK and
>>>> additional column V.
>>>> In snapshot 1, we write a data file DF1 with rows
>>>> PK1, V1
>>>> PK2, V2
>>>> etc.
>>>> In snapshot 2, we write an equality delete file ED1 with PK=PK1, and
>>>> new data file DF2 with rows
>>>> PK1, V1b
>>>> (possibly other rows)
>>>> In snapshot 3, we write an equality delete file ED2 with PK=PK1, and
>>>> new data file DF3 with rows
>>>> PK1, V1c
>>>> (possibly other rows)
>>>>
>>>> Thus, in snapshot 2 and snapshot 3, we update the row identified by PK1
>>>> with new values by using an equality delete and writing new data for the
>>>> row.
>>>> These are the files present in snapshot 3:
>>>> DF1 (sequence number 1)
>>>> DF2 (sequence number 2)
>>>> DF3 (sequence number 3)
>>>> ED1 (sequence number 2)
>>>> ED2 (sequence number 3)
>>>>
>>>> The question I have is what should the changelog emit for snapshot 3?
>>>> For snapshot 1, the changelog should emit a row for each row in DF1 as
>>>> INSERTED.
>>>> For snapshot 2, it should emit a row for PK1, V1 as DELETED; and a row
>>>> for PK1, V1b as INSERTED.
>>>> For snapshot 3, I see two possibilities:
>>>> (a)
>>>> PK1,V1b,DELETED
>>>> PK1,V1c,INSERTED
>>>>
>>>> (b)
>>>> PK1,V1,DELETED
>>>> PK1,V1b,DELETED
>>>> PK1,V1c,INSERTED
>>>>
>>>> The interpretation for (b) is that both ED1 and ED2 apply to DF1, with
>>>> ED1 being an existing delete file and ED2 being an added delete file for
>>>> it. We discount ED1 and apply ED2 and get a DELETED row for PK1,V1.
>>>> ED2 also applies to DF2, from which we get a DELETED row for PK1, V1b.
>>>>
>>>> The interpretation for (a) is that ED1 is an existing delete file for
>>>> DF1 and in snapshot 3, the row PK1,V1 already does not exist before the
>>>> snapshot. Thus we do not emit a row for it. (We can think of it as ED1 is
>>>> already applied to DF1, and we only consider any additional rows that get
>>>> deleted when ED2 is applied.)
>>>>
>>>> I lean towards (a), as I think it is more reflective of net changes.
>>>> I am interested to hear what folks think.
>>>>
>>>> Thank you,
>>>> Wing Yew
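
For reference, the equality-delete scenario from the original message can be modeled with a small Python sketch that follows interpretation (a). This is an illustration under stated assumptions, not the code in the PR: the DataFile and EqDelete classes and the changelog_for_snapshot helper are hypothetical, each snapshot is assumed to have a sequence number equal to its snapshot number (as in the example), and an equality delete is taken to apply only to data files with a strictly smaller sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    seq: int      # data sequence number
    rows: tuple   # (pk, value) pairs


@dataclass(frozen=True)
class EqDelete:
    name: str
    seq: int          # data sequence number
    pks: frozenset    # primary-key values removed by this delete file


def applies(delete, data_file):
    # An equality delete only affects data files written before it,
    # i.e. files with a strictly smaller sequence number.
    return data_file.seq < delete.seq


def changelog_for_snapshot(seq, data_files, deletes):
    """Changes introduced by the snapshot with sequence number `seq`, per option (a)."""
    existing = [d for d in deletes if d.seq < seq]   # deletes from earlier snapshots
    added = [d for d in deletes if d.seq == seq]     # deletes added by this snapshot
    changes = []
    for f in data_files:
        if f.seq > seq:
            continue  # file does not exist yet in this snapshot
        for pk, value in f.rows:
            if f.seq == seq:
                changes.append((pk, value, "INSERTED"))
            elif any(applies(d, f) and pk in d.pks for d in existing):
                continue  # row was already deleted before this snapshot
            elif any(applies(d, f) and pk in d.pks for d in added):
                changes.append((pk, value, "DELETED"))
    return changes


DATA_FILES = [
    DataFile("DF1", 1, (("PK1", "V1"), ("PK2", "V2"))),
    DataFile("DF2", 2, (("PK1", "V1b"),)),
    DataFile("DF3", 3, (("PK1", "V1c"),)),
]
DELETES = [
    EqDelete("ED1", 2, frozenset({"PK1"})),
    EqDelete("ED2", 3, frozenset({"PK1"})),
]

if __name__ == "__main__":
    for snapshot in (1, 2, 3):
        print(snapshot, changelog_for_snapshot(snapshot, DATA_FILES, DELETES))
```

In this model ED1 is treated as already applied to DF1 when snapshot 3 is processed, so only the row in DF2 is reported as DELETED for that snapshot, which matches (a).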