Thanks Ryan,

> Iceberg can give you the data files that were added or deleted in a
> snapshot, but there isn't a good way to take those and actually read them
> as a DataFrame or select that data from a table in SQL. I'd think that's
> a good first step
One approach, which I'm currently prototyping, is this: I plan to expose an
incrementalScan API as part of the Table API. The IncrementalScan would take
start and end snapshot ids, from which it could construct a "diff" snapshot.
This snapshot would essentially be composed of all the data files that were
added since the start snapshot id. The diff snapshot would be a staging
snapshot: it would not be committed to the table, and it could be deleted
asynchronously after being read. We could then read this snapshot using the
existing planFiles method [org.apache.iceberg.DataTableScan#planFiles]. If you
think this approach is workable, I can create an issue and discuss it further
there, along with the prototype.

-Best

On Wed, Jul 17, 2019 at 12:55 PM Ryan Blue <rb...@netflix.com> wrote:

> I think it would be helpful to have a pattern for incremental processing.
> Iceberg can give you the data files that were added or deleted in a
> snapshot, but there isn't a good way to take those and actually read them
> as a DataFrame or select that data from a table in SQL. I'd think that's a
> good first step, but I'd also like to work on other concerns, like rollback
> and reprocessing. It would also be nice if such a system had a way to block
> snapshot expiration until all downstream incremental processes have
> completed.
>
> rb
>
> On Wed, Jul 17, 2019 at 12:46 PM RD <rdsr...@gmail.com> wrote:
>
>> Hi Iceberg devs,
>>   We are starting work on a somewhat similar project. The idea is that
>> users can ask for incremental data since the last snapshot they processed,
>> i.e. the delta that was added since the last snapshot. Do you think this
>> could be a general feature that would be beneficial to Iceberg?
>>
>> -R
>>
>> On Wed, Jul 17, 2019 at 10:16 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> You can do this using time-travel. First, read the table at each
>>> snapshot.
>>> This creates a temporary view for each snapshot:
>>>
>>> // create temp views for each snapshot
>>> spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L)
>>>     .load("db.table").createOrReplaceTempView("s1")
>>> spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L)
>>>     .load("db.table").createOrReplaceTempView("s2")
>>>
>>> Next, use a left outer join and a filter to find the rows that are only
>>> in the second snapshot:
>>>
>>> SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
>>> WHERE s1.id IS NULL
>>>
>>> Note that this is an expensive operation if you don't also filter the
>>> snapshots. You probably want to do this with a small table, or use
>>> filters to cut down on the data loaded that might match. For example,
>>> if you're writing events as they come in from Kafka, you can filter each
>>> snapshot to the last day of data, then run the left outer join.
>>>
>>> On Wed, Jul 17, 2019 at 9:50 AM aa bb <abcd1...@yahoo.com.invalid>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Could you please advise how we can get the delta data changes (diff)
>>>> between 2 snapshots?
>>>>
>>>> Is there any way, providing 2 snapshot ids (8924558786060583479,
>>>> 6536733823181975045), to get the records that were added after
>>>> 8924558786060583479?
>>>>
>>>> +-------------------------+---------------------+---------------------+---------------------+
>>>> | made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
>>>> +-------------------------+---------------------+---------------------+---------------------+
>>>> | 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
>>>> | 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
>>>> +-------------------------+---------------------+---------------------+---------------------+
>>>>
>>>> Thanks,
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
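[Editor's note] The "diff snapshot" proposed at the top of this thread can be
sketched as a toy model outside of Iceberg: walk the snapshot history in
commit order and collect the data files added after the start snapshot id, up
to and including the end snapshot id. This is an illustrative sketch in plain
Python, not Iceberg's API; the names `Snapshot` and `incremental_files` are
hypothetical, and real snapshot histories are not necessarily linear.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Snapshot:
    """Toy snapshot: an id plus the data files this commit added."""
    snapshot_id: int
    added_files: List[str] = field(default_factory=list)

def incremental_files(history: List[Snapshot], start_id: int, end_id: int) -> List[str]:
    """Collect files added strictly after start_id, up to and including
    end_id, assuming `history` is ordered oldest to newest."""
    files: List[str] = []
    collecting = False
    for snap in history:
        if collecting:
            files.extend(snap.added_files)
        if snap.snapshot_id == start_id:
            collecting = True  # start collecting from the next snapshot on
        if snap.snapshot_id == end_id:
            break
    return files

# Using the two snapshot ids from the question above:
history = [
    Snapshot(8924558786060583479, ["a.parquet"]),
    Snapshot(6536733823181975045, ["b.parquet", "c.parquet"]),
]
print(incremental_files(history, 8924558786060583479, 6536733823181975045))
# -> ['b.parquet', 'c.parquet']
```

In the actual proposal, this file list would back a staging snapshot that an
existing scan (planFiles) could read, rather than being returned directly.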