I think it would be helpful to have a pattern for incremental processing. Iceberg can give you the data files that were added or deleted in a snapshot, but there isn't a good way to take those and actually read them as a DataFrame or select that data from a table in SQL. I'd think that's a good first step, but I'd also like to work on other concerns, like rollback and reprocessing. It would also be nice if such a system had a way to block snapshot expiration until all downstream incremental processes have completed.
rb

On Wed, Jul 17, 2019 at 12:46 PM RD <rdsr...@gmail.com> wrote:

> Hi Iceberg devs,
>   We are starting work on a somewhat similar project. The idea is that
> users can ask for incremental data since the last snapshot they processed,
> i.e. the delta that was added since the last snapshot. Do you think this
> could be a general feature that would be beneficial to Iceberg?
>
> -R
>
> On Wed, Jul 17, 2019 at 10:16 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> You can do this using time-travel. First, read the table at each
>> snapshot. This creates a temporary view for both snapshots:
>>
>> // create temp views for each snapshot
>> spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L)
>>   .load("db.table").createOrReplaceTempView("s1")
>> spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L)
>>   .load("db.table").createOrReplaceTempView("s2")
>>
>> Next, use a left outer join and a filter to find the rows that are only
>> in the second snapshot:
>>
>> SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
>> WHERE s1.id IS NULL
>>
>> Note that this is an expensive operation if you don't also filter the
>> snapshots. You probably want to do this with a small table, or you want to
>> use filters to cut down on the data loaded that might match. For example,
>> if you're writing events as they come in from Kafka, you can filter each
>> snapshot to the last day of data, then run the left outer join.
>>
>> On Wed, Jul 17, 2019 at 9:50 AM aa bb <abcd1...@yahoo.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> Could you please advise how we can get delta data changes (diff) between
>>> two snapshots?
>>>
>>> Is there any way, providing two snapshot IDs (8924558786060583479,
>>> 6536733823181975045), to get the records that were added after
>>> 8924558786060583479?
>>>
>>> +-------------------------+---------------------+---------------------+---------------------+
>>> | made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
>>> +-------------------------+---------------------+---------------------+---------------------+
>>> | 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
>>> | 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
>>> +-------------------------+---------------------+---------------------+---------------------+
>>>
>>> Thanks,
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
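The left-outer-join diff described in the thread can be sketched end to end with a self-contained example. Here sqlite3 stands in for Spark SQL, and the `s1`/`s2` tables play the role of the per-snapshot temporary views; the data is invented for illustration.

```python
# Sketch of the snapshot-diff query, with sqlite3 standing in for Spark SQL.
# s1 and s2 stand in for the per-snapshot temporary views.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s1 (id INTEGER, payload TEXT);
    CREATE TABLE s2 (id INTEGER, payload TEXT);
    -- s1: table contents at the older snapshot
    INSERT INTO s1 VALUES (1, 'a'), (2, 'b');
    -- s2: contents at the newer snapshot (rows 3 and 4 were added since)
    INSERT INTO s2 VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
""")

# Rows present only in the newer snapshot: left outer join on the key,
# then keep the rows that found no match on the older side.
rows = conn.execute("""
    SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
    WHERE s1.id IS NULL
    ORDER BY s2.id
""").fetchall()
print(rows)  # [(3, 'c'), (4, 'd')]
```

As the thread notes, this join gets expensive on a real table; the Spark version would additionally push down filters (e.g. a time window) on both snapshot reads before joining.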