You can do this using time travel. First, read the table at each snapshot.
This creates a temporary view for each snapshot:

// create temp views for each snapshot
spark.read.format("iceberg").option("snapshot-id",
    8924558786060583479L).load("db.table").createOrReplaceTempView("s1")
spark.read.format("iceberg").option("snapshot-id",
    6536733823181975045L).load("db.table").createOrReplaceTempView("s2")

Next, use a left-outer join and a filter to find the rows that are only in
the second snapshot:

SELECT s2.*
FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
WHERE s1.id IS NULL
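That left outer join plus the IS NULL filter is effectively a left anti-join:
keep rows from the newer snapshot whose key has no match in the older one.
Here's a minimal plain-Python sketch of the same logic, just to show what the
query computes (the `snapshot_diff` helper and the sample rows are made up
for illustration):

```python
def snapshot_diff(old_rows, new_rows, key="id"):
    """Return rows present in new_rows but not in old_rows, matched by key."""
    old_keys = {row[key] for row in old_rows}
    return [row for row in new_rows if row[key] not in old_keys]

# sample data: s2 has one row that was added after s1
s1 = [{"id": 1, "event": "a"}, {"id": 2, "event": "b"}]
s2 = [{"id": 1, "event": "a"}, {"id": 2, "event": "b"}, {"id": 3, "event": "c"}]

print(snapshot_diff(s1, s2))  # -> [{'id': 3, 'event': 'c'}]
```

Note this only finds additions; rows deleted between the snapshots would need
the join run the other way around.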

Note that this is an expensive operation if you don't also filter the
snapshots. You probably want to do this only with a small table, or use
filters to cut down the data loaded on each side of the join. For example,
if you're writing events as they come in from Kafka, you can filter each
snapshot down to the last day of data, then run the left outer join.
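Continuing the plain-Python sketch above, the filter-then-join step for the
Kafka case might look like this (the `ts` column name, the cutoff, and the
sample rows are assumptions, not anything from the actual table):

```python
from datetime import datetime

def diff_recent(old_rows, new_rows, cutoff, key="id", ts="ts"):
    """Anti-join restricted to recent rows: filter both sides to rows at or
    after the cutoff first, so the diff only touches the last window of data."""
    old_keys = {r[key] for r in old_rows if r[ts] >= cutoff}
    return [r for r in new_rows if r[ts] >= cutoff and r[key] not in old_keys]

cutoff = datetime(2019, 2, 9)
s1 = [{"id": 1, "ts": datetime(2019, 2, 8)},        # before the cutoff, dropped
      {"id": 2, "ts": datetime(2019, 2, 9, 12)}]
s2 = s1 + [{"id": 3, "ts": datetime(2019, 2, 9, 19)}]

print(diff_recent(s1, s2, cutoff))  # only the newly added recent row (id 3)
```

In real Spark SQL this just means adding the same timestamp predicate to both
sides (or to the temp-view reads) before joining, so each scan can prune files.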

On Wed, Jul 17, 2019 at 9:50 AM aa bb <abcd1...@yahoo.com.invalid> wrote:

> Hi,
>
> Could you please advise how we can get delta data changes (diff) between 2
> Snapshots?
>
> Is there any way providing 2 Snapshot Ids (8924558786060583479,
> 6536733823181975045) and get records that added after 8924558786060583479
> ?
>
> +-------------------------+---------------------+---------------------+---------------------+
> | made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
> +-------------------------+---------------------+---------------------+---------------------+
> | 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
> | 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
> +-------------------------+---------------------+---------------------+---------------------+
>
> Thanks,
>


-- 
Ryan Blue
Software Engineer
Netflix
