I think it would be helpful to have a pattern for incremental processing. Iceberg can give you the data files that were added or deleted in a snapshot, but there isn't a good way to take those and actually read them as a DataFrame or select that data from a table in SQL. I'd think that's a good first step, but I'd also like to work on other concerns, like rollback and reprocessing. It would also be nice if such a system had a way to block snapshot expiration until all downstream incremental processes have completed.
rb

On Wed, Jul 17, 2019 at 12:46 PM RD <rdsr...@gmail.com> wrote:

> Hi Iceberg devs,
>   We are starting work on a somewhat similar project. The idea is that
> users can ask for incremental data since the last snapshot they processed,
> i.e. the delta that was added since the last snapshot. Do you think this
> could be a general feature that would be beneficial to Iceberg?
>
> -R
>
> On Wed, Jul 17, 2019 at 10:16 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> You can do this using time-travel. First, read the table at each
>> snapshot. This creates a temporary view for both snapshots:
>>
>> // create temp views for each snapshot
>> spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L)
>>   .load("db.table").createOrReplaceTempView("s1")
>> spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L)
>>   .load("db.table").createOrReplaceTempView("s2")
>>
>> Next, use a left outer join and a filter to find the rows that are only
>> in the second snapshot:
>>
>> SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
>> WHERE s1.id IS NULL
>>
>> Note that this is an expensive operation if you don't also filter the
>> snapshots. You probably want to do this with a small table, or you want to
>> use filters to cut down on the data loaded that might match. For example,
>> if you're writing events as they come in from Kafka, you can filter each
>> snapshot to the last day of data, then run the left outer join.
>>
>> On Wed, Jul 17, 2019 at 9:50 AM aa bb <abcd1...@yahoo.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> Could you please advise how we can get delta data changes (diff) between
>>> two snapshots?
>>>
>>> Is there any way, providing two snapshot IDs (8924558786060583479,
>>> 6536733823181975045), to get the records that were added after
>>> 8924558786060583479?
>>>
>>> +-------------------------+---------------------+---------------------+---------------------+
>>> | made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
>>> +-------------------------+---------------------+---------------------+---------------------+
>>> | 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
>>> | 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
>>> +-------------------------+---------------------+---------------------+---------------------+
>>>
>>> Thanks,
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
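The left-outer-join diff described in the thread can be sketched end to end with a self-contained example. Here sqlite3 stands in for Spark SQL, and the `s1`/`s2` tables play the role of the per-snapshot temporary views; the data is invented for illustration.

```python
# Sketch of the snapshot-diff query, with sqlite3 standing in for Spark SQL.
# s1 and s2 stand in for the per-snapshot temporary views.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s1 (id INTEGER, payload TEXT);
    CREATE TABLE s2 (id INTEGER, payload TEXT);
    -- s1: table contents at the older snapshot
    INSERT INTO s1 VALUES (1, 'a'), (2, 'b');
    -- s2: contents at the newer snapshot (rows 3 and 4 were added since)
    INSERT INTO s2 VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
""")

# Rows present only in the newer snapshot: left outer join on the key,
# then keep the rows that found no match on the older side.
rows = conn.execute("""
    SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
    WHERE s1.id IS NULL
    ORDER BY s2.id
""").fetchall()
print(rows)  # [(3, 'c'), (4, 'd')]
```

As the thread notes, this join gets expensive on a real table; the Spark version would additionally push down filters (e.g. a time window) on both snapshot reads before joining.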