Thanks Ryan,

> Iceberg can give you the data files that were added or deleted in a
> snapshot, but there isn't a good way to take those and actually read them
> as a DataFrame or select that data from a table in SQL. I'd think that's
> a good first step
One approach, which I'm currently prototyping, is this: I plan to expose an
incrementalScan API as part of the Table API. The IncrementalScan would take
start and end snapshot ids, from which it could construct a "diff" snapshot.
This snapshot would essentially be composed of all the data files that were
added since the start snapshot id. The diff snapshot would be a staging
snapshot: it would not be committed to the table, and it could be deleted
asynchronously after being read. We could then read this snapshot using the
existing planFiles method [org.apache.iceberg.DataTableScan#planFiles]. If you
think this approach is workable, I can create an issue and discuss it further
there, along with the prototype.

-Best

On Wed, Jul 17, 2019 at 12:55 PM Ryan Blue <rb...@netflix.com> wrote:

> I think it would be helpful to have a pattern for incremental processing.
> Iceberg can give you the data files that were added or deleted in a
> snapshot, but there isn't a good way to take those and actually read them
> as a DataFrame or select that data from a table in SQL. I'd think that's a
> good first step, but I'd also like to work on other concerns, like rollback
> and reprocessing. It would also be nice if such a system had a way to block
> snapshot expiration until all downstream incremental processes have
> completed.
>
> rb
>
> On Wed, Jul 17, 2019 at 12:46 PM RD <rdsr...@gmail.com> wrote:
>
>> Hi Iceberg devs,
>>   We are starting work on a somewhat similar project. The idea is that
>> users can ask for incremental data since the last snapshot they processed,
>> i.e. the delta that was added since the last snapshot. Do you think this
>> could be a general feature that would be beneficial to Iceberg?
>>
>> -R
>>
>> On Wed, Jul 17, 2019 at 10:16 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> You can do this using time-travel. First, read the table at each
>>> snapshot.
>>> This creates a temporary view for each snapshot:
>>>
>>> // create temp views for each snapshot
>>> spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L)
>>>     .load("db.table").createOrReplaceTempView("s1")
>>> spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L)
>>>     .load("db.table").createOrReplaceTempView("s2")
>>>
>>> Next, use a left outer join and a filter to find the rows that are only
>>> in the second snapshot:
>>>
>>> SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id
>>> WHERE s1.id IS NULL
>>>
>>> Note that this is an expensive operation if you don't also filter the
>>> snapshots. You probably want to do this with a small table, or use
>>> filters to cut down on the data loaded that might match. For example,
>>> if you're writing events as they come in from Kafka, you can filter each
>>> snapshot to the last day of data, then run the left outer join.
>>>
>>> On Wed, Jul 17, 2019 at 9:50 AM aa bb <abcd1...@yahoo.com.invalid>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Could you please advise how we can get the delta data changes (diff)
>>>> between 2 snapshots?
>>>>
>>>> Is there any way, providing 2 snapshot ids (8924558786060583479,
>>>> 6536733823181975045), to get the records that were added after
>>>> 8924558786060583479?
>>>>
>>>> +-------------------------+---------------------+---------------------+---------------------+
>>>> | made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
>>>> +-------------------------+---------------------+---------------------+---------------------+
>>>> | 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
>>>> | 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
>>>> +-------------------------+---------------------+---------------------+---------------------+
>>>>
>>>> Thanks,
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
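[Editor's note] The "diff snapshot" proposed at the top of this thread can be
sketched as a toy model outside of Iceberg: walk the snapshot history in
commit order and collect the data files added after the start snapshot id, up
to and including the end snapshot id. This is an illustrative sketch in plain
Python, not Iceberg's API; the names `Snapshot` and `incremental_files` are
hypothetical, and real snapshot histories are not necessarily linear.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Snapshot:
    """Toy snapshot: an id plus the data files this commit added."""
    snapshot_id: int
    added_files: List[str] = field(default_factory=list)

def incremental_files(history: List[Snapshot], start_id: int, end_id: int) -> List[str]:
    """Collect files added strictly after start_id, up to and including
    end_id, assuming `history` is ordered oldest to newest."""
    files: List[str] = []
    collecting = False
    for snap in history:
        if collecting:
            files.extend(snap.added_files)
        if snap.snapshot_id == start_id:
            collecting = True  # start collecting from the next snapshot on
        if snap.snapshot_id == end_id:
            break
    return files

# Using the two snapshot ids from the question above:
history = [
    Snapshot(8924558786060583479, ["a.parquet"]),
    Snapshot(6536733823181975045, ["b.parquet", "c.parquet"]),
]
print(incremental_files(history, 8924558786060583479, 6536733823181975045))
# -> ['b.parquet', 'c.parquet']
```

In the actual proposal, this file list would back a staging snapshot that an
existing scan (planFiles) could read, rather than being returned directly.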