Like having a system table to store the in-use snapshots? Isn't this incremental 
processing much like incremental pull in Hudi?

-Li

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, "rb...@netflix.com" 
<rb...@netflix.com>
Date: Thursday, July 18, 2019 at 3:55 AM
To: RD <rdsr...@gmail.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Getting delta of data changes between 2 Snapshots(Internet mail)

I think it would be helpful to have a pattern for incremental processing. 
Iceberg can give you the data files that were added or deleted in a snapshot, 
but there isn't a good way to take those and actually read them as a DataFrame 
or select that data from a table in SQL. I'd think that's a good first step, 
but I'd also like to work on other concerns, like rollback and reprocessing. It 
would also be nice if such a system had a way to block snapshot expiration 
until all downstream incremental processes have completed.

rb

On Wed, Jul 17, 2019 at 12:46 PM RD <rdsr...@gmail.com> wrote:
Hi Iceberg devs,
   We are starting work on a somewhat similar project. The idea is that users 
can ask for incremental data since the last snapshot they processed, i.e. the 
delta that was added since the last snapshot. Do you think this could be a 
general feature that would be beneficial to Iceberg?

-R

On Wed, Jul 17, 2019 at 10:16 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

You can do this using time-travel. First, read the table at each snapshot and 
register each one as a temporary view:

// create temp views for each snapshot
// (note: the Spark API is createOrReplaceTempView, not createOrReplaceTempTable)

spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L)
    .load("db.table").createOrReplaceTempView("s1")

spark.read.format("iceberg").option("snapshot-id", 6536733823181975045L)
    .load("db.table").createOrReplaceTempView("s2")

Next, use a left-outer join and a filter to find the rows that are only in the 
second snapshot:

SELECT s2.* FROM s2 LEFT OUTER JOIN s1 ON s1.id = s2.id WHERE s1.id IS NULL

Note that this is an expensive operation if you don’t also filter the 
snapshots. You probably want to do this with a small table, or use filters to 
cut down on the data loaded that might match. For example, if you’re writing 
events as they come in from Kafka, you can filter each snapshot to the last day 
of data, then run the left outer join.
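The semantics of that left-outer-join diff can be sketched in plain Python 
(illustrative only: the rows, the "event" field, and the choice of "id" as the 
join key are made up, not from the actual table):

```python
# Sketch of the LEFT OUTER JOIN ... WHERE s1.id IS NULL (anti-join) diff.
# s1 and s2 stand in for the two snapshot reads; rows here are invented.
s1 = [{"id": 1, "event": "a"}, {"id": 2, "event": "b"}]
s2 = [{"id": 1, "event": "a"}, {"id": 2, "event": "b"}, {"id": 3, "event": "c"}]

# Rows in s2 whose id has no match in s1: the delta added after snapshot s1.
s1_ids = {row["id"] for row in s1}
delta = [row for row in s2 if row["id"] not in s1_ids]
print(delta)  # → [{'id': 3, 'event': 'c'}]
```

This also shows why a unique join key matters: if `id` is not unique, the join 
form can duplicate or miss rows.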

On Wed, Jul 17, 2019 at 9:50 AM aa bb <abcd1...@yahoo.com.invalid> wrote:
Hi,

Could you please advise how we can get delta data changes (diff) between 2 
Snapshots?

Is there any way, given 2 snapshot IDs (8924558786060583479, 
6536733823181975045), to get the records that were added after 8924558786060583479?

+-------------------------+---------------------+---------------------+---------------------+
| made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
+-------------------------+---------------------+---------------------+---------------------+
| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
+-------------------------+---------------------+---------------------+---------------------+



Thanks,


--
Ryan Blue
Software Engineer
Netflix

