Re: [DISCUSS] Write-audit-publish support

2019-11-11 Thread Miao Wang
From a timeline perspective, we can't work on implementing this feature in the 
next couple of months. As a short-term workaround, we chose a lock mechanism at 
the application level.

@Anton Okolnychyi <aokolnyc...@apple.com.INVALID> If you can pick up this 
feature, that would be great!

Thanks!

Miao

From: Ryan Blue 
Reply-To: "dev@iceberg.apache.org" , 
"rb...@netflix.com" 
Date: Monday, November 11, 2019 at 11:54 AM
To: Anton Okolnychyi 
Cc: Iceberg Dev List , Ashish Mehta 

Subject: Re: [DISCUSS] Write-audit-publish support

I just had a direct request for this over the weekend, too. I opened #629 Add 
cherry-pick operation 
<https://github.com/apache/incubator-iceberg/issues/629> to track this.

On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
We would be interested in this functionality as well. We have a use case with 
multiple concurrent writers where we wanted to use WAP but couldn’t.


On 9 Nov 2019, at 01:32, Ryan Blue <rb...@netflix.com.INVALID> wrote:

Right now, there isn't a good way to manage multiple pending writes. Snapshots 
from each write are created based on the current table state, so simply moving 
to one of two pending commits would mean you ignore the changes in the other 
pending commit. We've considered adding a "cherry-pick" operation that can take 
the changes from one snapshot and apply them on top of another to solve that 
problem. If you'd like to implement that, I'd be happy to review it!

On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
Thanks Ryan, that worked out. Since it's a rollback, how can a user stage 
multiple WAP snapshots and commit them in any order, based on how the audit 
process works out?
I wonder whether this expectation goes against the underlying principles of Iceberg.

Thanks,
Ashish

On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Ashish, you can use the rollback table operation to set a particular snapshot 
as the current table state. Like this:

Table table = hiveCatalog.load(name);

table.rollback().toSnapshotId(id).commit();

On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
Hi Ryan,

Can you please point me to the doc where I can find how to publish a WAP 
snapshot? I am able to filter snapshots based on wap.id in the snapshot 
summary, but I don't know the official recommendation for committing that 
snapshot. I can think of cherry-picking Appended/Deleted files, but I don't 
know whether I'd be missing something important with this approach.

Thanks,
-Ashish

-- Forwarded message -
From: Ryan Blue <rb...@netflix.com.invalid>
Date: Wed, Jul 31, 2019 at 4:41 PM
Subject: Re: [DISCUSS] Write-audit-publish support
To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <aokolnyc...@apple.com>

Hi everyone, I've added PR #342 
<https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg 
repository with our WAP changes. Please have a look if you were 
interested in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:
I think this use case is pretty helpful in most data environments, we do the 
same sort of stage-check-publish pattern to run quality checks.
One question is, if say the audit part fails, is there a way to expire the 
snapshot or what would be the workflow that follows?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
This would be super helpful. We have a similar workflow where we do some 
validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip <filip@gmail.com> wrote:
This definitely sounds interesting. Quick question on whether this presents 
impact on the current Upserts spec? Or is it maybe that we are looking to 
associate this support for append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Re: [DISCUSS] Write-audit-publish support

2019-11-11 Thread Ryan Blue
I just had a direct request for this over the weekend, too. I opened #629
Add cherry-pick operation
<https://github.com/apache/incubator-iceberg/issues/629> to track this.

On Mon, Nov 11, 2019 at 1:43 AM Anton Okolnychyi 
wrote:

> We would be interested in this functionality as well. We have a use case
> with multiple concurrent writers where we wanted to use WAP but couldn’t.
>
> On 9 Nov 2019, at 01:32, Ryan Blue  wrote:
>
> Right now, there isn't a good way to manage multiple pending writes.
> Snapshots from each write are created based on the current table state, so
> simply moving to one of two pending commits would mean you ignore the
> changes in the other pending commit. We've considered adding a
> "cherry-pick" operation that can take the changes from one snapshot and
> apply them on top of another to solve that problem. If you'd like to
> implement that, I'd be happy to review it!
>
> On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta 
> wrote:
>
>> Thanks Ryan, that worked out. Since it's a rollback, how can a user stage
>> multiple WAP snapshots and commit them in any order, based on how the audit
>> process works out?
>> I wonder whether this expectation goes against the underlying principles of
>> Iceberg.
>>
>> Thanks,
>> Ashish
>>
>> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue 
>> wrote:
>>
>>> Ashish, you can use the rollback table operation to set a particular
>>> snapshot as the current table state. Like this:
>>>
>>> Table table = hiveCatalog.load(name);
>>> table.rollback().toSnapshotId(id).commit();
>>>
>>>
>>> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta 
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Can you please point me to the doc where I can find how to publish a
>>>> WAP snapshot? I am able to filter snapshots based on wap.id in the
>>>> snapshot summary, but I don't know the official recommendation for
>>>> committing that snapshot. I can think of cherry-picking Appended/Deleted
>>>> files, but I don't know whether I'd be missing something important with this approach.
>>>>
>>>> Thanks,
>>>> -Ashish
>>>>
>>>>
>>>>> -- Forwarded message -
>>>>> From: Ryan Blue 
>>>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>>>> To: Edgar Rodriguez 
>>>>> Cc: Iceberg Dev List , Anton Okolnychyi <
>>>>> aokolnyc...@apple.com>
>>>>>
>>>>>
>>>>> Hi everyone, I've added PR #342
>>>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>>>> repository with our WAP changes. Please have a look if you were interested
>>>>> in this.
>>>>>
>>>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>>>>> edgar.rodrig...@airbnb.com> wrote:
>>>>>
>>>>>> I think this use case is pretty helpful in most data environments, we
>>>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>>>> One question is, if say the audit part fails, is there a way to
>>>>>> expire the snapshot or what would be the workflow that follows?
>>>>>>
>>>>>> Best,
>>>>>> Edgar
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>>>>> moulimukher...@gmail.com> wrote:
>>>>>>
>>>>>>> This would be super helpful. We have a similar workflow where we do
>>>>>>> some validation before letting the downstream consume the changes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Mouli
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>>>>>>>
>>>>>>>> This definitely sounds interesting. Quick question on whether this
>>>>>>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>>>>>>> looking to associate this support for append-only commits?
>>>>>>>>
>>>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <
>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>&

Re: [DISCUSS] Write-audit-publish support

2019-11-11 Thread Anton Okolnychyi
We would be interested in this functionality as well. We have a use case with 
multiple concurrent writers where we wanted to use WAP but couldn’t.

> On 9 Nov 2019, at 01:32, Ryan Blue  wrote:
> 
> Right now, there isn't a good way to manage multiple pending writes. 
> Snapshots from each write are created based on the current table state, so 
> simply moving to one of two pending commits would mean you ignore the changes 
> in the other pending commit. We've considered adding a "cherry-pick" 
> operation that can take the changes from one snapshot and apply them on top 
> of another to solve that problem. If you'd like to implement that, I'd be 
> happy to review it!
> 
> On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
> Thanks Ryan, that worked out. Since it's a rollback, how can a user stage 
> multiple WAP snapshots and commit them in any order, based on how the audit 
> process works out?
> I wonder whether this expectation goes against the underlying principles of Iceberg.
> 
> Thanks,
> Ashish
> 
> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue  wrote:
> Ashish, you can use the rollback table operation to set a particular snapshot 
> as the current table state. Like this:
> 
> Table table = hiveCatalog.load(name);
> table.rollback().toSnapshotId(id).commit();
> 
> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta <mehta.ashis...@gmail.com> wrote:
> Hi Ryan, 
> 
> Can you please point me to the doc where I can find how to publish a WAP 
> snapshot? I am able to filter snapshots based on wap.id in the snapshot 
> summary, but I don't know the official recommendation for committing that 
> snapshot. I can think of cherry-picking Appended/Deleted files, but I don't 
> know whether I'd be missing something important with this approach.
> 
> Thanks,
> -Ashish
>  
> -- Forwarded message -
> From: Ryan Blue 
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez <edgar.rodrig...@airbnb.com>
> Cc: Iceberg Dev List <dev@iceberg.apache.org>, Anton Okolnychyi <aokolnyc...@apple.com>
> 
> 
> Hi everyone, I've added PR #342 
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg 
> repository with our WAP changes. Please have a look if you were interested in 
> this.
> 
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:
> I think this use case is pretty helpful in most data environments, we do the 
> same sort of stage-check-publish pattern to run quality checks. 
> One question is, if say the audit part fails, is there a way to expire the 
> snapshot or what would be the workflow that follows?
> 
> Best,
> Edgar
> 
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
> This would be super helpful. We have a similar workflow where we do some 
> validation before letting the downstream consume the changes.
> 
> Best,
> Mouli
> 
> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip@gmail.com> wrote:
> This definitely sounds interesting. Quick question on whether this presents 
> impact on the current Upserts spec? Or is it maybe that we are looking to 
> associate this support for append-only commits?
> 
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue  wrote:
> Audits run on the snapshot by setting the snapshot-id read option to read the 
> WAP snapshot, even though it has not (yet) been the current table state. This 
> is documented in the time travel 
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.
> 
> We added a stageOnly method to SnapshotProducer that adds the snapshot to 
> table metadata, but does not make it the current table state. That is called 
> by the Spark writer when there is a WAP ID, and that ID is embedded in the 
> staged snapshot’s metadata so processes can find it.
> 
> I'll add a PR with this code, since there is interest.
> 
> rb
> 
> 
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> I would also support adding this to Iceberg itself. I think we have a use 
> case where we can leverage this.
> 
> @Ryan, could you also provide more info on the audit process?
> 
> Thanks,
> Anton
> 
>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>> 
>> I think this could be useful. When we ingest data from Kafka, we do a 
>> predefined set of checks on the data. We can potentially utilize something 
>> like this to check for sanity before publishing.

Re: [DISCUSS] Write-audit-publish support

2019-11-08 Thread Ryan Blue
Right now, there isn't a good way to manage multiple pending writes.
Snapshots from each write are created based on the current table state, so
simply moving to one of two pending commits would mean you ignore the
changes in the other pending commit. We've considered adding a
"cherry-pick" operation that can take the changes from one snapshot and
apply them on top of another to solve that problem. If you'd like to
implement that, I'd be happy to review it!
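
To make the idea concrete, here is a purely hypothetical sketch of such an operation. No such API existed at the time of this thread, so the method names below are invented for illustration only; the actual operation is tracked in issue #629 mentioned at the top of this thread.

Table table = hiveCatalog.load(name);

// hypothetical: re-apply the changes from a staged or pending snapshot on top
// of the current table state, instead of discarding the other pending commit
table.cherryPick(stagedSnapshotId).commit();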

On Fri, Nov 8, 2019 at 3:29 PM Ashish Mehta 
wrote:

> Thanks Ryan, that worked out. Since it's a rollback, how can a user stage
> multiple WAP snapshots and commit them in any order, based on how the audit
> process works out?
> I wonder whether this expectation goes against the underlying principles of
> Iceberg.
>
> Thanks,
> Ashish
>
> On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue 
> wrote:
>
>> Ashish, you can use the rollback table operation to set a particular
>> snapshot as the current table state. Like this:
>>
>> Table table = hiveCatalog.load(name);
>> table.rollback().toSnapshotId(id).commit();
>>
>>
>> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta 
>> wrote:
>>
>>> Hi Ryan,
>>>
>>> Can you please point me to the doc where I can find how to publish a
>>> WAP snapshot? I am able to filter snapshots based on wap.id in the
>>> snapshot summary, but I don't know the official recommendation for
>>> committing that snapshot. I can think of cherry-picking Appended/Deleted
>>> files, but I don't know whether I'd be missing something important with this approach.
>>>
>>> Thanks,
>>> -Ashish
>>>
>>>
>>>> -- Forwarded message -
>>>> From: Ryan Blue 
>>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>>> To: Edgar Rodriguez 
>>>> Cc: Iceberg Dev List , Anton Okolnychyi <
>>>> aokolnyc...@apple.com>
>>>>
>>>>
>>>> Hi everyone, I've added PR #342
>>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>>> repository with our WAP changes. Please have a look if you were interested
>>>> in this.
>>>>
>>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>>>> edgar.rodrig...@airbnb.com> wrote:
>>>>
>>>>> I think this use case is pretty helpful in most data environments, we
>>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>>> One question is, if say the audit part fails, is there a way to expire
>>>>> the snapshot or what would be the workflow that follows?
>>>>>
>>>>> Best,
>>>>> Edgar
>>>>>
>>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>>>> moulimukher...@gmail.com> wrote:
>>>>>
>>>>>> This would be super helpful. We have a similar workflow where we do
>>>>>> some validation before letting the downstream consume the changes.
>>>>>>
>>>>>> Best,
>>>>>> Mouli
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>>>>>>
>>>>>>> This definitely sounds interesting. Quick question on whether this
>>>>>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>>>>>> looking to associate this support for append-only commits?
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>>>>>>> to read the WAP snapshot, even though it has not (yet) been the current
>>>>>>>> table state. This is documented in the time travel
>>>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>>>> Iceberg site.
>>>>>>>>
>>>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>>>> snapshot to table metadata, but does not make it the current table 
>>>>>>>> state.
>>>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID 
>>>>>>>> is
>>>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>

Re: [DISCUSS] Write-audit-publish support

2019-11-08 Thread Ashish Mehta
Thanks Ryan, that worked out. Since it's a rollback, how can a user stage
multiple WAP snapshots and commit them in any order, based on how the audit
process works out?
I wonder whether this expectation goes against the underlying principles of
Iceberg.

Thanks,
Ashish

On Fri, Nov 8, 2019 at 2:44 PM Ryan Blue  wrote:

> Ashish, you can use the rollback table operation to set a particular
> snapshot as the current table state. Like this:
>
> Table table = hiveCatalog.load(name);
> table.rollback().toSnapshotId(id).commit();
>
>
> On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta 
> wrote:
>
>> Hi Ryan,
>>
>> Can you please point me to the doc where I can find how to publish a
>> WAP snapshot? I am able to filter snapshots based on wap.id in the
>> snapshot summary, but I don't know the official recommendation for
>> committing that snapshot. I can think of cherry-picking Appended/Deleted
>> files, but I don't know whether I'd be missing something important with this approach.
>>
>> Thanks,
>> -Ashish
>>
>>
>>> -- Forwarded message -----
>>> From: Ryan Blue 
>>> Date: Wed, Jul 31, 2019 at 4:41 PM
>>> Subject: Re: [DISCUSS] Write-audit-publish support
>>> To: Edgar Rodriguez 
>>> Cc: Iceberg Dev List , Anton Okolnychyi <
>>> aokolnyc...@apple.com>
>>>
>>>
>>> Hi everyone, I've added PR #342
>>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>>> repository with our WAP changes. Please have a look if you were interested
>>> in this.
>>>
>>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>>> edgar.rodrig...@airbnb.com> wrote:
>>>
>>>> I think this use case is pretty helpful in most data environments, we
>>>> do the same sort of stage-check-publish pattern to run quality checks.
>>>> One question is, if say the audit part fails, is there a way to expire
>>>> the snapshot or what would be the workflow that follows?
>>>>
>>>> Best,
>>>> Edgar
>>>>
>>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>>> moulimukher...@gmail.com> wrote:
>>>>
>>>>> This would be super helpful. We have a similar workflow where we do
>>>>> some validation before letting the downstream consume the changes.
>>>>>
>>>>> Best,
>>>>> Mouli
>>>>>
>>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>>>>>
>>>>>> This definitely sounds interesting. Quick question on whether this
>>>>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>>>>> looking to associate this support for append-only commits?
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>>> Audits run on the snapshot by setting the snapshot-id read option
>>>>>>> to read the WAP snapshot, even though it has not (yet) been the current
>>>>>>> table state. This is documented in the time travel
>>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>>> Iceberg site.
>>>>>>>
>>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>>> snapshot to table metadata, but does not make it the current table 
>>>>>>> state.
>>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID 
>>>>>>> is
>>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>>>
>>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>
>>>>>>>> I would also support adding this to Iceberg itself. I think we have
>>>>>>>> a use case where we can leverage this.
>>>>>>>>
>>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anton
>>>>>>>>
>>>>>>>> On 20 Jul 2019, at 04:01, RD  wrote:
>>>>>

Re: [DISCUSS] Write-audit-publish support

2019-11-08 Thread Ryan Blue
Ashish, you can use the rollback table operation to set a particular
snapshot as the current table state. Like this:

Table table = hiveCatalog.load(name);
table.rollback().toSnapshotId(id).commit();
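
For the staged-snapshot case discussed in this thread, a rough sketch of a publish step that locates a snapshot by its wap.id summary property and then rolls the table to it might look like the following (the WAP ID value is a placeholder and error handling is omitted):

Table table = hiveCatalog.load(name);

// find the staged snapshot that carries the expected WAP ID in its summary
long stagedId = -1L;
for (Snapshot snap : table.snapshots()) {
  if ("nightly-2019-11-08".equals(snap.summary().get("wap.id"))) {
    stagedId = snap.snapshotId();
  }
}

// make the audited snapshot the current table state
table.rollback().toSnapshotId(stagedId).commit();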


On Fri, Nov 8, 2019 at 12:52 PM Ashish Mehta 
wrote:

> Hi Ryan,
>
> Can you please point me to the doc where I can find how to publish a WAP
> snapshot? I am able to filter snapshots based on wap.id in the snapshot
> summary, but I don't know the official recommendation for committing that
> snapshot. I can think of cherry-picking Appended/Deleted files, but I don't
> know whether I'd be missing something important with this approach.
>
> Thanks,
> -Ashish
>
>
>> -- Forwarded message -
>> From: Ryan Blue 
>> Date: Wed, Jul 31, 2019 at 4:41 PM
>> Subject: Re: [DISCUSS] Write-audit-publish support
>> To: Edgar Rodriguez 
>> Cc: Iceberg Dev List , Anton Okolnychyi <
>> aokolnyc...@apple.com>
>>
>>
>> Hi everyone, I've added PR #342
>> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
>> repository with our WAP changes. Please have a look if you were interested
>> in this.
>>
>> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
>> edgar.rodrig...@airbnb.com> wrote:
>>
>>> I think this use case is pretty helpful in most data environments, we do
>>> the same sort of stage-check-publish pattern to run quality checks.
>>> One question is, if say the audit part fails, is there a way to expire
>>> the snapshot or what would be the workflow that follows?
>>>
>>> Best,
>>> Edgar
>>>
>>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <
>>> moulimukher...@gmail.com> wrote:
>>>
>>>> This would be super helpful. We have a similar workflow where we do
>>>> some validation before letting the downstream consume the changes.
>>>>
>>>> Best,
>>>> Mouli
>>>>
>>>> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>>>>
>>>>> This definitely sounds interesting. Quick question on whether this
>>>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>>>> looking to associate this support for append-only commits?
>>>>>
>>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>>>>> wrote:
>>>>>
>>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>>> read the WAP snapshot, even though it has not (yet) been the current 
>>>>>> table
>>>>>> state. This is documented in the time travel
>>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the
>>>>>> Iceberg site.
>>>>>>
>>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>>
>>>>>> I'll add a PR with this code, since there is interest.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>
>>>>>>> I would also support adding this to Iceberg itself. I think we have
>>>>>>> a use case where we can leverage this.
>>>>>>>
>>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anton
>>>>>>>
>>>>>>> On 20 Jul 2019, at 04:01, RD  wrote:
>>>>>>>
>>>>>>> I think this could be useful. When we ingest data from Kafka, we do
>>>>>>> a predefined set of checks on the data. We can potentially utilize
>>>>>>> something like this to check for sanity before publishing.
>>>>>>>
>>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>>> it is not accessible from the table? Is it by convention?
>>>>>>>
>>>>>>> -R
>>>>>>>
>>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
>>>>>>> wrote:
>>>>&g

[DISCUSS] Write-audit-publish support

2019-11-08 Thread Ashish Mehta
Hi Ryan,

Can you please point me to the doc where I can find how to publish a WAP
snapshot? I am able to filter snapshots based on wap.id in the snapshot
summary, but I don't know the official recommendation for committing that
snapshot. I can think of cherry-picking Appended/Deleted files, but I don't
know whether I'd be missing something important with this approach.

Thanks,
-Ashish


> -- Forwarded message -
> From: Ryan Blue 
> Date: Wed, Jul 31, 2019 at 4:41 PM
> Subject: Re: [DISCUSS] Write-audit-publish support
> To: Edgar Rodriguez 
> Cc: Iceberg Dev List , Anton Okolnychyi <
> aokolnyc...@apple.com>
>
>
> Hi everyone, I've added PR #342
> <https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
> repository with our WAP changes. Please have a look if you were interested
> in this.
>
> On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <
> edgar.rodrig...@airbnb.com> wrote:
>
>> I think this use case is pretty helpful in most data environments, we do
>> the same sort of stage-check-publish pattern to run quality checks.
>> One question is, if say the audit part fails, is there a way to expire
>> the snapshot or what would be the workflow that follows?
>>
>> Best,
>> Edgar
>>
>> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee 
>> wrote:
>>
>>> This would be super helpful. We have a similar workflow where we do some
>>> validation before letting the downstream consume the changes.
>>>
>>> Best,
>>> Mouli
>>>
>>> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>>>
>>>> This definitely sounds interesting. Quick question on whether this
>>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>>> looking to associate this support for append-only commits?
>>>>
>>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>>>> wrote:
>>>>
>>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>>> read the WAP snapshot, even though it has not (yet) been the current table
>>>>> state. This is documented in the time travel
>>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>>> site.
>>>>>
>>>>> We added a stageOnly method to SnapshotProducer that adds the
>>>>> snapshot to table metadata, but does not make it the current table state.
>>>>> That is called by the Spark writer when there is a WAP ID, and that ID is
>>>>> embedded in the staged snapshot’s metadata so processes can find it.
>>>>>
>>>>> I'll add a PR with this code, since there is interest.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <
>>>>> aokolnyc...@apple.com> wrote:
>>>>>
>>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>>> use case where we can leverage this.
>>>>>>
>>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>> On 20 Jul 2019, at 04:01, RD  wrote:
>>>>>>
>>>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>>>> predefined set of checks on the data. We can potentially utilize 
>>>>>> something
>>>>>> like this to check for sanity before publishing.
>>>>>>
>>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>>> it is not accessible from the table? Is it by convention?
>>>>>>
>>>>>> -R
>>>>>>
>>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>>> data, then audit the result before publishing the data that was written 
>>>>>>> to
>>>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>>>
>>>>>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>>>>>> table snapshot, but doesn’t make that snapshot the current version of 
>>>>>>> the
>>>&g

Re: [DISCUSS] Write-audit-publish support

2019-07-31 Thread Ryan Blue
Hi everyone, I've added PR #342
<https://github.com/apache/incubator-iceberg/pull/342> to the Iceberg
repository with our WAP changes. Please have a look if you were interested
in this.

On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez 
wrote:

> I think this use case is pretty helpful in most data environments, we do
> the same sort of stage-check-publish pattern to run quality checks.
> One question is, if say the audit part fails, is there a way to expire the
> snapshot or what would be the workflow that follows?
>
> Best,
> Edgar
>
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee 
> wrote:
>
>> This would be super helpful. We have a similar workflow where we do some
>> validation before letting the downstream consume the changes.
>>
>> Best,
>> Mouli
>>
>> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>>
>>> This definitely sounds interesting. Quick question on whether this
>>> presents impact on the current Upserts spec? Or is it maybe that we are
>>> looking to associate this support for append-only commits?
>>>
>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>>> wrote:
>>>
 Audits run on the snapshot by setting the snapshot-id read option to
 read the WAP snapshot, even though it has not (yet) been the current table
 state. This is documented in the time travel
 <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
 site.

 We added a stageOnly method to SnapshotProducer that adds the snapshot
 to table metadata, but does not make it the current table state. That is
 called by the Spark writer when there is a WAP ID, and that ID is embedded
 in the staged snapshot’s metadata so processes can find it.

 I'll add a PR with this code, since there is interest.

 rb

 On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi 
 wrote:

> I would also support adding this to Iceberg itself. I think we have a
> use case where we can leverage this.
>
> @Ryan, could you also provide more info on the audit process?
>
> Thanks,
> Anton
>
> On 20 Jul 2019, at 04:01, RD  wrote:
>
> I think this could be useful. When we ingest data from Kafka, we do a
> predefined set of checks on the data. We can potentially utilize something
> like this to check for sanity before publishing.
>
> How is the auditing process supposed to find the new snapshot, since
> it is not accessible from the table? Is it by convention?
>
> -R
>
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> At Netflix, we have a pattern for building ETL jobs where we write
>> data, then audit the result before publishing the data that was written 
>> to
>> a final table. We call this WAP for write, audit, publish.
>>
>> We’ve added support in our Iceberg branch. A WAP write creates a new
>> table snapshot, but doesn’t make that snapshot the current version of the
>> table. Instead, a separate process audits the new snapshot and updates 
>> the
>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>> would be useful anywhere else until we talked to another company this 
>> week
>> that is interested in the same thing. So I wanted to check whether this 
>> is
>> a good feature to include in Iceberg itself.
>>
>> This works by staging a snapshot. Basically, Spark writes data as
>> expected, but Iceberg detects that it should not update the table’s 
>> current
>> stage. That happens when there is a Spark property, spark.wap.id,
>> that indicates the job is a WAP job. Then any table that has WAP enabled 
>> by
>> the table property write.wap.enabled=true will stage the new
>> snapshot instead of fully committing, with the WAP ID in the snapshot’s
>> metadata.
>>
>> Is this something we should open a PR to add to Iceberg? It seems a
>> little strange to make it appear that a commit has succeeded, but not
>> actually change a table, which is why we didn’t submit it before now.
>>
>> Thanks,
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>>
>>> --
>>> Filip Bocse
>>>
>>
>
> --
> Edgar Rodriguez
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Write-audit-publish support

2019-07-22 Thread Edgar Rodriguez
I think this use case is pretty helpful in most data environments, we do
the same sort of stage-check-publish pattern to run quality checks.
One question: if, say, the audit part fails, is there a way to expire the
snapshot, or what would the follow-up workflow look like?

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee 
wrote:

> This would be super helpful. We have a similar workflow where we do some
> validation before letting the downstream consume the changes.
>
> Best,
> Mouli
>
> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>
>> This definitely sounds interesting. Quick question on whether this
>> presents impact on the current Upserts spec? Or is it maybe that we are
>> looking to associate this support for append-only commits?
>>
>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>> wrote:
>>
>>> Audits run on the snapshot by setting the snapshot-id read option to
>>> read the WAP snapshot, even though it has not (yet) been the current table
>>> state. This is documented in the time travel
>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>> site.
>>>
>>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>>> to table metadata, but does not make it the current table state. That is
>>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>>> in the staged snapshot’s metadata so processes can find it.
>>>
>>> I'll add a PR with this code, since there is interest.
>>>
>>> rb
>>>
>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi 
>>> wrote:
>>>
 I would also support adding this to Iceberg itself. I think we have a
 use case where we can leverage this.

 @Ryan, could you also provide more info on the audit process?

 Thanks,
 Anton

 On 20 Jul 2019, at 04:01, RD  wrote:

 I think this could be useful. When we ingest data from Kafka, we do a
 predefined set of checks on the data. We can potentially utilize something
 like this to check for sanity before publishing.

 How is the auditing process supposed to find the new snapshot, since it
 is not accessible from the table? Is it by convention?

 -R

 On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
 wrote:

> Hi everyone,
>
> At Netflix, we have a pattern for building ETL jobs where we write
> data, then audit the result before publishing the data that was written to
> a final table. We call this WAP for write, audit, publish.
>
> We’ve added support in our Iceberg branch. A WAP write creates a new
> table snapshot, but doesn’t make that snapshot the current version of the
> table. Instead, a separate process audits the new snapshot and updates the
> table’s current snapshot when the audits succeed. I wasn’t sure that this
> would be useful anywhere else until we talked to another company this week
> that is interested in the same thing. So I wanted to check whether this is
> a good feature to include in Iceberg itself.
>
> This works by staging a snapshot. Basically, Spark writes data as
> expected, but Iceberg detects that it should not update the table’s 
> current
> stage. That happens when there is a Spark property, spark.wap.id,
> that indicates the job is a WAP job. Then any table that has WAP enabled 
> by
> the table property write.wap.enabled=true will stage the new snapshot
> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>
> Is this something we should open a PR to add to Iceberg? It seems a
> little strange to make it appear that a commit has succeeded, but not
> actually change a table, which is why we didn’t submit it before now.
>
> Thanks,
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Filip Bocse
>>
>

-- 
Edgar Rodriguez


Re: [DISCUSS] Write-audit-publish support

2019-07-22 Thread Mouli Mukherjee
This would be super helpful. We have a similar workflow where we do some
validation before letting the downstream consume the changes.

Best,
Mouli

On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:

> This definitely sounds interesting. Quick question on whether this
> presents impact on the current Upserts spec? Or is it maybe that we are
> looking to associate this support for append-only commits?
>
> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
> wrote:
>
>> Audits run on the snapshot by setting the snapshot-id read option to
>> read the WAP snapshot, even though it has not (yet) been the current table
>> state. This is documented in the time travel
>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>> site.
>>
>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>> to table metadata, but does not make it the current table state. That is
>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>> in the staged snapshot’s metadata so processes can find it.
>>
>> I'll add a PR with this code, since there is interest.
>>
>> rb
>>
>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi 
>> wrote:
>>
>>> I would also support adding this to Iceberg itself. I think we have a
>>> use case where we can leverage this.
>>>
>>> @Ryan, could you also provide more info on the audit process?
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 20 Jul 2019, at 04:01, RD  wrote:
>>>
>>> I think this could be useful. When we ingest data from Kafka, we do a
>>> predefined set of checks on the data. We can potentially utilize something
>>> like this to check for sanity before publishing.
>>>
>>> How is the auditing process supposed to find the new snapshot, since it
>>> is not accessible from the table? Is it by convention?
>>>
>>> -R
>>>
>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
>>> wrote:
>>>
 Hi everyone,

 At Netflix, we have a pattern for building ETL jobs where we write
 data, then audit the result before publishing the data that was written to
 a final table. We call this WAP for write, audit, publish.

 We’ve added support in our Iceberg branch. A WAP write creates a new
 table snapshot, but doesn’t make that snapshot the current version of the
 table. Instead, a separate process audits the new snapshot and updates the
 table’s current snapshot when the audits succeed. I wasn’t sure that this
 would be useful anywhere else until we talked to another company this week
 that is interested in the same thing. So I wanted to check whether this is
 a good feature to include in Iceberg itself.

 This works by staging a snapshot. Basically, Spark writes data as
 expected, but Iceberg detects that it should not update the table’s current
 stage. That happens when there is a Spark property, spark.wap.id, that
 indicates the job is a WAP job. Then any table that has WAP enabled by the
 table property write.wap.enabled=true will stage the new snapshot
 instead of fully committing, with the WAP ID in the snapshot’s metadata.

 Is this something we should open a PR to add to Iceberg? It seems a
 little strange to make it appear that a commit has succeeded, but not
 actually change a table, which is why we didn’t submit it before now.

 Thanks,

 rb
 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Filip Bocse
>


Re: [DISCUSS] Write-audit-publish support

2019-07-22 Thread Filip
This definitely sounds interesting. A quick question: does this have any impact
on the current Upserts spec? Or are we looking to associate this support only
with append-only commits?

On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue  wrote:

> Audits run on the snapshot by setting the snapshot-id read option to read
> the WAP snapshot, even though it has not (yet) been the current table
> state. This is documented in the time travel
> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
> site.
>
> We added a stageOnly method to SnapshotProducer that adds the snapshot to
> table metadata, but does not make it the current table state. That is
> called by the Spark writer when there is a WAP ID, and that ID is embedded
> in the staged snapshot’s metadata so processes can find it.
>
> I'll add a PR with this code, since there is interest.
>
> rb
>
> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi 
> wrote:
>
>> I would also support adding this to Iceberg itself. I think we have a use
>> case where we can leverage this.
>>
>> @Ryan, could you also provide more info on the audit process?
>>
>> Thanks,
>> Anton
>>
>> On 20 Jul 2019, at 04:01, RD  wrote:
>>
>> I think this could be useful. When we ingest data from Kafka, we do a
>> predefined set of checks on the data. We can potentially utilize something
>> like this to check for sanity before publishing.
>>
>> How is the auditing process supposed to find the new snapshot, since it
>> is not accessible from the table? Is it by convention?
>>
>> -R
>>
>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> At Netflix, we have a pattern for building ETL jobs where we write data,
>>> then audit the result before publishing the data that was written to a
>>> final table. We call this WAP for write, audit, publish.
>>>
>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>> table snapshot, but doesn’t make that snapshot the current version of the
>>> table. Instead, a separate process audits the new snapshot and updates the
>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>> would be useful anywhere else until we talked to another company this week
>>> that is interested in the same thing. So I wanted to check whether this is
>>> a good feature to include in Iceberg itself.
>>>
>>> This works by staging a snapshot. Basically, Spark writes data as
>>> expected, but Iceberg detects that it should not update the table’s current
>>> stage. That happens when there is a Spark property, spark.wap.id, that
>>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>>> table property write.wap.enabled=true will stage the new snapshot
>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>
>>> Is this something we should open a PR to add to Iceberg? It seems a
>>> little strange to make it appear that a commit has succeeded, but not
>>> actually change a table, which is why we didn’t submit it before now.
>>>
>>> Thanks,
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Filip Bocse


Re: [DISCUSS] Write-audit-publish support

2019-07-22 Thread Ryan Blue
Audits run on the snapshot by setting the snapshot-id read option to read
the WAP snapshot, even though it has not (yet) been the current table
state. This is documented in the time travel
<http://iceberg.apache.org/spark/#time-travel> section of the Iceberg site.
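
For example, an audit job can read a staged snapshot directly by its ID. This is a sketch only, assuming a SparkSession named spark; the table name and snapshot ID are placeholders:

// read a specific, possibly staged, snapshot by ID
Dataset<Row> staged = spark.read()
    .format("iceberg")
    .option("snapshot-id", 10963874102873L)
    .load("db.table");

// run whatever audit checks are needed against `staged`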

We added a stageOnly method to SnapshotProducer that adds the snapshot to
table metadata, but does not make it the current table state. That is
called by the Spark writer when there is a WAP ID, and that ID is embedded
in the staged snapshot’s metadata so processes can find it.

I'll add a PR with this code, since there is interest.

rb

On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi 
wrote:

> I would also support adding this to Iceberg itself. I think we have a use
> case where we can leverage this.
>
> @Ryan, could you also provide more info on the audit process?
>
> Thanks,
> Anton
>
> On 20 Jul 2019, at 04:01, RD  wrote:
>
> I think this could be useful. When we ingest data from Kafka, we do a
> predefined set of checks on the data. We can potentially utilize something
> like this to check for sanity before publishing.
>
> How is the auditing process supposed to find the new snapshot, since it is
> not accessible from the table? Is it by convention?
>
> -R
>
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> At Netflix, we have a pattern for building ETL jobs where we write data,
>> then audit the result before publishing the data that was written to a
>> final table. We call this WAP for write, audit, publish.
>>
>> We’ve added support in our Iceberg branch. A WAP write creates a new
>> table snapshot, but doesn’t make that snapshot the current version of the
>> table. Instead, a separate process audits the new snapshot and updates the
>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>> would be useful anywhere else until we talked to another company this week
>> that is interested in the same thing. So I wanted to check whether this is
>> a good feature to include in Iceberg itself.
>>
>> This works by staging a snapshot. Basically, Spark writes data as
>> expected, but Iceberg detects that it should not update the table’s current
>> stage. That happens when there is a Spark property, spark.wap.id, that
>> indicates the job is a WAP job. Then any table that has WAP enabled by the
>> table property write.wap.enabled=true will stage the new snapshot
>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>
>> Is this something we should open a PR to add to Iceberg? It seems a
>> little strange to make it appear that a commit has succeeded, but not
>> actually change a table, which is why we didn’t submit it before now.
>>
>> Thanks,
>>
>> rb
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Write-audit-publish support

2019-07-22 Thread Anton Okolnychyi
I would also support adding this to Iceberg itself. I think we have a use case 
where we can leverage this.

@Ryan, could you also provide more info on the audit process?

Thanks,
Anton

> On 20 Jul 2019, at 04:01, RD  wrote:
> 
> I think this could be useful. When we ingest data from Kafka, we do a 
> predefined set of checks on the data. We can potentially utilize something 
> like this to check for sanity before publishing.  
> 
> How is the auditing process supposed to find the new snapshot, since it is 
> not accessible from the table? Is it by convention?
> 
> -R 
> 
> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue  wrote:
> Hi everyone,
> 
> At Netflix, we have a pattern for building ETL jobs where we write data, then 
> audit the result before publishing the data that was written to a final 
> table. We call this WAP for write, audit, publish.
> 
> We’ve added support in our Iceberg branch. A WAP write creates a new table 
> snapshot, but doesn’t make that snapshot the current version of the table. 
> Instead, a separate process audits the new snapshot and updates the table’s 
> current snapshot when the audits succeed. I wasn’t sure that this would be 
> useful anywhere else until we talked to another company this week that is 
> interested in the same thing. So I wanted to check whether this is a good 
> feature to include in Iceberg itself.
> 
> This works by staging a snapshot. Basically, Spark writes data as expected, 
> but Iceberg detects that it should not update the table’s current stage. That 
> happens when there is a Spark property, spark.wap.id , 
> that indicates the job is a WAP job. Then any table that has WAP enabled by 
> the table property write.wap.enabled=true will stage the new snapshot instead 
> of fully committing, with the WAP ID in the snapshot’s metadata.
> 
> Is this something we should open a PR to add to Iceberg? It seems a little 
> strange to make it appear that a commit has succeeded, but not actually 
> change a table, which is why we didn’t submit it before now.
> 
> Thanks,
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Re: [DISCUSS] Write-audit-publish support

2019-07-19 Thread RD
I think this could be useful. When we ingest data from Kafka, we do a
predefined set of checks on the data. We can potentially utilize something
like this to check for sanity before publishing.

How is the auditing process supposed to find the new snapshot, since it is
not accessible from the table? Is it by convention?

-R

On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue  wrote:

> Hi everyone,
>
> At Netflix, we have a pattern for building ETL jobs where we write data,
> then audit the result before publishing the data that was written to a
> final table. We call this WAP for write, audit, publish.
>
> We’ve added support in our Iceberg branch. A WAP write creates a new table
> snapshot, but doesn’t make that snapshot the current version of the table.
> Instead, a separate process audits the new snapshot and updates the table’s
> current snapshot when the audits succeed. I wasn’t sure that this would be
> useful anywhere else until we talked to another company this week that is
> interested in the same thing. So I wanted to check whether this is a good
> feature to include in Iceberg itself.
>
> This works by staging a snapshot. Basically, Spark writes data as
> expected, but Iceberg detects that it should not update the table’s current
> stage. That happens when there is a Spark property, spark.wap.id, that
> indicates the job is a WAP job. Then any table that has WAP enabled by the
> table property write.wap.enabled=true will stage the new snapshot instead
> of fully committing, with the WAP ID in the snapshot’s metadata.
>
> Is this something we should open a PR to add to Iceberg? It seems a little
> strange to make it appear that a commit has succeeded, but not actually
> change a table, which is why we didn’t submit it before now.
>
> Thanks,
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[DISCUSS] Write-audit-publish support

2019-07-19 Thread Ryan Blue
Hi everyone,

At Netflix, we have a pattern for building ETL jobs where we write data,
then audit the result before publishing the data that was written to a
final table. We call this WAP for write, audit, publish.

We’ve added support in our Iceberg branch. A WAP write creates a new table
snapshot, but doesn’t make that snapshot the current version of the table.
Instead, a separate process audits the new snapshot and updates the table’s
current snapshot when the audits succeed. I wasn’t sure that this would be
useful anywhere else until we talked to another company this week that is
interested in the same thing. So I wanted to check whether this is a good
feature to include in Iceberg itself.

This works by staging a snapshot. Basically, Spark writes data as expected,
but Iceberg detects that it should not update the table’s current stage.
That happens when there is a Spark property, spark.wap.id, that indicates
the job is a WAP job. Then any table that has WAP enabled by the table
property write.wap.enabled=true will stage the new snapshot instead of
fully committing, with the WAP ID in the snapshot’s metadata.
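
As a rough sketch of this flow (assuming a SparkSession named spark, a Dataset<Row> named df, and a Table handle named table; the table name and WAP ID value are placeholders, and the exact write syntax depends on the Spark and Iceberg versions in use):

// one-time: enable WAP staging on the table
table.updateProperties().set("write.wap.enabled", "true").commit();

// mark this Spark job as a WAP job so its writes are staged, not published
spark.conf().set("spark.wap.id", "nightly-2019-07-19");

// write as usual; Iceberg stages the snapshot and records the WAP ID in its summary
df.write().format("iceberg").mode("append").save("db.table");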

Is this something we should open a PR to add to Iceberg? It seems a little
strange to make it appear that a commit has succeeded, but not actually
change a table, which is why we didn’t submit it before now.

Thanks,

rb
-- 
Ryan Blue
Software Engineer
Netflix