Hi everyone, I've opened PR #342 <https://github.com/apache/incubator-iceberg/pull/342> against the Iceberg repository with our WAP changes. Please have a look if you're interested in this.
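For readers skimming the thread below: the staging behavior being discussed can be sketched in plain Python. This is a toy simulation of the semantics, not Iceberg's Java API; all class and method names here are hypothetical, while the property names (`write.wap.enabled`, the WAP ID recorded in snapshot metadata) come from the thread itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    # Arbitrary key/value metadata; a staged WAP snapshot carries its WAP ID here.
    summary: Dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    """Toy model of a table whose metadata can hold snapshots that are
    not (yet) the current table state."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None
    properties: Dict[str, str] = field(default_factory=dict)
    _next_id: int = 1

    def commit_write(self, wap_id: Optional[str] = None) -> Snapshot:
        snap = Snapshot(self._next_id)
        self._next_id += 1
        self.snapshots.append(snap)  # always added to table metadata
        wap_enabled = self.properties.get("write.wap.enabled") == "true"
        if wap_id is not None and wap_enabled:
            # Stage only: record the WAP ID but leave the current state alone.
            snap.summary["wap.id"] = wap_id
        else:
            # Normal commit: the new snapshot becomes the current state.
            self.current_snapshot_id = snap.snapshot_id
        return snap


table = Table(properties={"write.wap.enabled": "true"})
staged = table.commit_write(wap_id="job-2019-07-22")
assert table.current_snapshot_id is None  # commit "succeeded", table state unchanged
assert staged.summary["wap.id"] == "job-2019-07-22"
```

The key point the sketch illustrates: the write is durable (the snapshot exists in table metadata), but readers of the table see no change until a separate process promotes the staged snapshot.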
On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:

> I think this use case is pretty helpful in most data environments; we do
> the same sort of stage-check-publish pattern to run quality checks.
> One question: if the audit part fails, is there a way to expire the
> snapshot, or what would be the workflow that follows?
>
> Best,
> Edgar
>
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
>
>> This would be super helpful. We have a similar workflow where we do some
>> validation before letting the downstream consume the changes.
>>
>> Best,
>> Mouli
>>
>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>
>>> This definitely sounds interesting. Quick question: does this have any
>>> impact on the current Upserts spec, or is the intent to support this
>>> only for append-only commits?
>>>
>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>> read the WAP snapshot, even though it has not (yet) become the current
>>>> table state. This is documented in the time travel
>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>> site.
>>>>
>>>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>>>> to table metadata but does not make it the current table state. That is
>>>> called by the Spark writer when there is a WAP ID, and that ID is
>>>> embedded in the staged snapshot's metadata so processes can find it.
>>>>
>>>> I'll add a PR with this code, since there is interest.
>>>>
>>>> rb
>>>>
>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>
>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>> use case where we can leverage this.
>>>>>
>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>
>>>>> I think this could be useful. When we ingest data from Kafka, we run a
>>>>> predefined set of checks on the data. We could potentially use
>>>>> something like this to check for sanity before publishing.
>>>>>
>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>> it is not accessible from the table? Is it by convention?
>>>>>
>>>>> -R
>>>>>
>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>> data, then audit the result before publishing the written data to a
>>>>>> final table. We call this WAP, for write, audit, publish.
>>>>>>
>>>>>> We've added support in our Iceberg branch. A WAP write creates a new
>>>>>> table snapshot, but doesn't make that snapshot the current version of
>>>>>> the table. Instead, a separate process audits the new snapshot and
>>>>>> updates the table's current snapshot when the audits succeed. I wasn't
>>>>>> sure that this would be useful anywhere else until we talked to
>>>>>> another company this week that is interested in the same thing, so I
>>>>>> wanted to check whether this is a good feature to include in Iceberg
>>>>>> itself.
>>>>>>
>>>>>> This works by staging a snapshot. Spark writes data as expected, but
>>>>>> Iceberg detects that it should not update the table's current state.
>>>>>> That happens when there is a Spark property, spark.wap.id, that
>>>>>> indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>> the table property write.wap.enabled=true will stage the new snapshot
>>>>>> instead of fully committing, with the WAP ID in the snapshot's
>>>>>> metadata.
>>>>>>
>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>> little strange to make it appear that a commit has succeeded but not
>>>>>> actually change a table, which is why we didn't submit it before now.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Filip Bocse
>
> --
> Edgar Rodriguez

--
Ryan Blue
Software Engineer
Netflix
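Tying together two questions raised in the thread (how an auditing process finds the staged snapshot, and what happens when the audit fails), here is a hedged sketch of the audit-then-publish step in plain Python. It follows the convention Ryan describes, locating the snapshot by the WAP ID embedded in its metadata; the function names and the dict-based table model are hypothetical, not Iceberg API.

```python
from typing import Callable, Dict, List, Optional


def find_staged_snapshot(snapshots: List[Dict], wap_id: str) -> Optional[Dict]:
    """Locate a staged snapshot by the WAP ID embedded in its metadata."""
    for snap in snapshots:
        if snap.get("summary", {}).get("wap.id") == wap_id:
            return snap
    return None


def audit_then_publish(table: Dict, wap_id: str,
                       audit: Callable[[Dict], bool]) -> bool:
    """Run the audit on the staged snapshot; make it the current table
    state on success, or drop it on failure so it can be expired."""
    snap = find_staged_snapshot(table["snapshots"], wap_id)
    if snap is None:
        raise LookupError(f"no staged snapshot for WAP ID {wap_id}")
    if audit(snap):
        table["current-snapshot-id"] = snap["snapshot-id"]  # publish
        return True
    table["snapshots"].remove(snap)  # failed audit: expire the snapshot
    return False


table = {
    "current-snapshot-id": None,
    "snapshots": [{"snapshot-id": 42, "summary": {"wap.id": "job-1"}}],
}
ok = audit_then_publish(table, "job-1", audit=lambda s: True)
assert ok and table["current-snapshot-id"] == 42
```

In a real deployment the audit itself would read the staged snapshot with the snapshot-id read option mentioned above, and a failed snapshot would be removed through snapshot expiration rather than a list mutation.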