Hi everyone, I've opened PR #342 <https://github.com/apache/incubator-iceberg/pull/342> against the Iceberg repository with our WAP changes. Please have a look if you're interested in this.
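For readers skimming the thread below: the staging behavior being discussed can be sketched in plain Python. This is a toy simulation of the semantics, not Iceberg's Java API; all class and method names here are hypothetical, while the property names (`write.wap.enabled`, the WAP ID recorded in snapshot metadata) come from the thread itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    # Arbitrary key/value metadata; a staged WAP snapshot carries its WAP ID here.
    summary: Dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    """Toy model of a table whose metadata can hold snapshots that are
    not (yet) the current table state."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None
    properties: Dict[str, str] = field(default_factory=dict)
    _next_id: int = 1

    def commit_write(self, wap_id: Optional[str] = None) -> Snapshot:
        snap = Snapshot(self._next_id)
        self._next_id += 1
        self.snapshots.append(snap)  # always added to table metadata
        wap_enabled = self.properties.get("write.wap.enabled") == "true"
        if wap_id is not None and wap_enabled:
            # Stage only: record the WAP ID but leave the current state alone.
            snap.summary["wap.id"] = wap_id
        else:
            # Normal commit: the new snapshot becomes the current state.
            self.current_snapshot_id = snap.snapshot_id
        return snap


table = Table(properties={"write.wap.enabled": "true"})
staged = table.commit_write(wap_id="job-2019-07-22")
assert table.current_snapshot_id is None  # commit "succeeded", table state unchanged
assert staged.summary["wap.id"] == "job-2019-07-22"
```

The key point the sketch illustrates: the write is durable (the snapshot exists in table metadata), but readers of the table see no change until a separate process promotes the staged snapshot.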
On Mon, Jul 22, 2019 at 11:05 AM Edgar Rodriguez <edgar.rodrig...@airbnb.com> wrote:

> I think this use case is pretty helpful in most data environments; we do
> the same sort of stage-check-publish pattern to run quality checks.
> One question: if the audit part fails, is there a way to expire the
> snapshot, or what would be the workflow that follows?
>
> Best,
> Edgar
>
> On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee <moulimukher...@gmail.com> wrote:
>
>> This would be super helpful. We have a similar workflow where we do some
>> validation before letting the downstream consume the changes.
>>
>> Best,
>> Mouli
>>
>> On Mon, Jul 22, 2019 at 9:18 AM Filip <filip....@gmail.com> wrote:
>>
>>> This definitely sounds interesting. Quick question: does this have any
>>> impact on the current Upserts spec, or is the intent to support this
>>> only for append-only commits?
>>>
>>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> Audits run on the snapshot by setting the snapshot-id read option to
>>>> read the WAP snapshot, even though it has not (yet) become the current
>>>> table state. This is documented in the time travel
>>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>>> site.
>>>>
>>>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>>>> to table metadata but does not make it the current table state. That is
>>>> called by the Spark writer when there is a WAP ID, and that ID is
>>>> embedded in the staged snapshot's metadata so processes can find it.
>>>>
>>>> I'll add a PR with this code, since there is interest.
>>>>
>>>> rb
>>>>
>>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>
>>>>> I would also support adding this to Iceberg itself. I think we have a
>>>>> use case where we can leverage this.
>>>>>
>>>>> @Ryan, could you also provide more info on the audit process?
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>> On 20 Jul 2019, at 04:01, RD <rdsr...@gmail.com> wrote:
>>>>>
>>>>> I think this could be useful. When we ingest data from Kafka, we run a
>>>>> predefined set of checks on the data. We could potentially use
>>>>> something like this to check for sanity before publishing.
>>>>>
>>>>> How is the auditing process supposed to find the new snapshot, since
>>>>> it is not accessible from the table? Is it by convention?
>>>>>
>>>>> -R
>>>>>
>>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>>> data, then audit the result before publishing the written data to a
>>>>>> final table. We call this WAP, for write, audit, publish.
>>>>>>
>>>>>> We've added support in our Iceberg branch. A WAP write creates a new
>>>>>> table snapshot, but doesn't make that snapshot the current version of
>>>>>> the table. Instead, a separate process audits the new snapshot and
>>>>>> updates the table's current snapshot when the audits succeed. I wasn't
>>>>>> sure that this would be useful anywhere else until we talked to
>>>>>> another company this week that is interested in the same thing, so I
>>>>>> wanted to check whether this is a good feature to include in Iceberg
>>>>>> itself.
>>>>>>
>>>>>> This works by staging a snapshot. Spark writes data as expected, but
>>>>>> Iceberg detects that it should not update the table's current state.
>>>>>> That happens when there is a Spark property, spark.wap.id, that
>>>>>> indicates the job is a WAP job. Then any table that has WAP enabled by
>>>>>> the table property write.wap.enabled=true will stage the new snapshot
>>>>>> instead of fully committing, with the WAP ID in the snapshot's
>>>>>> metadata.
>>>>>>
>>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>>> little strange to make it appear that a commit has succeeded but not
>>>>>> actually change a table, which is why we didn't submit it before now.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Filip Bocse
>
> --
> Edgar Rodriguez

--
Ryan Blue
Software Engineer
Netflix
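Tying together two questions raised in the thread (how an auditing process finds the staged snapshot, and what happens when the audit fails), here is a hedged sketch of the audit-then-publish step in plain Python. It follows the convention Ryan describes, locating the snapshot by the WAP ID embedded in its metadata; the function names and the dict-based table model are hypothetical, not Iceberg API.

```python
from typing import Callable, Dict, List, Optional


def find_staged_snapshot(snapshots: List[Dict], wap_id: str) -> Optional[Dict]:
    """Locate a staged snapshot by the WAP ID embedded in its metadata."""
    for snap in snapshots:
        if snap.get("summary", {}).get("wap.id") == wap_id:
            return snap
    return None


def audit_then_publish(table: Dict, wap_id: str,
                       audit: Callable[[Dict], bool]) -> bool:
    """Run the audit on the staged snapshot; make it the current table
    state on success, or drop it on failure so it can be expired."""
    snap = find_staged_snapshot(table["snapshots"], wap_id)
    if snap is None:
        raise LookupError(f"no staged snapshot for WAP ID {wap_id}")
    if audit(snap):
        table["current-snapshot-id"] = snap["snapshot-id"]  # publish
        return True
    table["snapshots"].remove(snap)  # failed audit: expire the snapshot
    return False


table = {
    "current-snapshot-id": None,
    "snapshots": [{"snapshot-id": 42, "summary": {"wap.id": "job-1"}}],
}
ok = audit_then_publish(table, "job-1", audit=lambda s: True)
assert ok and table["current-snapshot-id"] == 42
```

In a real deployment the audit itself would read the staged snapshot with the snapshot-id read option mentioned above, and a failed snapshot would be removed through snapshot expiration rather than a list mutation.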