I have a draft[1] of the blog post ready. Please take a look.

1: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <lc...@google.com> wrote:

> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> For the 2.25 release the Java Direct, Flink, Jet, Samza, and Twister2
>>> runners will use SDF-powered Read transforms. Users can opt out with
>>> --experiments=use_deprecated_read.
>>
>> Huzzah! In our release notes we should maybe be clear about the
>> expectations for users:
>
> Done in https://github.com/apache/beam/pull/13015
>
>> - semantics are expected to be the same: file bugs for any change in
>>   results
>> - perf may vary: file bugs or write to user@
>>
>>> I was unable to get Spark done for 2.25, as I found out that Spark
>>> streaming doesn't support watermark holds[1]. If someone knows more
>>> about the watermark system in Spark I could use some guidance here,
>>> as I believe I have a version of unbounded SDF support written for
>>> Spark (I get all the expected output from tests; it's just that
>>> watermarks aren't being held back, so PAssert fails).
>>
>> Spark's watermarks are not comparable to Beam's. The rule as I
>> understand it is that any data that is later than `max(seen
>> timestamps) - allowedLateness` is dropped. One difference is that
>> dropping is relative to the watermark instead of to expiring windows,
>> like early versions of Beam. The other difference is that it tracks
>> the latest event (some call it a "high water mark" because it is the
>> highest datetime value seen), whereas Beam's watermark is an
>> approximation of the earliest (some call it a "low water mark"
>> because it is a guarantee that it will not dip lower). When I chatted
>> about this with Amit in the early days, it was necessary to implement
>> a Beam-style watermark using Spark state. I think that may still be
>> the case for correct results.
>
> In the Spark implementation I saw that watermark holds weren't wired up
> at all to control Spark's watermarks, and this was causing triggers to
> fire too early.
>
>>> Also, I started a doc[2] to produce an updated blog post, since the
>>> original SplittableDoFn blog from 2017 is out of date[3]. I was
>>> thinking of making this a new blog post and having the old blog post
>>> point to it. We could also remove the old blog post and/or update it.
>>> Any thoughts?
>>
>> New blog post w/ pointer from the old one.
>>
>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive
>>> Read expansion into each of the runners instead of having it within
>>> the Read transform in beam-sdks-java-core.
>>
>> Approved! I did CC a bunch of runner authors already. I think the
>> important thing is that if a default changes we should be sure
>> everyone is OK with the perf changes, and everyone is confident that
>> no incorrect results are produced. The abstractions between sdk-core,
>> runners-core-*, and individual runners are important to me:
>>
>> - The SDK's job is to produce a portable, un-tweaked pipeline, so
>>   moving flags out of SDK core (and IOs) ASAP is super important.
>> - The runner's job is to execute that pipeline, if it can, however it
>>   wants. If a runner wants to run Read transforms differently/directly,
>>   that is fine. If a runner is incapable of supporting SDF, then Read
>>   is better than nothing. Etc.
>> - The runners-core-* job is just to be internal libraries for runner
>>   authors to share code; they should not make any decisions about the
>>   Beam model, etc.
>>
>> Kenn
>>
>>> 1: https://github.com/apache/beam/pull/12603
>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>> 3: https://beam.apache.org/blog/splittable-do-fn/
>>> 4: https://github.com/apache/beam/pull/13006
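To make the high/low water mark distinction above concrete, here is a small
illustrative sketch in plain Java. It is not Beam or Spark code and every name
in it is invented for the example: a Spark-style estimate only tracks the
highest event timestamp seen (offset by the allowed lateness), while a
Beam-style output watermark must also respect the earliest outstanding hold,
which is why unwired holds let triggers fire too early.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    // Illustrative only: contrasts the two watermark notions discussed above.
    public class WatermarkSketch {
      private Instant maxSeenTimestamp = Instant.EPOCH;            // latest event time observed
      private final NavigableSet<Instant> holds = new TreeSet<>(); // outstanding watermark holds

      public void observe(Instant eventTime) {
        if (eventTime.isAfter(maxSeenTimestamp)) {
          maxSeenTimestamp = eventTime;
        }
      }

      public void addHold(Instant holdTime) { holds.add(holdTime); }

      public void removeHold(Instant holdTime) { holds.remove(holdTime); }

      // Spark-style "high water mark": only the latest timestamp seen matters;
      // anything older than this cutoff is dropped.
      public Instant sparkStyleCutoff(Duration allowedLateness) {
        return maxSeenTimestamp.minus(allowedLateness);
      }

      // Beam-style "low water mark": the output watermark may advance only up
      // to the earliest outstanding hold, which is what keeps downstream
      // triggers from firing before held-back output is produced.
      public Instant beamStyleOutputWatermark(Instant inputWatermark) {
        Instant earliestHold = holds.isEmpty() ? Instant.MAX : holds.first();
        return inputWatermark.isBefore(earliestHold) ? inputWatermark : earliestHold;
      }
    }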
>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org>
>>> wrote:
>>>
>>>> Thanks Luke! I've had a pass.
>>>>
>>>> -Max
>>>>
>>>> On 28.08.20 01:22, Luke Cwik wrote:
>>>>
>>>>> As an update:
>>>>>
>>>>> Direct and Twister2 are done.
>>>>> Samza: ready for review[1].
>>>>> Flink: almost ready for review. [2] lays all the groundwork for the
>>>>> migration and [3] finishes the migration (there is a timeout
>>>>> happening in FlinkSubmissionTest that I'm trying to figure out).
>>>>> No further updates on Spark[4] or Jet[5].
>>>>>
>>>>> @Maximilian Michels <m...@apache.org> or @t...@apache.org, can
>>>>> either of you take a look at the Flink PRs?
>>>>> @ke.wu...@icloud.com, since Xinyu delegated to you, can you take
>>>>> another look at the Samza PR?
>>>>>
>>>>> 1: https://github.com/apache/beam/pull/12617
>>>>> 2: https://github.com/apache/beam/pull/12706
>>>>> 3: https://github.com/apache/beam/pull/12708
>>>>> 4: https://github.com/apache/beam/pull/12603
>>>>> 5: https://github.com/apache/beam/pull/12616
>>>>>
>>>>> On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
>>>>> <pulasthi...@gmail.com> wrote:
>>>>>
>>>>>> Hi Luke,
>>>>>>
>>>>>> Will take a look at this as soon as possible and get back to you.
>>>>>>
>>>>>> Best Regards,
>>>>>> Pulasthi
>>>>>>
>>>>>> On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>>> I have made some good progress here and have gotten to the
>>>>>>> following state for non-portable runners:
>>>>>>>
>>>>>>> DirectRunner[1]: Merged. Supports Read.Bounded and Read.Unbounded.
>>>>>>> Twister2[2]: Ready for review. Supports Read.Bounded; the current
>>>>>>> runner doesn't support unbounded pipelines.
>>>>>>> Spark[3]: WIP. Supports Read.Bounded; the Nexmark suite passes.
>>>>>>> Not certain about the level of unbounded pipeline support
>>>>>>> coverage, since Spark uses its own tiny suite of tests to get
>>>>>>> unbounded pipeline coverage instead of the validates runner set.
>>>>>>> Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
>>>>>>> needs additional work.
>>>>>>> Samza[5]: WIP. Supports Read.Bounded. Not certain about the level
>>>>>>> of unbounded pipeline support coverage, since Samza uses its own
>>>>>>> tiny suite of tests to get unbounded pipeline coverage instead of
>>>>>>> the validates runner set.
>>>>>>> Flink: Unstarted.
>>>>>>>
>>>>>>> @Pulasthi Supun Wickramasinghe <pulasthi...@gmail.com>, can you
>>>>>>> help me with the Twister2 PR[2]?
>>>>>>> @Ismaël Mejía <ieme...@gmail.com>, is PR[3] the expected level of
>>>>>>> support for unbounded pipelines and hence ready for review?
>>>>>>> @Jozsef Bartok <jo...@hazelcast.com>, can you help me out to get
>>>>>>> support for unbounded splittable DoFns into Jet[4]?
>>>>>>> @Xinyu Liu <xinyuliu...@gmail.com>, is PR[5] the expected level of
>>>>>>> support for unbounded pipelines and hence ready for review?
>>>>>>>
>>>>>>> 1: https://github.com/apache/beam/pull/12519
>>>>>>> 2: https://github.com/apache/beam/pull/12594
>>>>>>> 3: https://github.com/apache/beam/pull/12603
>>>>>>> 4: https://github.com/apache/beam/pull/12616
>>>>>>> 5: https://github.com/apache/beam/pull/12617
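As background for the runner work above, here is a minimal sketch of what a
bounded-per-element splittable DoFn looks like with the Java SDK's SDF
annotations. The DoFn itself (EmitRangeFn) and what it emits are made up for
illustration; only the Beam classes and annotations are real. It emits the
numbers [0, n) for each input n, claiming offsets through a RestrictionTracker
so a runner can split or checkpoint the remaining range.

    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

    // Hypothetical example: emits the numbers [0, n) for each input element n.
    @DoFn.BoundedPerElement
    public class EmitRangeFn extends DoFn<Long, Long> {

      @GetInitialRestriction
      public OffsetRange getInitialRestriction(@Element Long n) {
        return new OffsetRange(0, n);
      }

      @NewTracker
      public OffsetRangeTracker newTracker(@Restriction OffsetRange restriction) {
        return new OffsetRangeTracker(restriction);
      }

      @ProcessElement
      public void processElement(
          RestrictionTracker<OffsetRange, Long> tracker, OutputReceiver<Long> out) {
        // Claim one offset at a time; the runner may split the rest of the
        // range at any point between claims.
        for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); i++) {
          out.output(i);
        }
      }
    }

It is applied like any other DoFn, e.g. input.apply(ParDo.of(new
EmitRangeFn())); the wrappers discussed in this thread take a broadly similar
route to run existing BoundedSource/UnboundedSource implementations as SDFs.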
>>>>>>> On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There shouldn't be any changes required, since the wrapper will
>>>>>>>> smoothly transition the execution to be run as an SDF. New IOs
>>>>>>>> should strongly prefer to use SDF, since it should be simpler to
>>>>>>>> write and will be more flexible, but they can still use the
>>>>>>>> "*Source"-based APIs. Eventually we'll deprecate those APIs, but
>>>>>>>> we will never stop supporting them. Eventually they should all be
>>>>>>>> migrated to use SDF, and if there is another major Beam version,
>>>>>>>> we'll finally be able to remove them.
>>>>>>>>
>>>>>>>> On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
>>>>>>>> <aromanenko....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Luke,
>>>>>>>>>
>>>>>>>>> Great to hear about such progress on this!
>>>>>>>>>
>>>>>>>>> Talking about opt-out for all runners in the future, will it
>>>>>>>>> require any code changes for the current "*Source"-based IOs, or
>>>>>>>>> should the wrappers completely smooth this transition?
>>>>>>>>> Do we need to require that new IOs be based only on SDF, or,
>>>>>>>>> again, should the wrappers help to avoid this?
>>>>>>>>>
>>>>>>>>> On 10 Aug 2020, at 22:59, Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> In the past couple of months, wrappers[1, 2] have been added to
>>>>>>>>>> the Beam Java SDK which can execute BoundedSource and
>>>>>>>>>> UnboundedSource as splittable DoFns. These have been opt-out
>>>>>>>>>> for portable pipelines (e.g. Dataflow runner v2, XLang
>>>>>>>>>> pipelines on Flink/Spark) and opt-in using an experiment for
>>>>>>>>>> all other pipelines.
>>>>>>>>>>
>>>>>>>>>> I would like to start making the non-portable pipelines,
>>>>>>>>>> starting with the DirectRunner[3], opt-out as well, with the
>>>>>>>>>> plan that eventually all runners will only execute splittable
>>>>>>>>>> DoFns and the BoundedSource/UnboundedSource-specific execution
>>>>>>>>>> logic will be removed from the runners.
>>>>>>>>>>
>>>>>>>>>> Users will be able to opt in any pipeline using the experiment
>>>>>>>>>> 'use_sdf_read' and opt out with the experiment
>>>>>>>>>> 'use_deprecated_read'. (For portable pipelines these
>>>>>>>>>> experiments were 'beam_fn_api' and
>>>>>>>>>> 'beam_fn_api_use_deprecated_read' respectively, and I have
>>>>>>>>>> added these two additional aliases to make the experience less
>>>>>>>>>> confusing.)
>>>>>>>>>>
>>>>>>>>>> 1: https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>>>>>>>>>> 2: https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>>>>>>>>>> 3: https://github.com/apache/beam/pull/12519
>>>>>>
>>>>>> --
>>>>>> Pulasthi S. Wickramasinghe
>>>>>> PhD Candidate | Research Assistant
>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>> Indiana University, Bloomington
>>>>>> cell: 224-386-9035
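For anyone who wants to try this out, here is a minimal sketch of toggling
these experiments from Java. The class name and the bare pipeline are only for
illustration; PipelineOptionsFactory and ExperimentalOptions are the SDK's
standard options classes.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class ReadExperimentExample {
      public static void main(String[] args) {
        // Same effect as passing --experiments=use_deprecated_read on the
        // command line; swap in use_sdf_read to opt in instead.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs("--experiments=use_deprecated_read").create();

        // Programmatic alternative (replaces any experiments already set):
        // options.as(org.apache.beam.sdk.options.ExperimentalOptions.class)
        //     .setExperiments(java.util.Arrays.asList("use_deprecated_read"));

        Pipeline pipeline = Pipeline.create(options);
        // ... build the pipeline; Read-based sources will use the legacy
        // expansion wherever the runner still supports it ...
        pipeline.run().waitUntilFinish();
      }
    }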