On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <lc...@google.com> wrote: > For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will use > SDF powered Read transforms. Users can opt-out > with --experiments=use_deprecated_read. >
Huzzah! In our release notes maybe be clear about the expectations for users: - semantics are expected to be the same: file bugs for any change in results - perf may vary: file bugs or write to user@ I was unable to get Spark done for 2.25 as I found out that Spark streaming > doesn't support watermark holds[1]. If someone knows more about the > watermark system in Spark I could use some guidance here as I believe I > have a version of unbounded SDF support written for Spark (I get all the > expected output from tests, just that watermarks aren't being held back so > PAssert fails). > Spark's watermarks are not comparable to Beam's. The rule as I understand it is that any data that is later than `max(seen timestamps) - allowedLateness` is dropped. One difference is that dropping is relative to the watermark instead of expiring windows, like early versions of Beam. The other difference is that it track the latest event (some call it a "high water mark" because it is the highest datetime value seen) where Beam's watermark is an approximation of the earliest (some call it a "low water mark" because it is a guarantee that it will not dip lower). When I chatted about this with Amit in the early days, it was necessary to implement a Beam-style watermark using Spark state. I think that may still be the case, for correct results. Also, I started a doc[2] to produce an updated blog post since the original > SplittableDoFn blog from 2017 is out of date[3]. I was thinking of making > this a new blog post and having the old blog post point to it. We could > also remove the old blog post and or update it. Any thoughts? > New blog post w/ pointer from the old one. Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read > expansion into each of the runners instead of having it within Read > transform within beam-sdks-java-core. > Approved! I did CC a bunch of runner authors already. I think the important thing is if a default changes we should be sure everyone is OK with the perf changes, and everyone is confident that no incorrect results are produced. The abstractions between sdk-core, runners-core-*, and individual runners is important to me: - The SDK's job is to produce a portable, un-tweaked pipeline so moving flags out of SDK core (and IOs) ASAP is super important. - The runner's job is to execute that pipeline, if they can, however they want. If a runner wants to run Read transforms differently/directly that is fine. If a runner is incapable of supporting SDF, then Read is better than nothing. Etc. - The runners-core-* job is to just be internal libraries for runner authors to share code, and should not make any decisions about the Beam model, etc. Kenn 1: https://github.com/apache/beam/pull/12603 > 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE > 3: https://beam.apache.org/blog/splittable-do-fn/ > 4: https://github.com/apache/beam/pull/13006 > > > On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org> wrote: > >> Thanks Luke! I've had a pass. >> >> -Max >> >> On 28.08.20 01:22, Luke Cwik wrote: >> > As an update. >> > >> > Direct and Twister2 are done. >> > Samza: is ready for review[1]. >> > Flink: is almost ready for review. [2] lays all the groundwork for the >> > migration and [3] finishes the migration (there is a timeout happening >> > in FlinkSubmissionTest that I'm trying to figure out). >> > No further updates on Spark[4] or Jet[5]. >> > >> > @Maximilian Michels <mailto:m...@apache.org> or @t...@apache.org >> > <mailto:thomas.we...@gmail.com>, can either of you take a look at the >> > Flink PRs? >> > @ke.wu...@icloud.com <mailto:ke.wu...@icloud.com>, Since Xinyu >> delegated >> > to you, can you take another look at the Samza PR? >> > >> > 1: https://github.com/apache/beam/pull/12617 >> > 2: https://github.com/apache/beam/pull/12706 >> > 3: https://github.com/apache/beam/pull/12708 >> > 4: https://github.com/apache/beam/pull/12603 >> > 5: https://github.com/apache/beam/pull/12616 >> > >> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe >> > <pulasthi...@gmail.com <mailto:pulasthi...@gmail.com>> wrote: >> > >> > Hi Luke >> > >> > Will take a look at this as soon as possible and get back to you. >> > >> > Best Regards, >> > Pulasthi >> > >> > On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com >> > <mailto:lc...@google.com>> wrote: >> > >> > I have made some good progress here and have gotten to the >> > following state for non-portable runners: >> > >> > DirectRunner[1]: Merged. Supports Read.Bounded and >> Read.Unbounded. >> > Twister2[2]: Ready for review. Supports Read.Bounded, the >> > current runner doesn't support unbounded pipelines. >> > Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not >> > certain about level of unbounded pipeline support coverage since >> > Spark uses its own tiny suite of tests to get unbounded pipeline >> > coverage instead of the validates runner set. >> > Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely >> > needs additional work. >> > Sazma[5]: WIP. Supports Read.Bounded. Not certain about level of >> > unbounded pipeline support coverage since Spark uses its own >> > tiny suite of tests to get unbounded pipeline coverage instead >> > of the validates runner set. >> > Flink: Unstarted. >> > >> > @Pulasthi Supun Wickramasinghe <mailto:pulasthi...@gmail.com> , >> > can you help me with the Twister2 PR[2]? >> > @Ismaël Mejía <mailto:ieme...@gmail.com>, is PR[3] the expected >> > level of support for unbounded pipelines and hence ready for >> review? >> > @Jozsef Bartok <mailto:jo...@hazelcast.com>, can you help me >> out >> > to get support for unbounded splittable DoFn's into Jet[4]? >> > @Xinyu Liu <mailto:xinyuliu...@gmail.com>, is PR[5] the >> expected >> > level of support for unbounded pipelines and hence ready for >> review? >> > >> > 1: https://github.com/apache/beam/pull/12519 >> > 2: https://github.com/apache/beam/pull/12594 >> > 3: https://github.com/apache/beam/pull/12603 >> > 4: https://github.com/apache/beam/pull/12616 >> > 5: https://github.com/apache/beam/pull/12617 >> > >> > On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com >> > <mailto:lc...@google.com>> wrote: >> > >> > There shouldn't be any changes required since the wrapper >> > will smoothly transition the execution to be run as an SDF. >> > New IOs should strongly prefer to use SDF since it should be >> > simpler to write and will be more flexible but they can use >> > the "*Source"-based APIs. Eventually we'll deprecate the >> > APIs but we will never stop supporting them. Eventually they >> > should all be migrated to use SDF and if there is another >> > major Beam version, we'll finally be able to remove them. >> > >> > On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko >> > <aromanenko....@gmail.com <mailto:aromanenko....@gmail.com >> >> >> > wrote: >> > >> > Hi Luke, >> > >> > Great to hear about such progress on this! >> > >> > Talking about opt-out for all runners in the future, >> > will it require any code changes for current >> > “*Source”-based IOs or the wrappers should completely >> > smooth this transition? >> > Do we need to require to create new IOs only based on >> > SDF or again, the wrappers should help to avoid this? >> > >> >> On 10 Aug 2020, at 22:59, Luke Cwik <lc...@google.com >> >> <mailto:lc...@google.com>> wrote: >> >> >> >> In the past couple of months wrappers[1, 2] have been >> >> added to the Beam Java SDK which can execute >> >> BoundedSource and UnboundedSource as Splittable DoFns. >> >> These have been opt-out for portable pipelines (e.g. >> >> Dataflow runner v2, XLang pipelines on Flink/Spark) >> >> and opt-in using an experiment for all other pipelines. >> >> >> >> I would like to start making the non-portable >> >> pipelines starting with the DirectRunner[3] to be >> >> opt-out with the plan that eventually all runners will >> >> only execute splittable DoFns and the >> >> BoundedSource/UnboundedSource specific execution logic >> >> from the runners will be removed. >> >> >> >> Users will be able to opt-in any pipeline using the >> >> experiment 'use_sdf_read' and opt-out with the >> >> experiment 'use_deprecated_read'. (For portable >> >> pipelines these experiments were 'beam_fn_api' and >> >> 'beam_fn_api_use_deprecated_read' respectively and I >> >> have added these two additional aliases to make the >> >> experience less confusing). >> >> >> >> 1: >> >> >> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275 >> >> 2: >> >> >> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449 >> >> 3: https://github.com/apache/beam/pull/12519 >> > >> > >> > >> > -- >> > Pulasthi S. Wickramasinghe >> > PhD Candidate | Research Assistant >> > School of Informatics and Computing | Digital Science Center >> > Indiana University, Bloomington >> > cell: 224-386-9035 <(224)%20386-9035> <tel:(224)%20386-9035> >> > >> >