For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will use SDF powered Read transforms. Users can opt-out with --experiments=use_deprecated_read.
I was unable to get Spark done for 2.25 as I found out that Spark streaming doesn't support watermark holds[1]. If someone knows more about the watermark system in Spark I could use some guidance here as I believe I have a version of unbounded SDF support written for Spark (I get all the expected output from tests, just that watermarks aren't being held back so PAssert fails). Also, I started a doc[2] to produce an updated blog post since the original SplittableDoFn blog from 2017 is out of date[3]. I was thinking of making this a new blog post and having the old blog post point to it. We could also remove the old blog post and or update it. Any thoughts? Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read expansion into each of the runners instead of having it within Read transform within beam-sdks-java-core. 1: https://github.com/apache/beam/pull/12603 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE 3: https://beam.apache.org/blog/splittable-do-fn/ 4: https://github.com/apache/beam/pull/13006 On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org> wrote: > Thanks Luke! I've had a pass. > > -Max > > On 28.08.20 01:22, Luke Cwik wrote: > > As an update. > > > > Direct and Twister2 are done. > > Samza: is ready for review[1]. > > Flink: is almost ready for review. [2] lays all the groundwork for the > > migration and [3] finishes the migration (there is a timeout happening > > in FlinkSubmissionTest that I'm trying to figure out). > > No further updates on Spark[4] or Jet[5]. > > > > @Maximilian Michels <mailto:m...@apache.org> or @t...@apache.org > > <mailto:thomas.we...@gmail.com>, can either of you take a look at the > > Flink PRs? > > @ke.wu...@icloud.com <mailto:ke.wu...@icloud.com>, Since Xinyu > delegated > > to you, can you take another look at the Samza PR? > > > > 1: https://github.com/apache/beam/pull/12617 > > 2: https://github.com/apache/beam/pull/12706 > > 3: https://github.com/apache/beam/pull/12708 > > 4: https://github.com/apache/beam/pull/12603 > > 5: https://github.com/apache/beam/pull/12616 > > > > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe > > <pulasthi...@gmail.com <mailto:pulasthi...@gmail.com>> wrote: > > > > Hi Luke > > > > Will take a look at this as soon as possible and get back to you. > > > > Best Regards, > > Pulasthi > > > > On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com > > <mailto:lc...@google.com>> wrote: > > > > I have made some good progress here and have gotten to the > > following state for non-portable runners: > > > > DirectRunner[1]: Merged. Supports Read.Bounded and > Read.Unbounded. > > Twister2[2]: Ready for review. Supports Read.Bounded, the > > current runner doesn't support unbounded pipelines. > > Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not > > certain about level of unbounded pipeline support coverage since > > Spark uses its own tiny suite of tests to get unbounded pipeline > > coverage instead of the validates runner set. > > Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely > > needs additional work. > > Sazma[5]: WIP. Supports Read.Bounded. Not certain about level of > > unbounded pipeline support coverage since Spark uses its own > > tiny suite of tests to get unbounded pipeline coverage instead > > of the validates runner set. > > Flink: Unstarted. > > > > @Pulasthi Supun Wickramasinghe <mailto:pulasthi...@gmail.com> , > > can you help me with the Twister2 PR[2]? > > @Ismaël Mejía <mailto:ieme...@gmail.com>, is PR[3] the expected > > level of support for unbounded pipelines and hence ready for > review? > > @Jozsef Bartok <mailto:jo...@hazelcast.com>, can you help me out > > to get support for unbounded splittable DoFn's into Jet[4]? > > @Xinyu Liu <mailto:xinyuliu...@gmail.com>, is PR[5] the expected > > level of support for unbounded pipelines and hence ready for > review? > > > > 1: https://github.com/apache/beam/pull/12519 > > 2: https://github.com/apache/beam/pull/12594 > > 3: https://github.com/apache/beam/pull/12603 > > 4: https://github.com/apache/beam/pull/12616 > > 5: https://github.com/apache/beam/pull/12617 > > > > On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com > > <mailto:lc...@google.com>> wrote: > > > > There shouldn't be any changes required since the wrapper > > will smoothly transition the execution to be run as an SDF. > > New IOs should strongly prefer to use SDF since it should be > > simpler to write and will be more flexible but they can use > > the "*Source"-based APIs. Eventually we'll deprecate the > > APIs but we will never stop supporting them. Eventually they > > should all be migrated to use SDF and if there is another > > major Beam version, we'll finally be able to remove them. > > > > On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko > > <aromanenko....@gmail.com <mailto:aromanenko....@gmail.com>> > > wrote: > > > > Hi Luke, > > > > Great to hear about such progress on this! > > > > Talking about opt-out for all runners in the future, > > will it require any code changes for current > > “*Source”-based IOs or the wrappers should completely > > smooth this transition? > > Do we need to require to create new IOs only based on > > SDF or again, the wrappers should help to avoid this? > > > >> On 10 Aug 2020, at 22:59, Luke Cwik <lc...@google.com > >> <mailto:lc...@google.com>> wrote: > >> > >> In the past couple of months wrappers[1, 2] have been > >> added to the Beam Java SDK which can execute > >> BoundedSource and UnboundedSource as Splittable DoFns. > >> These have been opt-out for portable pipelines (e.g. > >> Dataflow runner v2, XLang pipelines on Flink/Spark) > >> and opt-in using an experiment for all other pipelines. > >> > >> I would like to start making the non-portable > >> pipelines starting with the DirectRunner[3] to be > >> opt-out with the plan that eventually all runners will > >> only execute splittable DoFns and the > >> BoundedSource/UnboundedSource specific execution logic > >> from the runners will be removed. > >> > >> Users will be able to opt-in any pipeline using the > >> experiment 'use_sdf_read' and opt-out with the > >> experiment 'use_deprecated_read'. (For portable > >> pipelines these experiments were 'beam_fn_api' and > >> 'beam_fn_api_use_deprecated_read' respectively and I > >> have added these two additional aliases to make the > >> experience less confusing). > >> > >> 1: > >> > https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275 > >> 2: > >> > https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449 > >> 3: https://github.com/apache/beam/pull/12519 > > > > > > > > -- > > Pulasthi S. Wickramasinghe > > PhD Candidate | Research Assistant > > School of Informatics and Computing | Digital Science Center > > Indiana University, Bloomington > > cell: 224-386-9035 <(224)%20386-9035> <tel:(224)%20386-9035> > > >