Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Kenneth Knowles Mon, 05 Oct 2020 15:46:13 -0700

On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <[email protected]> wrote:

> For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will use
> SDF powered Read transforms. Users can opt-out
> with --experiments=use_deprecated_read.
>


Huzzah! In our release notes maybe be clear about the expectations for
users:

 - semantics are expected to be the same: file bugs for any change in
results
 - perf may vary: file bugs or write to user@

I was unable to get Spark done for 2.25 as I found out that Spark streaming
> doesn't support watermark holds[1]. If someone knows more about the
> watermark system in Spark I could use some guidance here as I believe I
> have a version of unbounded SDF support written for Spark (I get all the
> expected output from tests, just that watermarks aren't being held back so
> PAssert fails).
>

Spark's watermarks are not comparable to Beam's. The rule as I understand
it is that any data that is later than `max(seen timestamps) -
allowedLateness` is dropped. One difference is that dropping is relative to
the watermark instead of expiring windows, like early versions of Beam. The
other difference is that it track the latest event (some call it a "high
water mark" because it is the highest datetime value seen) where Beam's
watermark is an approximation of the earliest (some call it a "low water
mark" because it is a guarantee that it will not dip lower). When I chatted
about this with Amit in the early days, it was necessary to implement a
Beam-style watermark using Spark state. I think that may still be the case,
for correct results.

Also, I started a doc[2] to produce an updated blog post since the original
> SplittableDoFn blog from 2017 is out of date[3]. I was thinking of making
> this a new blog post and having the old blog post point to it. We could
> also remove the old blog post and or update it. Any thoughts?
>

New blog post w/ pointer from the old one.

Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
> expansion into each of the runners instead of having it within Read
> transform within beam-sdks-java-core.
>

Approved! I did CC a bunch of runner authors already. I think the important
thing is if a default changes we should be sure everyone is OK with the
perf changes, and everyone is confident that no incorrect results are
produced. The abstractions between sdk-core, runners-core-*, and individual
runners is important to me:

 - The SDK's job is to produce a portable, un-tweaked pipeline so moving
flags out of SDK core (and IOs) ASAP is super important.
 - The runner's job is to execute that pipeline, if they can, however they
want. If a runner wants to run Read transforms differently/directly that is
fine. If a runner is incapable of supporting SDF, then Read is better than
nothing. Etc.
 - The runners-core-* job is to just be internal libraries for runner
authors to share code, and should not make any decisions about the Beam
model, etc.

Kenn

1: https://github.com/apache/beam/pull/12603
> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
> 3: https://beam.apache.org/blog/splittable-do-fn/
> 4: https://github.com/apache/beam/pull/13006
>
>
> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <[email protected]> wrote:
>
>> Thanks Luke! I've had a pass.
>>
>> -Max
>>
>> On 28.08.20 01:22, Luke Cwik wrote:
>> > As an update.
>> >
>> > Direct and Twister2 are done.
>> > Samza: is ready for review[1].
>> > Flink: is almost ready for review. [2] lays all the groundwork for the
>> > migration and [3] finishes the migration (there is a timeout happening
>> > in FlinkSubmissionTest that I'm trying to figure out).
>> > No further updates on Spark[4] or Jet[5].
>> >
>> > @Maximilian Michels <mailto:[email protected]> or @[email protected]
>> > <mailto:[email protected]>, can either of you take a look at the
>> > Flink PRs?
>> > @[email protected] <mailto:[email protected]>, Since Xinyu
>> delegated
>> > to you, can you take another look at the Samza PR?
>> >
>> > 1: https://github.com/apache/beam/pull/12617
>> > 2: https://github.com/apache/beam/pull/12706
>> > 3: https://github.com/apache/beam/pull/12708
>> > 4: https://github.com/apache/beam/pull/12603
>> > 5: https://github.com/apache/beam/pull/12616
>> >
>> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
>> > <[email protected] <mailto:[email protected]>> wrote:
>> >
>> >     Hi Luke
>> >
>> >     Will take a look at this as soon as possible and get back to you.
>> >
>> >     Best Regards,
>> >     Pulasthi
>> >
>> >     On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <[email protected]
>> >     <mailto:[email protected]>> wrote:
>> >
>> >         I have made some good progress here and have gotten to the
>> >         following state for non-portable runners:
>> >
>> >         DirectRunner[1]: Merged. Supports Read.Bounded and
>> Read.Unbounded.
>> >         Twister2[2]: Ready for review. Supports Read.Bounded, the
>> >         current runner doesn't support unbounded pipelines.
>> >         Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes. Not
>> >         certain about level of unbounded pipeline support coverage since
>> >         Spark uses its own tiny suite of tests to get unbounded pipeline
>> >         coverage instead of the validates runner set.
>> >         Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
>> >         needs additional work.
>> >         Sazma[5]: WIP. Supports Read.Bounded. Not certain about level of
>> >         unbounded pipeline support coverage since Spark uses its own
>> >         tiny suite of tests to get unbounded pipeline coverage instead
>> >         of the validates runner set.
>> >         Flink: Unstarted.
>> >
>> >         @Pulasthi Supun Wickramasinghe <mailto:[email protected]> ,
>> >         can you help me with the Twister2 PR[2]?
>> >         @Ismaël Mejía <mailto:[email protected]>, is PR[3] the expected
>> >         level of support for unbounded pipelines and hence ready for
>> review?
>> >         @Jozsef Bartok <mailto:[email protected]>, can you help me
>> out
>> >         to get support for unbounded splittable DoFn's into Jet[4]?
>> >         @Xinyu Liu <mailto:[email protected]>, is PR[5] the
>> expected
>> >         level of support for unbounded pipelines and hence ready for
>> review?
>> >
>> >         1: https://github.com/apache/beam/pull/12519
>> >         2: https://github.com/apache/beam/pull/12594
>> >         3: https://github.com/apache/beam/pull/12603
>> >         4: https://github.com/apache/beam/pull/12616
>> >         5: https://github.com/apache/beam/pull/12617
>> >
>> >         On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <[email protected]
>> >         <mailto:[email protected]>> wrote:
>> >
>> >             There shouldn't be any changes required since the wrapper
>> >             will smoothly transition the execution to be run as an SDF.
>> >             New IOs should strongly prefer to use SDF since it should be
>> >             simpler to write and will be more flexible but they can use
>> >             the "*Source"-based APIs. Eventually we'll deprecate the
>> >             APIs but we will never stop supporting them. Eventually they
>> >             should all be migrated to use SDF and if there is another
>> >             major Beam version, we'll finally be able to remove them.
>> >
>> >             On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
>> >             <[email protected] <mailto:[email protected]
>> >>
>> >             wrote:
>> >
>> >                 Hi Luke,
>> >
>> >                 Great to hear about such progress on this!
>> >
>> >                 Talking about opt-out for all runners in the future,
>> >                 will it require any code changes for current
>> >                 “*Source”-based IOs or the wrappers should completely
>> >                 smooth this transition?
>> >                 Do we need to require to create new IOs only based on
>> >                 SDF or again, the wrappers should help to avoid this?
>> >
>> >>                 On 10 Aug 2020, at 22:59, Luke Cwik <[email protected]
>> >>                 <mailto:[email protected]>> wrote:
>> >>
>> >>                 In the past couple of months wrappers[1, 2] have been
>> >>                 added to the Beam Java SDK which can execute
>> >>                 BoundedSource and UnboundedSource as Splittable DoFns.
>> >>                 These have been opt-out for portable pipelines (e.g.
>> >>                 Dataflow runner v2, XLang pipelines on Flink/Spark)
>> >>                 and opt-in using an experiment for all other pipelines.
>> >>
>> >>                 I would like to start making the non-portable
>> >>                 pipelines starting with the DirectRunner[3] to be
>> >>                 opt-out with the plan that eventually all runners will
>> >>                 only execute splittable DoFns and the
>> >>                 BoundedSource/UnboundedSource specific execution logic
>> >>                 from the runners will be removed.
>> >>
>> >>                 Users will be able to opt-in any pipeline using the
>> >>                 experiment 'use_sdf_read' and opt-out with the
>> >>                 experiment 'use_deprecated_read'. (For portable
>> >>                 pipelines these experiments were 'beam_fn_api' and
>> >>                 'beam_fn_api_use_deprecated_read' respectively and I
>> >>                 have added these two additional aliases to make the
>> >>                 experience less confusing).
>> >>
>> >>                 1:
>> >>
>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>> >>                 2:
>> >>
>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>> >>                 3: https://github.com/apache/beam/pull/12519
>> >
>> >
>> >
>> >     --
>> >     Pulasthi S. Wickramasinghe
>> >     PhD Candidate  | Research Assistant
>> >     School of Informatics and Computing | Digital Science Center
>> >     Indiana University, Bloomington
>> >     cell: 224-386-9035 <(224)%20386-9035> <tel:(224)%20386-9035>
>> >
>>
>

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Reply via email to