I have a draft[1] of the blog post ready. Please take a look.

1:
http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo

On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <lc...@google.com> wrote:

>
>
> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <k...@apache.org> wrote:
>
>>
>>
>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> For the 2.25 release, the Java Direct, Flink, Jet, Samza, and Twister2
>>> runners will use SDF-powered Read transforms. Users can opt out
>>> with --experiments=use_deprecated_read.
>>>
>>
>> Huzzah! In our release notes we should be clear about the expectations
>> for users:
>>
>> Done in https://github.com/apache/beam/pull/13015
>
>
>>  - semantics are expected to be the same: file bugs for any change in
>> results
>>  - perf may vary: file bugs or write to user@
>>
>>> I was unable to get Spark done for 2.25 as I found out that Spark
>>> streaming doesn't support watermark holds[1]. If someone knows more about
>>> the watermark system in Spark I could use some guidance here, as I believe
>>> I have a version of unbounded SDF support written for Spark (I get all the
>>> expected output from tests; it's just that watermarks aren't being held
>>> back, so PAssert fails).
>>>
>>
>> Spark's watermarks are not comparable to Beam's. The rule as I understand
>> it is that any data later than `max(seen timestamps) - allowedLateness`
>> is dropped. One difference is that dropping is relative to the watermark
>> rather than to expiring windows, as in early versions of Beam. The other
>> difference is that Spark tracks the latest event (some call it a "high
>> water mark" because it is the highest datetime value seen), whereas Beam's
>> watermark is an approximation of the earliest (some call it a "low water
>> mark" because it is a guarantee that it will not dip lower). When I chatted
>> about this with Amit in the early days, it was necessary to implement a
>> Beam-style watermark using Spark state. I think that may still be the case
>> for correct results.
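To make the distinction concrete, here is a small, self-contained Java sketch. This is illustrative only (neither Spark's nor Beam's actual API): it contrasts a Spark-style "high" watermark that trails the maximum seen timestamp by allowedLateness and drops anything behind it, with a Beam-style "low" watermark whose advance can be constrained by a hold so that triggers don't fire early.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not Spark or Beam code, just the two watermark semantics.
public class Watermarks {
  // Spark-style: cutoff = max(seen timestamps) - allowedLateness; later data is dropped.
  static List<Long> sparkStyleKeep(List<Long> events, long allowedLateness) {
    List<Long> kept = new ArrayList<>();
    long maxSeen = Long.MIN_VALUE;
    for (long ts : events) {
      maxSeen = Math.max(maxSeen, ts);
      if (ts >= maxSeen - allowedLateness) {
        kept.add(ts); // not behind the high-water-mark cutoff
      }
    }
    return kept;
  }

  // Beam-style: the output watermark is a lower bound, further constrained by holds.
  static long beamStyleOutputWatermark(long inputWatermark, long hold) {
    return Math.min(inputWatermark, hold); // a hold keeps the watermark from advancing
  }

  public static void main(String[] args) {
    // An element with timestamp 5 arriving after 100 is dropped: 5 < 100 - 10.
    System.out.println(sparkStyleKeep(List.of(1L, 100L, 5L), 10)); // [1, 100]
    // A hold at t=20 prevents the output watermark from reaching the input watermark.
    System.out.println(beamStyleOutputWatermark(100, 20)); // 20
  }
}
```

Without the second mechanism wired up, downstream triggers see the watermark advance past data that is still pending, which is consistent with the early firings described below.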
>>
>>
> In the Spark implementation I saw that watermark holds weren't wired up
> at all to control Spark's watermarks, and this was causing triggers to
> fire too early.
>
>
>>> Also, I started a doc[2] to produce an updated blog post since the
>>> original SplittableDoFn blog from 2017 is out of date[3]. I was thinking
>>> of making this a new blog post and having the old blog post point to it.
>>> We could also remove or update the old blog post. Any thoughts?
>>>
>>
>> New blog post w/ pointer from the old one.
>>
>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
>>> expansion into each of the runners instead of having it within the Read
>>> transform in beam-sdks-java-core.
>>>
>>
>> Approved! I did CC a bunch of runner authors already. I think the
>> important thing is that if a default changes, we should be sure everyone
>> is OK with the perf changes and everyone is confident that no incorrect
>> results are produced. The abstractions between sdk-core, runners-core-*,
>> and individual runners are important to me:
>>
>>  - The SDK's job is to produce a portable, un-tweaked pipeline so moving
>> flags out of SDK core (and IOs) ASAP is super important.
>>  - The runner's job is to execute that pipeline, if they can, however
>> they want. If a runner wants to run Read transforms differently/directly
>> that is fine. If a runner is incapable of supporting SDF, then Read is
>> better than nothing. Etc.
>>  - The runners-core-* job is just to be internal libraries for runner
>> authors to share code; they should not make any decisions about the Beam
>> model, etc.
>>
>> Kenn
>>
>>> 1: https://github.com/apache/beam/pull/12603
>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>> 3: https://beam.apache.org/blog/splittable-do-fn/
>>> 4: https://github.com/apache/beam/pull/13006
>>>
>>>
>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org>
>>> wrote:
>>>
>>>> Thanks Luke! I've had a pass.
>>>>
>>>> -Max
>>>>
>>>> On 28.08.20 01:22, Luke Cwik wrote:
>>>> > As an update.
>>>> >
>>>> > Direct and Twister2 are done.
>>>> > Samza: is ready for review[1].
>>>> > Flink: is almost ready for review. [2] lays all the groundwork for
>>>> the
>>>> > migration and [3] finishes the migration (there is a timeout
>>>> happening
>>>> > in FlinkSubmissionTest that I'm trying to figure out).
>>>> > No further updates on Spark[4] or Jet[5].
>>>> >
>>>> > @Maximilian Michels <mailto:m...@apache.org> or @t...@apache.org
>>>> > <mailto:thomas.we...@gmail.com>, can either of you take a look at
>>>> the
>>>> > Flink PRs?
>>>> > @ke.wu...@icloud.com <mailto:ke.wu...@icloud.com>, Since Xinyu
>>>> delegated
>>>> > to you, can you take another look at the Samza PR?
>>>> >
>>>> > 1: https://github.com/apache/beam/pull/12617
>>>> > 2: https://github.com/apache/beam/pull/12706
>>>> > 3: https://github.com/apache/beam/pull/12708
>>>> > 4: https://github.com/apache/beam/pull/12603
>>>> > 5: https://github.com/apache/beam/pull/12616
>>>> >
>>>> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
>>>> > <pulasthi...@gmail.com <mailto:pulasthi...@gmail.com>> wrote:
>>>> >
>>>> >     Hi Luke
>>>> >
>>>> >     Will take a look at this as soon as possible and get back to you.
>>>> >
>>>> >     Best Regards,
>>>> >     Pulasthi
>>>> >
>>>> >     On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com
>>>> >     <mailto:lc...@google.com>> wrote:
>>>> >
>>>> >         I have made some good progress here and have gotten to the
>>>> >         following state for non-portable runners:
>>>> >
>>>> >         DirectRunner[1]: Merged. Supports Read.Bounded and
>>>> >         Read.Unbounded.
>>>> >         Twister2[2]: Ready for review. Supports Read.Bounded; the
>>>> >         current runner doesn't support unbounded pipelines.
>>>> >         Spark[3]: WIP. Supports Read.Bounded; the Nexmark suite
>>>> >         passes. Not certain about the level of unbounded pipeline
>>>> >         support coverage since Spark uses its own tiny suite of tests
>>>> >         for unbounded pipeline coverage instead of the ValidatesRunner
>>>> >         set.
>>>> >         Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
>>>> >         needs additional work.
>>>> >         Samza[5]: WIP. Supports Read.Bounded. Not certain about the
>>>> >         level of unbounded pipeline support coverage since Samza uses
>>>> >         its own tiny suite of tests for unbounded pipeline coverage
>>>> >         instead of the ValidatesRunner set.
>>>> >         Flink: Unstarted.
>>>> >
>>>> >         @Pulasthi Supun Wickramasinghe <mailto:pulasthi...@gmail.com>,
>>>> >         can you help me with the Twister2 PR[2]?
>>>> >         @Ismaël Mejía <mailto:ieme...@gmail.com>, is PR[3] the
>>>> expected
>>>> >         level of support for unbounded pipelines and hence ready for
>>>> review?
>>>> >         @Jozsef Bartok <mailto:jo...@hazelcast.com>, can you help me
>>>> out
>>>> >         to get support for unbounded splittable DoFn's into Jet[4]?
>>>> >         @Xinyu Liu <mailto:xinyuliu...@gmail.com>, is PR[5] the
>>>> expected
>>>> >         level of support for unbounded pipelines and hence ready for
>>>> review?
>>>> >
>>>> >         1: https://github.com/apache/beam/pull/12519
>>>> >         2: https://github.com/apache/beam/pull/12594
>>>> >         3: https://github.com/apache/beam/pull/12603
>>>> >         4: https://github.com/apache/beam/pull/12616
>>>> >         5: https://github.com/apache/beam/pull/12617
>>>> >
>>>> >         On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <lc...@google.com
>>>> >         <mailto:lc...@google.com>> wrote:
>>>> >
>>>> >             There shouldn't be any changes required since the wrapper
>>>> >             will smoothly transition the execution to run as an SDF.
>>>> >             New IOs should strongly prefer SDF since it should be
>>>> >             simpler to write and more flexible, but they can still use
>>>> >             the "*Source"-based APIs. Eventually we'll deprecate those
>>>> >             APIs, but we will never stop supporting them; existing IOs
>>>> >             should all be migrated to SDF and, if there is another
>>>> >             major Beam version, we'll finally be able to remove them.
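The idea behind executing a bounded read as an SDF can be sketched in a few lines. This is a hedged, self-contained illustration (the names mimic Beam's OffsetRange/tryClaim pattern but this is not the SDK's Read wrapper): the DoFn must claim each position in its restriction before producing output for it, which is what lets a runner split the remaining work.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tiny claim-based restriction tracker in the spirit of
// Beam's OffsetRange/tryClaim, showing how a bounded read becomes splittable.
public class MiniSdfRead {
  static final class Tracker {
    long next;      // next unclaimed offset
    final long end; // exclusive end of the restriction
    Tracker(long start, long end) { this.next = start; this.end = end; }

    // Claim a position before emitting output for it; false means "stop".
    boolean tryClaim(long pos) {
      if (pos >= end) return false; // restriction exhausted
      next = pos + 1;
      return true;
    }
  }

  // Stand-in for reading from a source: element i is just "record-" + i.
  static List<String> processRestriction(long start, long end) {
    Tracker tracker = new Tracker(start, end);
    List<String> output = new ArrayList<>();
    for (long i = start; tracker.tryClaim(i); i++) {
      output.add("record-" + i);
    }
    return output;
  }

  public static void main(String[] args) {
    // A runner may split [0, 4) into [0, 2) and [2, 4); the outputs concatenate.
    System.out.println(processRestriction(0, 2)); // [record-0, record-1]
    System.out.println(processRestriction(2, 4)); // [record-2, record-3]
  }
}
```

Because each sub-restriction is processed independently, the wrapper can run an existing BoundedSource under this model without the IO author changing any code.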
>>>> >
>>>> >             On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
>>>> >             <aromanenko....@gmail.com <mailto:
>>>> aromanenko....@gmail.com>>
>>>> >             wrote:
>>>> >
>>>> >                 Hi Luke,
>>>> >
>>>> >                 Great to hear about such progress on this!
>>>> >
>>>> >                 Talking about opt-out for all runners in the future,
>>>> >                 will it require any code changes for current
>>>> >                 “*Source”-based IOs, or will the wrappers completely
>>>> >                 smooth this transition?
>>>> >                 Do we need to require that new IOs be written only
>>>> >                 with SDF or, again, will the wrappers help avoid this?
>>>> >
>>>> >>                 On 10 Aug 2020, at 22:59, Luke Cwik <
>>>> lc...@google.com
>>>> >>                 <mailto:lc...@google.com>> wrote:
>>>> >>
>>>> >>                 In the past couple of months wrappers[1, 2] have been
>>>> >>                 added to the Beam Java SDK which can execute
>>>> >>                 BoundedSource and UnboundedSource as Splittable
>>>> DoFns.
>>>> >>                 These have been opt-out for portable pipelines (e.g.
>>>> >>                 Dataflow runner v2, XLang pipelines on Flink/Spark)
>>>> >>                 and opt-in using an experiment for all other
>>>> pipelines.
>>>> >>
>>>> >>                 I would like to start making the non-portable
>>>> >>                 pipelines, starting with the DirectRunner[3],
>>>> >>                 opt-out, with the plan that eventually all runners
>>>> >>                 will only execute splittable DoFns and the
>>>> >>                 BoundedSource/UnboundedSource-specific execution
>>>> >>                 logic will be removed from the runners.
>>>> >>
>>>> >>                 Users will be able to opt in any pipeline using the
>>>> >>                 experiment 'use_sdf_read' and opt out with the
>>>> >>                 experiment 'use_deprecated_read'. (For portable
>>>> >>                 pipelines these experiments were 'beam_fn_api' and
>>>> >>                 'beam_fn_api_use_deprecated_read' respectively, and I
>>>> >>                 have added these two additional aliases to make the
>>>> >>                 experience less confusing.)
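The opt-in/opt-out precedence described here can be summarized in a short Java sketch. This is a hypothetical helper, not the SDK's actual translation code; only the experiment names come from the message above, and the explicit opt-out is assumed to win over the opt-in.

```java
import java.util.Set;

// Hypothetical helper mirroring the experiment names described above;
// not the Beam SDK's actual translation logic.
public class ReadTranslation {
  static boolean useSdfRead(Set<String> experiments, boolean sdfIsDefault) {
    if (experiments.contains("use_deprecated_read")
        || experiments.contains("beam_fn_api_use_deprecated_read")) {
      return false; // explicit opt-out wins
    }
    if (experiments.contains("use_sdf_read") || experiments.contains("beam_fn_api")) {
      return true; // explicit opt-in
    }
    return sdfIsDefault; // e.g. true for portable pipelines, runner-dependent otherwise
  }

  public static void main(String[] args) {
    System.out.println(useSdfRead(Set.of("use_deprecated_read"), true)); // false
    System.out.println(useSdfRead(Set.of("use_sdf_read"), false));       // true
    System.out.println(useSdfRead(Set.of(), true));                      // true
  }
}
```

With the aliases in place, a user who passes --experiments=use_deprecated_read gets the source-based path regardless of the runner's default.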
>>>> >>
>>>> >>                 1:
>>>> >>
>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>>>> >>                 2:
>>>> >>
>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>>>> >>                 3: https://github.com/apache/beam/pull/12519
>>>> >
>>>> >
>>>> >
>>>> >     --
>>>> >     Pulasthi S. Wickramasinghe
>>>> >     PhD Candidate  | Research Assistant
>>>> >     School of Informatics and Computing | Digital Science Center
>>>> >     Indiana University, Bloomington
>>>> >     cell: 224-386-9035
>>>> >
>>>>
>>>