Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Luke Cwik Wed, 14 Oct 2020 10:52:09 -0700

Thanks Alexey, that is correct.

On Wed, Oct 14, 2020 at 10:33 AM Alexey Romanenko <[email protected]>
wrote:


> Thanks Luke, just I guess that the proper link should be this one:
>
> https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>
> On 13 Oct 2020, at 00:23, Luke Cwik <[email protected]> wrote:
>
> I have a draft[1] off the blog ready. Please take a look.
>
> 1:
> http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo
>
> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <[email protected]> wrote:
>
>>
>>
>> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <[email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <[email protected]> wrote:
>>>
>>>> For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 will
>>>> use SDF powered Read transforms. Users can opt-out
>>>> with --experiments=use_deprecated_read.
>>>>
>>>
>>> Huzzah! In our release notes maybe be clear about the expectations for
>>> users:
>>>
>>> Done in https://github.com/apache/beam/pull/13015
>>
>>
>>>  - semantics are expected to be the same: file bugs for any change in
>>> results
>>>  - perf may vary: file bugs or write to user@
>>>
>>> I was unable to get Spark done for 2.25 as I found out that Spark
>>>> streaming doesn't support watermark holds[1]. If someone knows more about
>>>> the watermark system in Spark I could use some guidance here as I believe I
>>>> have a version of unbounded SDF support written for Spark (I get all the
>>>> expected output from tests, just that watermarks aren't being held back so
>>>> PAssert fails).
>>>>
>>>
>>> Spark's watermarks are not comparable to Beam's. The rule as I
>>> understand it is that any data that is later than `max(seen timestamps) -
>>> allowedLateness` is dropped. One difference is that dropping is relative to
>>> the watermark instead of expiring windows, like early versions of Beam. The
>>> other difference is that it track the latest event (some call it a "high
>>> water mark" because it is the highest datetime value seen) where Beam's
>>> watermark is an approximation of the earliest (some call it a "low water
>>> mark" because it is a guarantee that it will not dip lower). When I chatted
>>> about this with Amit in the early days, it was necessary to implement a
>>> Beam-style watermark using Spark state. I think that may still be the case,
>>> for correct results.
>>>
>>>
>> In the Spark implementation I saw that watermark holds weren't wired at
>> all to control Sparks watermarks and this was causing triggers to fire too
>> early.
>>
>>
>>> Also, I started a doc[2] to produce an updated blog post since the
>>>> original SplittableDoFn blog from 2017 is out of date[3]. I was thinking of
>>>> making this a new blog post and having the old blog post point to it. We
>>>> could also remove the old blog post and or update it. Any thoughts?
>>>>
>>>
>>> New blog post w/ pointer from the old one.
>>>
>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive Read
>>>> expansion into each of the runners instead of having it within Read
>>>> transform within beam-sdks-java-core.
>>>>
>>>
>>> Approved! I did CC a bunch of runner authors already. I think the
>>> important thing is if a default changes we should be sure everyone is OK
>>> with the perf changes, and everyone is confident that no incorrect results
>>> are produced. The abstractions between sdk-core, runners-core-*, and
>>> individual runners is important to me:
>>>
>>>  - The SDK's job is to produce a portable, un-tweaked pipeline so moving
>>> flags out of SDK core (and IOs) ASAP is super important.
>>>  - The runner's job is to execute that pipeline, if they can, however
>>> they want. If a runner wants to run Read transforms differently/directly
>>> that is fine. If a runner is incapable of supporting SDF, then Read is
>>> better than nothing. Etc.
>>>  - The runners-core-* job is to just be internal libraries for runner
>>> authors to share code, and should not make any decisions about the Beam
>>> model, etc.
>>>
>>> Kenn
>>>
>>> 1: https://github.com/apache/beam/pull/12603
>>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE
>>>> 3: https://beam.apache.org/blog/splittable-do-fn/
>>>> 4: https://github.com/apache/beam/pull/13006
>>>>
>>>>
>>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks Luke! I've had a pass.
>>>>>
>>>>> -Max
>>>>>
>>>>> On 28.08.20 01:22, Luke Cwik wrote:
>>>>> > As an update.
>>>>> >
>>>>> > Direct and Twister2 are done.
>>>>> > Samza: is ready for review[1].
>>>>> > Flink: is almost ready for review. [2] lays all the groundwork for
>>>>> the
>>>>> > migration and [3] finishes the migration (there is a timeout
>>>>> happening
>>>>> > in FlinkSubmissionTest that I'm trying to figure out).
>>>>> > No further updates on Spark[4] or Jet[5].
>>>>> >
>>>>> > @Maximilian Michels <mailto:[email protected]> or @[email protected]
>>>>> > <mailto:[email protected]>, can either of you take a look at
>>>>> the
>>>>> > Flink PRs?
>>>>> > @[email protected] <mailto:[email protected]>, Since Xinyu
>>>>> delegated
>>>>> > to you, can you take another look at the Samza PR?
>>>>> >
>>>>> > 1: https://github.com/apache/beam/pull/12617
>>>>> > 2: https://github.com/apache/beam/pull/12706
>>>>> > 3: https://github.com/apache/beam/pull/12708
>>>>> > 4: https://github.com/apache/beam/pull/12603
>>>>> > 5: https://github.com/apache/beam/pull/12616
>>>>> >
>>>>> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe
>>>>> > <[email protected] <mailto:[email protected]>> wrote:
>>>>> >
>>>>> >     Hi Luke
>>>>> >
>>>>> >     Will take a look at this as soon as possible and get back to you.
>>>>> >
>>>>> >     Best Regards,
>>>>> >     Pulasthi
>>>>> >
>>>>> >     On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <[email protected]
>>>>> >     <mailto:[email protected]>> wrote:
>>>>> >
>>>>> >         I have made some good progress here and have gotten to the
>>>>> >         following state for non-portable runners:
>>>>> >
>>>>> >         DirectRunner[1]: Merged. Supports Read.Bounded and
>>>>> Read.Unbounded.
>>>>> >         Twister2[2]: Ready for review. Supports Read.Bounded, the
>>>>> >         current runner doesn't support unbounded pipelines.
>>>>> >         Spark[3]: WIP. Supports Read.Bounded, Nexmark suite passes.
>>>>> Not
>>>>> >         certain about level of unbounded pipeline support coverage
>>>>> since
>>>>> >         Spark uses its own tiny suite of tests to get unbounded
>>>>> pipeline
>>>>> >         coverage instead of the validates runner set.
>>>>> >         Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded definitely
>>>>> >         needs additional work.
>>>>> >         Sazma[5]: WIP. Supports Read.Bounded. Not certain about
>>>>> level of
>>>>> >         unbounded pipeline support coverage since Spark uses its own
>>>>> >         tiny suite of tests to get unbounded pipeline coverage
>>>>> instead
>>>>> >         of the validates runner set.
>>>>> >         Flink: Unstarted.
>>>>> >
>>>>> >         @Pulasthi Supun Wickramasinghe <mailto:[email protected]
>>>>> > ,
>>>>> >         can you help me with the Twister2 PR[2]?
>>>>> >         @Ismaël Mejía <mailto:[email protected]>, is PR[3] the
>>>>> expected
>>>>> >         level of support for unbounded pipelines and hence ready for
>>>>> review?
>>>>> >         @Jozsef Bartok <mailto:[email protected]>, can you help
>>>>> me out
>>>>> >         to get support for unbounded splittable DoFn's into Jet[4]?
>>>>> >         @Xinyu Liu <mailto:[email protected]>, is PR[5] the
>>>>> expected
>>>>> >         level of support for unbounded pipelines and hence ready for
>>>>> review?
>>>>> >
>>>>> >         1: https://github.com/apache/beam/pull/12519
>>>>> >         2: https://github.com/apache/beam/pull/12594
>>>>> >         3: https://github.com/apache/beam/pull/12603
>>>>> >         4: https://github.com/apache/beam/pull/12616
>>>>> >         5: https://github.com/apache/beam/pull/12617
>>>>> >
>>>>> >         On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik <[email protected]
>>>>> >         <mailto:[email protected]>> wrote:
>>>>> >
>>>>> >             There shouldn't be any changes required since the wrapper
>>>>> >             will smoothly transition the execution to be run as an
>>>>> SDF.
>>>>> >             New IOs should strongly prefer to use SDF since it
>>>>> should be
>>>>> >             simpler to write and will be more flexible but they can
>>>>> use
>>>>> >             the "*Source"-based APIs. Eventually we'll deprecate the
>>>>> >             APIs but we will never stop supporting them. Eventually
>>>>> they
>>>>> >             should all be migrated to use SDF and if there is another
>>>>> >             major Beam version, we'll finally be able to remove them.
>>>>> >
>>>>> >             On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko
>>>>> >             <[email protected] <mailto:
>>>>> [email protected]>>
>>>>> >             wrote:
>>>>> >
>>>>> >                 Hi Luke,
>>>>> >
>>>>> >                 Great to hear about such progress on this!
>>>>> >
>>>>> >                 Talking about opt-out for all runners in the future,
>>>>> >                 will it require any code changes for current
>>>>> >                 “*Source”-based IOs or the wrappers should completely
>>>>> >                 smooth this transition?
>>>>> >                 Do we need to require to create new IOs only based on
>>>>> >                 SDF or again, the wrappers should help to avoid this?
>>>>> >
>>>>> >>                 On 10 Aug 2020, at 22:59, Luke Cwik <
>>>>> [email protected]
>>>>> >>                 <mailto:[email protected]>> wrote:
>>>>> >>
>>>>> >>                 In the past couple of months wrappers[1, 2] have
>>>>> been
>>>>> >>                 added to the Beam Java SDK which can execute
>>>>> >>                 BoundedSource and UnboundedSource as Splittable
>>>>> DoFns.
>>>>> >>                 These have been opt-out for portable pipelines (e.g.
>>>>> >>                 Dataflow runner v2, XLang pipelines on Flink/Spark)
>>>>> >>                 and opt-in using an experiment for all other
>>>>> pipelines.
>>>>> >>
>>>>> >>                 I would like to start making the non-portable
>>>>> >>                 pipelines starting with the DirectRunner[3] to be
>>>>> >>                 opt-out with the plan that eventually all runners
>>>>> will
>>>>> >>                 only execute splittable DoFns and the
>>>>> >>                 BoundedSource/UnboundedSource specific execution
>>>>> logic
>>>>> >>                 from the runners will be removed.
>>>>> >>
>>>>> >>                 Users will be able to opt-in any pipeline using the
>>>>> >>                 experiment 'use_sdf_read' and opt-out with the
>>>>> >>                 experiment 'use_deprecated_read'. (For portable
>>>>> >>                 pipelines these experiments were 'beam_fn_api' and
>>>>> >>                 'beam_fn_api_use_deprecated_read' respectively and I
>>>>> >>                 have added these two additional aliases to make the
>>>>> >>                 experience less confusing).
>>>>> >>
>>>>> >>                 1:
>>>>> >>
>>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275
>>>>> >>                 2:
>>>>> >>
>>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449
>>>>> >>                 3: https://github.com/apache/beam/pull/12519
>>>>> >
>>>>> >
>>>>> >
>>>>> >     --
>>>>> >     Pulasthi S. Wickramasinghe
>>>>> >     PhD Candidate  | Research Assistant
>>>>> >     School of Informatics and Computing | Digital Science Center
>>>>> >     Indiana University, Bloomington
>>>>> >     cell: 224-386-9035 <(224)%20386-9035> <tel:(224)%20386-9035>
>>>>> >
>>>>>
>>>>
>

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

Reply via email to