+1 to that. The programming guide is generally assumed to be up to date which can't be said for arbitrary blog posts. Likely more discoverable too.
On Mon, Oct 19, 2020, 10:17 AM Luke Cwik <lc...@google.com> wrote: > +Rose Nguyen <rtngu...@google.com> suggested that instead of just a blog, > we should add the majority of the current blog's content to the core > programming guide and either drop the blog and/or have a much smaller blog > that links to the docs. > > I think this is a great idea, what do others think? > > On Wed, Oct 14, 2020 at 10:51 AM Luke Cwik <lc...@google.com> wrote: > >> Thanks Alexey, that is correct. >> >> On Wed, Oct 14, 2020 at 10:33 AM Alexey Romanenko < >> aromanenko....@gmail.com> wrote: >> >>> Thanks Luke, just I guess that the proper link should be this one: >>> >>> https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE >>> >>> On 13 Oct 2020, at 00:23, Luke Cwik <lc...@google.com> wrote: >>> >>> I have a draft[1] off the blog ready. Please take a look. >>> >>> 1: >>> http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE#heading=h.tbab2n97o3eo >>> >>> On Mon, Oct 5, 2020 at 4:28 PM Luke Cwik <lc...@google.com> wrote: >>> >>>> >>>> >>>> On Mon, Oct 5, 2020 at 3:45 PM Kenneth Knowles <k...@apache.org> wrote: >>>> >>>>> >>>>> >>>>> On Mon, Oct 5, 2020 at 2:44 PM Luke Cwik <lc...@google.com> wrote: >>>>> >>>>>> For the 2.25 release the Java Direct, Flink, Jet, Samza, Twister2 >>>>>> will use SDF powered Read transforms. Users can opt-out >>>>>> with --experiments=use_deprecated_read. >>>>>> >>>>> >>>>> Huzzah! In our release notes maybe be clear about the expectations for >>>>> users: >>>>> >>>>> Done in https://github.com/apache/beam/pull/13015 >>>> >>>> >>>>> - semantics are expected to be the same: file bugs for any change in >>>>> results >>>>> - perf may vary: file bugs or write to user@ >>>>> >>>>> I was unable to get Spark done for 2.25 as I found out that Spark >>>>>> streaming doesn't support watermark holds[1]. If someone knows more about >>>>>> the watermark system in Spark I could use some guidance here as I >>>>>> believe I >>>>>> have a version of unbounded SDF support written for Spark (I get all the >>>>>> expected output from tests, just that watermarks aren't being held back >>>>>> so >>>>>> PAssert fails). >>>>>> >>>>> >>>>> Spark's watermarks are not comparable to Beam's. The rule as I >>>>> understand it is that any data that is later than `max(seen timestamps) - >>>>> allowedLateness` is dropped. One difference is that dropping is relative >>>>> to >>>>> the watermark instead of expiring windows, like early versions of Beam. >>>>> The >>>>> other difference is that it track the latest event (some call it a "high >>>>> water mark" because it is the highest datetime value seen) where Beam's >>>>> watermark is an approximation of the earliest (some call it a "low water >>>>> mark" because it is a guarantee that it will not dip lower). When I >>>>> chatted >>>>> about this with Amit in the early days, it was necessary to implement a >>>>> Beam-style watermark using Spark state. I think that may still be the >>>>> case, >>>>> for correct results. >>>>> >>>>> >>>> In the Spark implementation I saw that watermark holds weren't wired at >>>> all to control Sparks watermarks and this was causing triggers to fire too >>>> early. >>>> >>>> >>>>> Also, I started a doc[2] to produce an updated blog post since the >>>>>> original SplittableDoFn blog from 2017 is out of date[3]. I was thinking >>>>>> of >>>>>> making this a new blog post and having the old blog post point to it. We >>>>>> could also remove the old blog post and or update it. Any thoughts? >>>>>> >>>>> >>>>> New blog post w/ pointer from the old one. >>>>> >>>>> Finally, I have a clean-up PR[4] that pushes the Read -> primitive >>>>>> Read expansion into each of the runners instead of having it within Read >>>>>> transform within beam-sdks-java-core. >>>>>> >>>>> >>>>> Approved! I did CC a bunch of runner authors already. I think the >>>>> important thing is if a default changes we should be sure everyone is OK >>>>> with the perf changes, and everyone is confident that no incorrect results >>>>> are produced. The abstractions between sdk-core, runners-core-*, and >>>>> individual runners is important to me: >>>>> >>>>> - The SDK's job is to produce a portable, un-tweaked pipeline so >>>>> moving flags out of SDK core (and IOs) ASAP is super important. >>>>> - The runner's job is to execute that pipeline, if they can, however >>>>> they want. If a runner wants to run Read transforms differently/directly >>>>> that is fine. If a runner is incapable of supporting SDF, then Read is >>>>> better than nothing. Etc. >>>>> - The runners-core-* job is to just be internal libraries for runner >>>>> authors to share code, and should not make any decisions about the Beam >>>>> model, etc. >>>>> >>>>> Kenn >>>>> >>>>> 1: https://github.com/apache/beam/pull/12603 >>>>>> 2: http://doc/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE >>>>>> 3: https://beam.apache.org/blog/splittable-do-fn/ >>>>>> 4: https://github.com/apache/beam/pull/13006 >>>>>> >>>>>> >>>>>> On Fri, Aug 28, 2020 at 1:45 AM Maximilian Michels <m...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> Thanks Luke! I've had a pass. >>>>>>> >>>>>>> -Max >>>>>>> >>>>>>> On 28.08.20 01:22, Luke Cwik wrote: >>>>>>> > As an update. >>>>>>> > >>>>>>> > Direct and Twister2 are done. >>>>>>> > Samza: is ready for review[1]. >>>>>>> > Flink: is almost ready for review. [2] lays all the groundwork for >>>>>>> the >>>>>>> > migration and [3] finishes the migration (there is a timeout >>>>>>> happening >>>>>>> > in FlinkSubmissionTest that I'm trying to figure out). >>>>>>> > No further updates on Spark[4] or Jet[5]. >>>>>>> > >>>>>>> > @Maximilian Michels <mailto:m...@apache.org> or @t...@apache.org >>>>>>> > <mailto:thomas.we...@gmail.com>, can either of you take a look at >>>>>>> the >>>>>>> > Flink PRs? >>>>>>> > @ke.wu...@icloud.com <mailto:ke.wu...@icloud.com>, Since Xinyu >>>>>>> delegated >>>>>>> > to you, can you take another look at the Samza PR? >>>>>>> > >>>>>>> > 1: https://github.com/apache/beam/pull/12617 >>>>>>> > 2: https://github.com/apache/beam/pull/12706 >>>>>>> > 3: https://github.com/apache/beam/pull/12708 >>>>>>> > 4: https://github.com/apache/beam/pull/12603 >>>>>>> > 5: https://github.com/apache/beam/pull/12616 >>>>>>> > >>>>>>> > On Tue, Aug 18, 2020 at 11:42 AM Pulasthi Supun Wickramasinghe >>>>>>> > <pulasthi...@gmail.com <mailto:pulasthi...@gmail.com>> wrote: >>>>>>> > >>>>>>> > Hi Luke >>>>>>> > >>>>>>> > Will take a look at this as soon as possible and get back to >>>>>>> you. >>>>>>> > >>>>>>> > Best Regards, >>>>>>> > Pulasthi >>>>>>> > >>>>>>> > On Tue, Aug 18, 2020 at 2:30 PM Luke Cwik <lc...@google.com >>>>>>> > <mailto:lc...@google.com>> wrote: >>>>>>> > >>>>>>> > I have made some good progress here and have gotten to the >>>>>>> > following state for non-portable runners: >>>>>>> > >>>>>>> > DirectRunner[1]: Merged. Supports Read.Bounded and >>>>>>> Read.Unbounded. >>>>>>> > Twister2[2]: Ready for review. Supports Read.Bounded, the >>>>>>> > current runner doesn't support unbounded pipelines. >>>>>>> > Spark[3]: WIP. Supports Read.Bounded, Nexmark suite >>>>>>> passes. Not >>>>>>> > certain about level of unbounded pipeline support coverage >>>>>>> since >>>>>>> > Spark uses its own tiny suite of tests to get unbounded >>>>>>> pipeline >>>>>>> > coverage instead of the validates runner set. >>>>>>> > Jet[4]: WIP. Supports Read.Bounded. Read.Unbounded >>>>>>> definitely >>>>>>> > needs additional work. >>>>>>> > Sazma[5]: WIP. Supports Read.Bounded. Not certain about >>>>>>> level of >>>>>>> > unbounded pipeline support coverage since Spark uses its >>>>>>> own >>>>>>> > tiny suite of tests to get unbounded pipeline coverage >>>>>>> instead >>>>>>> > of the validates runner set. >>>>>>> > Flink: Unstarted. >>>>>>> > >>>>>>> > @Pulasthi Supun Wickramasinghe <mailto: >>>>>>> pulasthi...@gmail.com> , >>>>>>> > can you help me with the Twister2 PR[2]? >>>>>>> > @Ismaël Mejía <mailto:ieme...@gmail.com>, is PR[3] the >>>>>>> expected >>>>>>> > level of support for unbounded pipelines and hence ready >>>>>>> for review? >>>>>>> > @Jozsef Bartok <mailto:jo...@hazelcast.com>, can you help >>>>>>> me out >>>>>>> > to get support for unbounded splittable DoFn's into Jet[4]? >>>>>>> > @Xinyu Liu <mailto:xinyuliu...@gmail.com>, is PR[5] the >>>>>>> expected >>>>>>> > level of support for unbounded pipelines and hence ready >>>>>>> for review? >>>>>>> > >>>>>>> > 1: https://github.com/apache/beam/pull/12519 >>>>>>> > 2: https://github.com/apache/beam/pull/12594 >>>>>>> > 3: https://github.com/apache/beam/pull/12603 >>>>>>> > 4: https://github.com/apache/beam/pull/12616 >>>>>>> > 5: https://github.com/apache/beam/pull/12617 >>>>>>> > >>>>>>> > On Tue, Aug 11, 2020 at 10:55 AM Luke Cwik < >>>>>>> lc...@google.com >>>>>>> > <mailto:lc...@google.com>> wrote: >>>>>>> > >>>>>>> > There shouldn't be any changes required since the >>>>>>> wrapper >>>>>>> > will smoothly transition the execution to be run as an >>>>>>> SDF. >>>>>>> > New IOs should strongly prefer to use SDF since it >>>>>>> should be >>>>>>> > simpler to write and will be more flexible but they >>>>>>> can use >>>>>>> > the "*Source"-based APIs. Eventually we'll deprecate >>>>>>> the >>>>>>> > APIs but we will never stop supporting them. >>>>>>> Eventually they >>>>>>> > should all be migrated to use SDF and if there is >>>>>>> another >>>>>>> > major Beam version, we'll finally be able to remove >>>>>>> them. >>>>>>> > >>>>>>> > On Tue, Aug 11, 2020 at 8:40 AM Alexey Romanenko >>>>>>> > <aromanenko....@gmail.com <mailto: >>>>>>> aromanenko....@gmail.com>> >>>>>>> > wrote: >>>>>>> > >>>>>>> > Hi Luke, >>>>>>> > >>>>>>> > Great to hear about such progress on this! >>>>>>> > >>>>>>> > Talking about opt-out for all runners in the >>>>>>> future, >>>>>>> > will it require any code changes for current >>>>>>> > “*Source”-based IOs or the wrappers should >>>>>>> completely >>>>>>> > smooth this transition? >>>>>>> > Do we need to require to create new IOs only based >>>>>>> on >>>>>>> > SDF or again, the wrappers should help to avoid >>>>>>> this? >>>>>>> > >>>>>>> >> On 10 Aug 2020, at 22:59, Luke Cwik < >>>>>>> lc...@google.com >>>>>>> >> <mailto:lc...@google.com>> wrote: >>>>>>> >> >>>>>>> >> In the past couple of months wrappers[1, 2] have >>>>>>> been >>>>>>> >> added to the Beam Java SDK which can execute >>>>>>> >> BoundedSource and UnboundedSource as Splittable >>>>>>> DoFns. >>>>>>> >> These have been opt-out for portable pipelines >>>>>>> (e.g. >>>>>>> >> Dataflow runner v2, XLang pipelines on >>>>>>> Flink/Spark) >>>>>>> >> and opt-in using an experiment for all other >>>>>>> pipelines. >>>>>>> >> >>>>>>> >> I would like to start making the non-portable >>>>>>> >> pipelines starting with the DirectRunner[3] to be >>>>>>> >> opt-out with the plan that eventually all runners >>>>>>> will >>>>>>> >> only execute splittable DoFns and the >>>>>>> >> BoundedSource/UnboundedSource specific execution >>>>>>> logic >>>>>>> >> from the runners will be removed. >>>>>>> >> >>>>>>> >> Users will be able to opt-in any pipeline using >>>>>>> the >>>>>>> >> experiment 'use_sdf_read' and opt-out with the >>>>>>> >> experiment 'use_deprecated_read'. (For portable >>>>>>> >> pipelines these experiments were 'beam_fn_api' and >>>>>>> >> 'beam_fn_api_use_deprecated_read' respectively >>>>>>> and I >>>>>>> >> have added these two additional aliases to make >>>>>>> the >>>>>>> >> experience less confusing). >>>>>>> >> >>>>>>> >> 1: >>>>>>> >> >>>>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L275 >>>>>>> >> 2: >>>>>>> >> >>>>>>> https://github.com/apache/beam/blob/af1ce643d8fde5352d4519a558de4a2dfd24721d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L449 >>>>>>> >> 3: https://github.com/apache/beam/pull/12519 >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > -- >>>>>>> > Pulasthi S. Wickramasinghe >>>>>>> > PhD Candidate | Research Assistant >>>>>>> > School of Informatics and Computing | Digital Science Center >>>>>>> > Indiana University, Bloomington >>>>>>> > cell: 224-386-9035 <(224)%20386-9035> <tel:(224)%20386-9035> >>>>>>> > >>>>>>> >>>>>> >>>