I think ProcessContext should go away completely. At that point it has little use except as a way to send output downstream.
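Roughly, something like this sketch - purely hypothetical, just to make the idea concrete: @AsyncElement is the annotation Kenn floats further down, OutputReceiver stands in for "a way to send output downstream", and callRemoteServiceAsync() is a placeholder for whatever non-blocking client is used:

new DoFn<InputT, OutputT>() {
  @ProcessElement
  public void process(
      @AsyncElement CompletionStage<InputT> element,   // hypothetical annotation (see Kenn's mail below)
      OutputReceiver<OutputT> out) {                    // only used to push results downstream
    element
        .thenCompose(in -> callRemoteServiceAsync(in))  // placeholder for a non-blocking call
        .thenAccept(out::output);                       // emit when the call completes
  }
}

Whether the harness or the user blocks on the outstanding stages at the bundle boundary is still the open question Kenn raises below.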
On Sun, Mar 11, 2018, 6:07 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Hmm, thinking out loud, but couldn't CompletionStage be extended to replace ProcessContext, since it represents the element and the output at the same time, no?
>
> On 11 Mar 2018 at 00:57, "Kenneth Knowles" <k...@google.com> wrote:
>
>> Yea, I think it could. But it is probably more readable to not overload the term, plus certainly a bit simpler in implementation. So perhaps @AsyncElement to make it very clear.
>>
>> Kenn
>>
>> On Sat, Mar 10, 2018 at 1:32 PM Reuven Lax <re...@google.com> wrote:
>>
>>> Kenn, can NewDoFn distinguish at generation time the difference between:
>>>
>>> public void process(@Element CompletionStage<InputT> element, ...) {
>>>
>>> and
>>>
>>> public void process(@Element InputT element, ...) {
>>>
>>> If not, then we would probably need separate annotations....
>>>
>>> On Sat, Mar 10, 2018 at 11:09 AM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> Nice! I agree that providing a CompletionStage for chaining is much better than an ExecutorService, and very clear.
>>>>
>>>> It is very feasible to add support that looks like
>>>>
>>>> new DoFn<InputT, OutputT>() {
>>>>   @ProcessElement
>>>>   public void process(@Element CompletionStage<InputT> element, ...) {
>>>>     element.thenApply(...)
>>>>   }
>>>> }
>>>>
>>>> If we had this available, I think users could even experiment with this often, as it might help even where it isn't obvious.
>>>>
>>>> My main hesitation is that a big part of Beam is giving a basic/imperative style of programming a DoFn that executes in a very smart functional/parallel way. Full future-oriented programming is not explored much outside of Javascript (and maybe Haskell) and requires greater discipline in programming in a functional manner - if you are mutating stuff in your callback you are going to have bugs, and then when you add concurrency control you are going to have bad performance and deadlocks. So I definitely wouldn't make it the default or want to spend all our support effort on teaching advanced programming technique.
>>>>
>>>> Kenn
>>>>
>>>> On Sat, Mar 10, 2018 at 9:31 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>>> 2018-03-10 17:30 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>
>>>>>> Have you considered drafting in detail what you think this API might look like?
>>>>>
>>>>> Yes, but it comes after the "enhancements" - for my use cases - and "bugs" list, so I didn't start to work on it much.
>>>>>
>>>>>> If it's a radically different API, it might be more appropriate as an alternative parallel Beam API rather than a replacement for the current API (there is also one such fluent API in the works).
>>>>>
>>>>> What I plan is to draft it on top of Beam (so the "useless" case I spoke about before) and then propose to implement it ~natively and move it to be the main API for another major version.
>>>>>
>>>>>> On Sat, Mar 10, 2018 at 7:23 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> 2018-03-10 16:19 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>>
>>>>>>>> This is another version (maybe a better, Java 8 idiomatic one?) of what Kenn suggested.
>>>>>>>>
>>>>>>>> Note that with NewDoFn this need not be incompatible (so it might not require waiting till Beam 3.0).
>>>>>>>> We can recognize new parameters to processElement and populate them as needed.
>>>>>>>
>>>>>>> This is right; however, in my head it was a one-way move to enforce a reactive design, rather than faking a reactive API with a synchronous, non-reactive implementation, which is what I fear would happen today if both were supported.
>>>>>>>
>>>>>>>> On Sat, Mar 10, 2018, 12:13 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, for the DoFn for instance: instead of processContext.element() giving you a <T>, you get a CompletionStage<T>, and the output gets one as well.
>>>>>>>>>
>>>>>>>>> This way you register an execution chain. Mixed with streams, you get a big data Java 8/9/10 API which enables any connectivity in a well-performing manner ;).
>>>>>>>>>
>>>>>>>>> On 10 Mar 2018 at 13:56, "Reuven Lax" <re...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> So you mean the user should have a way of registering asynchronous activity with a callback (the callback must be registered with Beam, because Beam needs to know not to mark the element as done until all associated callbacks have completed). I think that's basically what Kenn was suggesting, unless I'm missing something.
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, callback based. Beam today is synchronous, and until bundles+combines are reactive friendly, Beam will stay synchronous whatever the other parts do. Becoming reactive would make it possible to manage the threading issues properly and to get better scalability on the overall execution when remote IO is involved.
>>>>>>>>>>>
>>>>>>>>>>> However, it requires breaking the source and SDF designs to use CompletionStage - or an equivalent - to chain the processing properly and in a unified fashion.
>>>>>>>>>>>
>>>>>>>>>>> On 9 Mar 2018 at 23:48, "Reuven Lax" <re...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> If you're talking about reactive programming, at a certain level Beam is already reactive. Are you referring to a specific way of writing the code?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What do you mean by reactive?
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Kenn: why not prefer making Beam reactive? It would allow scaling way more without having to synchronize multithreading by hand. Elegant and efficient :). Beam 3?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 9 Mar 2018 at 22:49, "Kenneth Knowles" <k...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I will start with the "exciting futuristic" answer, which is that we envision the new DoFn to be able to provide an automatic ExecutorService parameter that you can use as you wish.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> new DoFn<>() {
>>>>>>>>>>>>>>   @ProcessElement
>>>>>>>>>>>>>>   public void process(ProcessContext ctx, ExecutorService executorService) {
>>>>>>>>>>>>>>     ... launch some futures, put them in instance vars ...
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   @FinishBundle
>>>>>>>>>>>>>>   public void finish(...) {
>>>>>>>>>>>>>>     ... block on futures, output results if appropriate ...
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This way, the Java SDK harness can use its overarching knowledge of what is going on in a computation to, for example, share a thread pool between different bits. This was one reason to delete IntraBundleParallelization - it didn't allow the runner and user code to properly manage how many things were going on concurrently. And mostly the runner should own parallelizing to max out cores; what user code needs is asynchrony hooks that can interact with that. However, this feature is not thoroughly considered. TBD how much the harness itself manages blocking on outstanding requests versus it being your responsibility in FinishBundle, etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't explored rolling your own here, if you are willing to do the knob tuning to get the threading acceptable for your particular use case. Perhaps someone else can weigh in.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <josh.fe...@bounceexchange.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello all:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Our team has a pipeline that makes external network calls. These pipelines are currently super slow, and the hypothesis is that they are slow because we are not threading our network calls. The github issue below provides some discussion around this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/957
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In Beam 1.0, there was IntraBundleParallelization, which helped with this. However, it was removed because it didn't comply with a few Beam paradigms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Questions going forward:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What is advised for jobs that make blocking network calls? It seems bundling the elements into groups of size X prior to passing them to the DoFn, and managing the threading within the function, might work. Thoughts?
>>>>>>>>>>>>>>> Are these types of jobs even suitable for Beam?
>>>>>>>>>>>>>>> Are there any plans to develop features that help with this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
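For the blocking-call case Josh describes, a rough sketch of the roll-your-own pattern Kenn outlines above - launch the calls on a private pool in @ProcessElement, drain them in @FinishBundle - might look like the following. The pool size and callBlockingService() are placeholders to tune/replace; this is not official guidance, just the pattern made concrete:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

class AsyncLookupFn extends DoFn<String, String> {

  // Result of an in-flight call plus the metadata needed to re-emit it later.
  private static class Pending {
    final CompletableFuture<String> result;
    final Instant timestamp;
    final BoundedWindow window;

    Pending(CompletableFuture<String> result, Instant timestamp, BoundedWindow window) {
      this.result = result;
      this.timestamp = timestamp;
      this.window = window;
    }
  }

  private transient ExecutorService pool;
  private transient List<Pending> pending;

  @Setup
  public void setup() {
    pool = Executors.newFixedThreadPool(32); // the knob you have to tune yourself
  }

  @StartBundle
  public void startBundle() {
    pending = new ArrayList<>();
  }

  @ProcessElement
  public void process(ProcessContext c, BoundedWindow window) {
    String element = c.element();
    // Submit the blocking call instead of doing it inline; keep the future for FinishBundle.
    CompletableFuture<String> result =
        CompletableFuture.supplyAsync(() -> callBlockingService(element), pool);
    pending.add(new Pending(result, c.timestamp(), window));
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) {
    // The bundle is not done until every outstanding call has completed.
    for (Pending p : pending) {
      c.output(p.result.join(), p.timestamp, p.window);
    }
    pending = null;
  }

  @Teardown
  public void teardown() {
    pool.shutdown();
  }

  private String callBlockingService(String in) {
    return in; // placeholder for the real blocking network call
  }
}

The caveat from Kenn's mail still applies: the runner knows nothing about this pool, so the thread count is purely your tuning problem, and a failed call surfaces as an exception from join() in finishBundle, failing the bundle.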