" and a transform is by design bound to an execution" What do you mean by execution?
On Sat, Feb 17, 2018 at 12:50 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> On Feb 16, 2018, 22:41, "Reuven Lax" <re...@google.com> wrote:
>
> Kenn is correct. Allowing Fn reuse across bundles was a major, major
> performance improvement. Profiling on the old Dataflow SDKs consistently
> showed Java serialization being the number one performance bottleneck for
> streaming pipelines, and Beam fixed this.
>
>
> Sorry, but this doesn't help me much to understand. Let me try to explain.
> I read it as "we were slow somehow around serialization, so a quick fix
> was caching".
>
> It is not to be picky, but I had a lot of super-fast remote-EJB-over-RMI
> setups, so Java serialization is slower than alternative serializations,
> right, but that doesn't justify caching most of the time.
>
> My main question is: isn't Beam designed to be slow in the way it designed
> the DoFn/transform, and therefore serializing way more than it requires?
> You never need to serialize the full transform and can, in 95% of cases,
> do a writeReplace which is light and fast compared to the default.
>
> If so, the cache is an implementation workaround and not a fix.
>
> Hope my view is clearer on it.
>
>
> Romain - can you state precisely what you want? I do think there is still
> a gap - IMO there's a place for a longer-lived per-fn container; evidence
> for this is that people still often need to use statics to store things.
> However I'm not sure if this is what you're looking for.
>
>
> Yes. I build a framework on top of Beam and must be able to provide a
> clear and reliable lifecycle. The bare minimum for any user is
> start-exec-stop, and a transform is by design bound to an execution
> (stream or batch).
>
> Bundles are not an option, as explained, because they are bound not to the
> execution but to an uncontrolled subpart of it. You can see them as a Beam
> internal until runners unify their definition. And in any case they are
> closer to a chunk notion than a lifecycle one.
>
> So setup and teardown must be symmetric.
>
> Note that a DoFn instance owns a config, so it is bound to an execution.
>
> This all leads to the need for a reliable teardown.
>
> Caching can be neat, but it requires its own API, like the EJB passivation
> one.
>
>
> Reuven
>
> On Fri, Feb 16, 2018 at 1:33 PM, Kenneth Knowles <k...@google.com> wrote:
>
>> On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>>
>>> The serialization of the fn happening once per bundle, the perf impact
>>> is only huge if there is a bug somewhere else; even Java serialization
>>> is negligible on a big config compared to any small pipeline (seconds
>>> vs minutes).
>>>
>>
>> Profiling is clear that this is a huge performance impact. One of the
>> most important backwards-incompatible changes we made for Beam 2.0.0 was
>> to allow Fn reuse across bundles.
>>
>> When we used a DoFn for only one bundle, there was no @Teardown because
>> it had ~no use. You do everything in @FinishBundle. So for whatever use
>> case you are working on, if your pipeline performs well enough doing it
>> per bundle, you can put it in @FinishBundle. Of course it still might not
>> get called - guaranteeing that is a logical impossibility - you just know
>> that for a given element, the element will be retried if @FinishBundle
>> fails.
>>
>> If you have cleanup logic that absolutely must get executed, then you
>> need to build a composite PTransform around it so it will be retried
>> until cleanup succeeds. In Beam's sinks you can find many examples.
>>
>> Kenn
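For readers who want the concrete shape of Romain's writeReplace point: Java serialization lets a class substitute a lightweight proxy for itself at serialization time, so only the configuration crosses the wire and the heavy state is rebuilt on the worker. This is a minimal sketch in plain Java - ConfiguredFn, FnConfig, and HeavyClient are invented names for the example, not Beam APIs:

    import java.io.Serializable;

    // Invented names for the sketch: a small serializable config and a
    // heavy runtime object that should never be serialized.
    class FnConfig implements Serializable {
      final String endpoint;
      FnConfig(String endpoint) { this.endpoint = endpoint; }
    }

    class HeavyClient {}

    class ConfiguredFn implements Serializable {
      private final FnConfig config;         // small, cheap to serialize
      private transient HeavyClient client;  // rebuilt on the worker, never serialized

      ConfiguredFn(FnConfig config) { this.config = config; }

      // Java serialization hook: replace this instance with a light proxy
      // so only the config is written, not the full object graph.
      private Object writeReplace() {
        return new Proxy(config);
      }

      private static final class Proxy implements Serializable {
        private final FnConfig config;
        Proxy(FnConfig config) { this.config = config; }

        // On deserialization, rebuild the real object from its config alone.
        private Object readResolve() {
          return new ConfiguredFn(config);
        }
      }
    }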
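For reference, here are the lifecycle methods the thread keeps referring to, with the guarantees Kenn describes spelled out as comments. The class is illustrative, but the annotations are the real Beam Java ones:

    import org.apache.beam.sdk.transforms.DoFn;

    // Illustrative DoFn showing the lifecycle methods under discussion.
    class LifecycleFn extends DoFn<String, String> {
      private transient Object resource;

      @Setup
      public void setup() {
        // Called once per DoFn instance; since Beam 2.0.0 that instance
        // may be reused across many bundles.
        resource = new Object();
      }

      @StartBundle
      public void startBundle() {
        // Per-bundle initialization.
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element());
      }

      @FinishBundle
      public void finishBundle() {
        // Flush per-bundle work here: if this throws, the bundle's
        // elements are retried.
      }

      @Teardown
      public void teardown() {
        // Best-effort only: a worker can die without ever calling this,
        // which is the gap Romain is describing.
        resource = null;
      }
    }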
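And a rough sketch of the composite-PTransform pattern Kenn points to for cleanup that must run: make the cleanup an ordinary pipeline step downstream of the main work, so the runner's retry semantics cover it. Real Beam sinks gate the cleanup step more carefully (e.g. on a completion signal after all writes finish); WriteThenCleanup and its inner fns here are hypothetical and only show the shape:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical composite: the cleanup is a pipeline step, not a
    // lifecycle callback, so a failure in it is retried by the runner.
    class WriteThenCleanup extends PTransform<PCollection<String>, PCollection<Void>> {
      @Override
      public PCollection<Void> expand(PCollection<String> input) {
        PCollection<String> written =
            input.apply("Write", ParDo.of(new WriteFn()));
        // Unlike @Teardown, this step participates in the runner's
        // retry/consistency model.
        return written.apply("Cleanup", ParDo.of(new CleanupFn()));
      }

      private static class WriteFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(c.element()); // the main work, stubbed out
        }
      }

      private static class CleanupFn extends DoFn<String, Void> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // cleanupResource(c.element()); // hypothetical; retried until it succeeds
        }
      }
    }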