You phrased it right Eugene - thanks for that. However the solution is not functional I think - hope I missed something. With distribution etc you cant use by reference param passing, therefore no way to clean up the internal states of another fn. So i kind of feel back to the original need :(.
Also interested in a probably stupid question: why teardown/setup/startbundle/finishbundle are in the api if it is not usable? Dont we want as a portable lib fix that? Once again i see to technical blocker - checked flink/spark/direct runners - to make it usable so why not simplifying user lifes? Anything important i miss? Le 17 févr. 2018 21:11, "Jean-Baptiste Onofré" <j...@nanthrax.net> a écrit : > I agree, it's a decent assumption. > > Regards > JB > > On 02/17/2018 05:59 PM, Romain Manni-Bucau wrote: > > Assuming a Pipeline.run(); the corresponding sequence: > > > > WorkerStartFn(); > > WorkerEndFn(); > > > > So a single instance of the fn for the full pipeline execution. > > > > Le 17 févr. 2018 17:42, "Reuven Lax" <re...@google.com > > <mailto:re...@google.com>> a écrit : > > > > " and a transform is by design bound to an execution" > > > > What do you mean by execution? > > > > On Sat, Feb 17, 2018 at 12:50 AM, Romain Manni-Bucau < > rmannibu...@gmail.com > > <mailto:rmannibu...@gmail.com>> wrote: > > > > > > > > Le 16 févr. 2018 22:41, "Reuven Lax" <re...@google.com > > <mailto:re...@google.com>> a écrit : > > > > Kenn is correct. Allowing Fn reuse across bundles was a > major, major > > performance improvement. Profiling on the old Dataflow SDKs > > consistently showed Java serialization being the number one > > performance bottleneck for streaming pipelines, and Beam > fixed this. > > > > > > Sorry but this doesnt help me much to understand. Let me try to > explain. > > I read it as "we were slow somehow around serialization so a > quick fix > > was caching". > > > > It is not to be picky but i had a lot of remote ejb over rmi > super fast > > setup do java serialization is slower than alternative > serialization, > > right, but doesnt justify caching most of the time. > > > > My main interrogation is: isnt beam which is designed to be slow > in the > > way it designed the dofn/transform and therefore serializes way > more > > than it requires - you never care to serialize the full > transform and > > can in 95% do a writeReplace which is light and fast compared to > the > > default. > > > > If so the cache is an implementation workaround and not a fix. > > > > Hope my view is clearer on it. > > > > > > > > Romain - can you state precisely what you want? I do think > there is > > still a gap - IMO there's a place for a longer-lived per-fn > > container; evidence for this is that people still often need > to use > > statics to store things. However I'm not sure if this is > what you're > > looking for. > > > > > > Yes. I build a framework on top of beam and must be able to > provide a > > lifecycle clear and reliable. The bare minimum for any user is > > start-exec-stop and a transform is by design bound to an > execution > > (stream or batch). > > > > Bundles are not an option as explained cause not bound to the > execution > > but an uncontrolled subpart. You can see it as a beam internal > until > > runners unify this definition. And in any case it is closer to a > chunk > > notion than a lifecycle one. > > > > So setup and teardown must be symmetric. > > > > Note that a dofn instance owns a config so is bound to an > execution. > > > > This all lead to the nees of a reliable teardown. > > > > Caching can be neat bit requires it own api like passivation one > of ejbs. > > > > > > > > Reuven > > > > On Fri, Feb 16, 2018 at 1:33 PM, Kenneth Knowles < > k...@google.com > > <mailto:k...@google.com>> wrote: > > > > On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau > > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> > wrote: > > > > The serialization of fn being once per bundle, the > perf > > impact is only huge if there is a bug somewhere > else, even > > java serialization is negligeable on big config > compared to > > any small pipeline (seconds vs minutes). > > > > > > Profiling is clear that this is a huge performance > impact. One > > of the most important backwards-incompatible changes we > made for > > Beam 2.0.0 was to allow Fn reuse across bundles. > > > > When we used a DoFn only for one bundle, there was no > @Teardown > > because it has ~no use. You do everything in > @FinishBundle. So > > for whatever use case you are working on, if your > pipeline > > performs well enough doing it per bundle, you can put it > in > > @FinishBundle. Of course it still might not get called > because > > that is a logical impossibility - you just know that for > a given > > element the element will be retried if @FinishBundle > fails. > > > > If you have cleanup logic that absolutely must get > executed, > > then you need to build a composite PTransform around it > so it > > will be retried until cleanup succeeds. In Beam's sinks > you can > > find many examples. > > > > Kenn > > > > > > > > > > -- > Jean-Baptiste Onofré > jbono...@apache.org > http://blog.nanthrax.net > Talend - http://www.talend.com >