On Feb 16, 2018, 22:41, "Reuven Lax" <re...@google.com> wrote:
> Kenn is correct. Allowing Fn reuse across bundles was a major, major
> performance improvement. Profiling on the old Dataflow SDKs consistently
> showed Java serialization being the number one performance bottleneck for
> streaming pipelines, and Beam fixed this.

Sorry, but this doesn't help me much to understand. Let me try to explain. I read it as "we were slow somehow around serialization, so a quick fix was caching." Not to be picky, but I have run a lot of remote-EJB-over-RMI setups that were super fast. Java serialization is slower than alternative serializations, sure, but that doesn't justify caching most of the time. My main question is: isn't Beam designed to be slow in the way it designed the DoFn/transform, and therefore serializing far more than it needs to? You never need to serialize the full transform, and in 95% of cases a writeReplace can do it, which is light and fast compared to the default. If so, the cache is an implementation workaround and not a fix. I hope my view is clearer now.

> Romain - can you state precisely what you want? I do think there is still a
> gap - IMO there's a place for a longer-lived per-fn container; evidence for
> this is that people still often need to use statics to store things. However
> I'm not sure if this is what you're looking for.

Yes. I build a framework on top of Beam and must be able to provide a clear and reliable lifecycle. The bare minimum for any user is start-exec-stop, and a transform is by design bound to an execution (stream or batch). Bundles are not an option, as explained, because they are not bound to the execution but to an uncontrolled subpart of it. You can see bundles as a Beam internal until runners unify their definition; in any case, a bundle is closer to a chunk notion than a lifecycle one. So setup and teardown must be symmetric. Note that a DoFn instance owns a config, so it is bound to an execution. All of this leads to the need for a reliable teardown. Caching can be neat, but it requires its own API, like EJB passivation.
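The writeReplace idea above can be sketched in plain Java. This is a minimal, framework-agnostic example of the serialization-proxy pattern, not Beam code: `HeavyFn` and its `Proxy` are hypothetical names standing in for a DoFn that would otherwise drag a large object graph through default Java serialization. Only the small config is written to the stream; the heavy state is rebuilt on the other side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

// Hypothetical stand-in for a fn with expensive runtime state.
public class HeavyFn implements Serializable {
    private static final long serialVersionUID = 1L;

    final String config;                 // small, serializable state
    transient byte[] heavyRuntimeState;  // never serialized; rebuilt on deserialization

    public HeavyFn(String config) {
        this.config = config;
        this.heavyRuntimeState = new byte[1 << 20]; // stand-in for expensive setup
    }

    // Swap in a light proxy instead of serializing the full object graph.
    private Object writeReplace() throws ObjectStreamException {
        return new Proxy(config);
    }

    private static class Proxy implements Serializable {
        private static final long serialVersionUID = 1L;
        final String config;

        Proxy(String config) { this.config = config; }

        // On deserialization, rebuild the heavy object from the config alone.
        private Object readResolve() throws ObjectStreamException {
            return new HeavyFn(config);
        }
    }

    // Round-trip helper for demonstration: serialize then deserialize.
    public static HeavyFn roundTrip(HeavyFn fn) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(fn);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (HeavyFn) ois.readObject();
        }
    }
}
```

The stream only ever contains the `Proxy`, so the serialized form stays small regardless of how heavy the runtime state is.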
Reuven

On Fri, Feb 16, 2018 at 1:33 PM, Kenneth Knowles <k...@google.com> wrote:
> On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>> The serialization of fn being once per bundle, the perf impact is only
>> huge if there is a bug somewhere else; even Java serialization is
>> negligible on a big config compared to any small pipeline (seconds vs
>> minutes).
>>
>
> Profiling is clear that this is a huge performance impact. One of the most
> important backwards-incompatible changes we made for Beam 2.0.0 was to
> allow Fn reuse across bundles.
>
> When we used a DoFn only for one bundle, there was no @Teardown because it
> had ~no use. You do everything in @FinishBundle. So for whatever use case
> you are working on, if your pipeline performs well enough doing it per
> bundle, you can put it in @FinishBundle. Of course it still might not get
> called, because that is a logical impossibility - you just know that for a
> given element, the element will be retried if @FinishBundle fails.
>
> If you have cleanup logic that absolutely must get executed, then you need
> to build a composite PTransform around it so it will be retried until
> cleanup succeeds. In Beam's sinks you can find many examples.
>
> Kenn
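Kenn's last point is that must-run cleanup has to live in a step the runner will retry until it succeeds, rather than in @Teardown, which carries no such guarantee. A plain-Java sketch of that retry contract, outside any Beam API (`RetryingCleanup` and `runUntilSuccess` are illustrative names, not part of the SDK):

```java
import java.util.concurrent.Callable;

// Illustrative model of "retried until cleanup succeeds": the cleanup action
// is re-invoked on failure, mirroring how a runner retries a failed step.
public class RetryingCleanup {

    // Runs the cleanup action until it succeeds or attempts are exhausted.
    // Returns the number of attempts it took; rethrows the last failure
    // if maxAttempts is reached without success.
    public static int runUntilSuccess(Callable<Void> cleanup, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                cleanup.call();
                return attempt;
            } catch (Exception e) {
                last = e; // transient failure: try again
            }
        }
        throw last;
    }
}
```

In Beam terms, the equivalent is wrapping the cleanup in its own composite PTransform so the runner's normal retry semantics apply to it, which is the pattern Kenn says the built-in sinks use.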