Le 17 févr. 2018 22:31, "Eugene Kirpichov" <[email protected]> a écrit :

On Sat, Feb 17, 2018 at 1:10 PM Romain Manni-Bucau <[email protected]>
wrote:

> You phrased it right Eugene - thanks for that.
>
> However the solution is not functional I think - hope I missed something.
> With distribution etc you cant use by reference param passing, therefore no
> way to clean up the internal states of another fn. So i kind of feel back
> to the original need :(.
>
Correct - in a fault-tolerant distributed data processing system you can't
have one PTransform talk to another PTransform as if it was an in-memory
Java object. If you want to talk to an in-memory non-distributed Java
object, you have to keep it contained to a single block of your own Java
code (e.g. a single ProcessElement call) and use try-with-resources or
something.

If you give an example of a high-level need (e.g. "I'm trying to write an
IO for system $x and it requires the following initialization and the
following cleanup logic and the following processing in between") I'll be
better able to help you.


Take a simple example of a transform requiring a connection. Using bundles
is a perf killer since size is not controlled. Using teardown doesnt allow
you to release the connection since it is a best effort thing. Not
releasing the connection makes you pay a lot - aws ;) - or prevents you to
launch other processings - concurrent limit.

This is a trivial and common case where a clear instance lifecycle is
required to have a well behaving and portably IO - or transform for
distributed locks for instance but this case is more complex.

There are API requiring an acquire and a release calls and setup/teardown
are the best place to do it but must be a must.

If bundles size is no more runner dependent - dont think it is possible
with the coming sdf but explaining why buncles are not an option - it can
become an option since user can set it big enough to ignore other side
effects (concurrency, perf etc). Would also allow to get rid of the maxSize
configs which are quite weird for a batch solution
(chunking/commit-interval is native since more than 30 years in most batch
products no? ... can need to be fixed later enriching sdf api when
supported by all runners, lets ignore it maybe for this thread).



>
> Also interested in a probably stupid question: why
> teardown/setup/startbundle/finishbundle are in the api if it is not
> usable? Dont we want as a portable lib fix that?
>
I'm not following this: they are usable, are successfully used in many IO
connectors, and their advertised behavior is implemented by all runners
(though with somewhat different performance characteristics), including
when using the portability framework. If you feel that this is not the
case, please file a JIRA with a code example that behaves not as you
expected in a specific runner.


>
> Once again i see to technical blocker - checked flink/spark/direct runners
> - to make it usable so why not simplifying user lifes? Anything important i
> miss?
>
> Le 17 févr. 2018 21:11, "Jean-Baptiste Onofré" <[email protected]> a écrit :
>
>> I agree, it's a decent assumption.
>>
>> Regards
>> JB
>>
>> On 02/17/2018 05:59 PM, Romain Manni-Bucau wrote:
>> > Assuming a Pipeline.run(); the corresponding sequence:
>> >
>> > WorkerStartFn();
>> > WorkerEndFn();
>> >
>> > So a single instance of the fn for the full pipeline execution.
>> >
>> > Le 17 févr. 2018 17:42, "Reuven Lax" <[email protected]
>> > <mailto:[email protected]>> a écrit :
>> >
>> >     " and a transform is by design bound to an execution"
>> >
>> >     What do you mean by execution?
>> >
>> >     On Sat, Feb 17, 2018 at 12:50 AM, Romain Manni-Bucau <
>> [email protected]
>> >     <mailto:[email protected]>> wrote:
>> >
>> >
>> >
>> >         Le 16 févr. 2018 22:41, "Reuven Lax" <[email protected]
>> >         <mailto:[email protected]>> a écrit :
>> >
>> >             Kenn is correct. Allowing Fn reuse across bundles was a
>> major, major
>> >             performance improvement. Profiling on the old Dataflow SDKs
>> >             consistently showed Java serialization being the number one
>> >             performance bottleneck for streaming pipelines, and Beam
>> fixed this.
>> >
>> >
>> >         Sorry but this doesnt help me much to understand. Let me try to
>> explain.
>> >         I read it as "we were slow somehow around serialization so a
>> quick fix
>> >         was caching".
>> >
>> >         It is not to be picky but i had a lot of remote ejb over rmi
>> super fast
>> >         setup do java serialization is slower than alternative
>> serialization,
>> >         right, but doesnt justify caching most of the time.
>> >
>> >         My main interrogation is: isnt beam which is designed to be
>> slow in the
>> >         way it designed the dofn/transform and therefore serializes way
>> more
>> >         than it requires - you never care to serialize the full
>> transform and
>> >         can in 95% do a writeReplace which is light and fast compared
>> to the
>> >         default.
>> >
>> >         If so the cache is an implementation workaround and not a fix.
>> >
>> >         Hope my view is clearer on it.
>> >
>> >
>> >
>> >             Romain - can you state precisely what you want? I do think
>> there is
>> >             still a gap - IMO there's a place for a longer-lived per-fn
>> >             container; evidence for this is that people still often
>> need to use
>> >             statics to store things. However I'm not sure if this is
>> what you're
>> >             looking for.
>> >
>> >
>> >         Yes. I build a framework on top of beam and must be able to
>> provide a
>> >         lifecycle clear and reliable. The bare minimum for any user is
>> >         start-exec-stop and a transform is by design bound to an
>> execution
>> >         (stream or batch).
>> >
>> >         Bundles are not an option as explained cause not bound to the
>> execution
>> >         but an uncontrolled subpart. You can see it as a beam internal
>> until
>> >         runners unify this definition. And in any case it is closer to
>> a chunk
>> >         notion than a lifecycle one.
>> >
>> >         So setup and teardown must be symmetric.
>> >
>> >         Note that a dofn instance owns a config so is bound to an
>> execution.
>> >
>> >         This all lead to the nees of a reliable teardown.
>> >
>> >         Caching can be neat bit requires it own api like passivation
>> one of ejbs.
>> >
>> >
>> >
>> >             Reuven
>> >
>> >             On Fri, Feb 16, 2018 at 1:33 PM, Kenneth Knowles <
>> [email protected]
>> >             <mailto:[email protected]>> wrote:
>> >
>> >                 On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau
>> >                 <[email protected] <mailto:[email protected]>>
>> wrote:
>> >
>> >                     The serialization of fn being once per bundle, the
>> perf
>> >                     impact is only huge if there is a bug somewhere
>> else, even
>> >                     java serialization is negligeable on big config
>> compared to
>> >                     any small pipeline (seconds vs minutes).
>> >
>> >
>> >                 Profiling is clear that this is a huge performance
>> impact. One
>> >                 of the most important backwards-incompatible changes we
>> made for
>> >                 Beam 2.0.0 was to allow Fn reuse across bundles.
>> >
>> >                 When we used a DoFn only for one bundle, there was no
>> @Teardown
>> >                 because it has ~no use. You do everything in
>> @FinishBundle. So
>> >                 for whatever use case you are working on, if your
>> pipeline
>> >                 performs well enough doing it per bundle, you can put
>> it in
>> >                 @FinishBundle. Of course it still might not get called
>> because
>> >                 that is a logical impossibility - you just know that
>> for a given
>> >                 element the element will be retried if @FinishBundle
>> fails.
>> >
>> >                 If you have cleanup logic that absolutely must get
>> executed,
>> >                 then you need to build a composite PTransform around it
>> so it
>> >                 will be retried until cleanup succeeds. In Beam's sinks
>> you can
>> >                 find many examples.
>> >
>> >                 Kenn
>> >
>> >
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

Reply via email to