On Tue, Jun 25, 2019 at 2:26 PM Ismaël Mejía <[email protected]> wrote:
>
> I stumbled recently into BEAM-3644 [1]. This issue mentions that
> Python direct runner saw a great performance gain because of relying
> on portability’s FnApiRunner. This seems a bit counter-intuitive to
> me considering the extra overhead of portability. How is this
> possible, or what is the explanation? My assumption is that in this
> case we are talking about in-process embedding without networking
> services for the Fn API, is that the case? I am surprised it is
> better given the extra
> layers and curious if there are improvements in latency too, so this
> could benefit interactive uses like Jupyter notebooks. Any info or
> details on this?

The FnApiRunner, when used in direct mode, communicates with an
in-process "worker" via the portability protos, but does not do so
over a gRPC connection (and the data channel is a simple, inline one
that avoids the protos altogether), so overall the overhead is minimal
(except for the serialization of elements over the data channel,
though that only happens at fusion breaks). It may be possible to
improve on this a bit (e.g. one idea is encoding only references,
rather than full values, over the data channel, since everything is in
memory, though this could have a negative impact on peak memory usage).
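To make that concrete, here is a toy sketch of an in-process data
channel, not Beam's actual classes: `InlineDataChannel` is a made-up
name, and pickle stands in for a real coder. The point is that
elements are only serialized at fusion breaks, with no gRPC or proto
wrapping in between:

```python
import pickle


class InlineDataChannel:
    """Toy stand-in for an in-process data channel: elements are
    serialized with a coder (here just pickle) only when they cross
    a fusion break; no network or proto layer is involved."""

    def __init__(self):
        self._buffer = []

    def write(self, element):
        # The serialization cost is paid here, at the fusion break.
        self._buffer.append(pickle.dumps(element))

    def read(self):
        return [pickle.loads(b) for b in self._buffer]


channel = InlineDataChannel()
for x in [1, 2, 3]:
    channel.write(x * 10)
print(channel.read())  # [10, 20, 30]
```

The reference-passing idea mentioned above would amount to writing a
handle to the in-memory object instead of its encoded bytes, trading
encoding time for keeping the original objects alive longer.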

The primary performance win is that the FnApiRunner is based on the
same code we use in the Worker, which has been much more optimized and
has a more advanced execution flow (e.g. uses fusion).
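Fusion here means that adjacent element-wise transforms are collapsed
into a single stage, so each element flows through all of them without
materializing intermediate collections. A minimal sketch of the idea
(not Beam's actual optimizer, just the principle):

```python
def fuse(*fns):
    """Collapse a chain of element-wise transforms into one function,
    so no intermediate collection is materialized between them."""
    def fused(element):
        for fn in fns:
            element = fn(element)
        return element
    return fused


# Three logical "stages" executed as one fused stage, per element.
stage = fuse(lambda x: x + 1, lambda x: x * 2, str)
print([stage(x) for x in range(3)])  # ['2', '4', '6']
```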

> I also saw BEAM-3645 [2] to support multi-process execution on python
> direct runner which could eventually bring even better performance.
> Will this work as a 1-to-1 mapping between client and service, or
> will the FnApiRunner somehow deal with concurrency itself? I suppose
> this change will also benefit the performance of Python for
> portability. Just curious to understand if that is the case and how
> it works.

Here the change is that multiple workers would be instantiated, all
talking to the single FnApiRunner's control service(s), though the
workers would likely be separate processes (to get around the GIL), in
which case gRPC would be used. Work would be sharded among the workers
(e.g. a single PCollection split into N pieces and processed
concurrently). I don't think this would impact other portable runners,
as they're already free to start up as many SDK harnesses, with as
many workers within each SDK harness, as they want, but it should be a
big win for running on a single multi-core machine.

> [1] https://issues.apache.org/jira/browse/BEAM-3644
> [2] https://issues.apache.org/jira/browse/BEAM-3645
