Thanks a lot for the explanation Robert, it makes sense now.
On Tue, Jun 25, 2019 at 2:58 PM Robert Bradshaw <rober...@google.com> wrote: > > On Tue, Jun 25, 2019 at 2:26 PM Ismaël Mejía <ieme...@gmail.com> wrote: > > > > I stumbled recently into BEAM-3644 [1]. This issue mentions that > > Python direct runner saw a great performance gain because of relying > > on portability’s FnApiRunner. This seems to me a bit contra-intuitive > > considering the extra overhead of portability. How is this possible or > > what is the explanation for this? My assumption is that we are talking > > for this case about process embedding without networking services for > > Fn API, is it the case? I am surprised it is better given the extra > > layers and curious if there are improvements in latency too, so this > > could benefit interactive uses like Jupyter notebooks. Any info or > > details on this? > > The FnApiRunner, when used in direct mode, communicates with an > in-process "worker" via the portability protos, but does not do so > over an GRPC connection (and the data channel is a simple, inline one > that avoids the protos altogether), so overall the overhead is minimal > (except the serialization of elements over the data channel, though > that only happens at fusion breaks). It may be possible to improve on > this a bit (e.g. one idea is encoding references only rather than full > values over the data channel, as everything is in memory, though this > could have negative impact on peak memory usage). > > The primary performance win is that the FnApiRunner is based on the > same code we use in the Worker, which has been much more optimized and > has a more advanced execution flow (e.g. uses fusion). > > > I also saw BEAM-3645 [2] to support multi-process execution on python > > direct runner which could eventually bring even better performance. > > Will this work in a 1 to 1 mapping between client and service or > > somehow the FnApiRunner would deal with concurrency? I suppose this > > change will benefit the performance of Python for portability too. > > Just curious to understand if it is the case and how it works. > > Here the change is that multiple workers would be instantiated talking > to the single FnApiRunner control service(s), though likely the > workers would be other processes (to get around the GIL) in which case > GRPC would be used. Work would be sharded among the workers (e.g. a > single PCollection split into N pieces and processed concurrently). I > don't think this would impact other portable runners, as they're > already free to start up as many SDK harnesses, and have as many > workers within each SDK harness, as they want already, but it should > be a big win for running on a single multi-core machine. > > > [1] https://issues.apache.org/jira/browse/BEAM-3644 > > [2] https://issues.apache.org/jira/browse/BEAM-3645