Yes, it does work for Java pipelines, modulo https://github.com/apache/beam/pull/4211. I'm not actually sure what the performance characteristics are, but I'm sure the improvement (if any) is not as dramatic as what we see in Python. It's great for development, though.
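
For anyone who wants to try it from Python, here is a minimal sketch of selecting the runner by name, using the two runner names mentioned in the thread below. The helper names (runner_args, run_tiny_pipeline) and the toy input are made up for illustration, and I'm assuming the runner names resolve in whatever Beam version you have installed.

```python
# Runner names as given in the quoted thread; whether they resolve
# depends on the installed Beam version (assumption, not verified here).
SWITCHING_RUNNER = "SwitchingDirectRunner"
FN_API_RUNNER = "apache_beam.runners.portability.fn_api_runner.FnApiRunner"


def runner_args(runner, extra_args=()):
    """Return pipeline arguments that select `runner` (hypothetical helper)."""
    return ["--runner=" + runner] + list(extra_args)


def run_tiny_pipeline(argv):
    """Run a small batch word count with the given pipeline arguments.

    apache_beam is imported lazily so this file loads without it installed.
    """
    import apache_beam as beam

    # Used as a context manager, the pipeline runs and waits on exit.
    with beam.Pipeline(argv=argv) as p:
        _ = (
            p
            | "Create" >> beam.Create(["to be", "or not to be"])
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling run_tiny_pipeline(runner_args(SWITCHING_RUNNER)) should then run the toy pipeline locally; swapping in FN_API_RUNNER is the manual route described in the original announcement.
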
On Fri, Feb 16, 2018 at 12:06 PM, Marián Dvorský <mari...@google.com> wrote:
> Does the same runner work for Java pipelines? (I assume so, given that it
> uses the portability framework.) If so, does it provide a similar speedup?
>
> On Fri, Feb 16, 2018 at 7:37 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> If there are no concerns, I say let's merge this.
>>
>> On Fri, Feb 16, 2018 at 9:39 AM, Charles Chen <c...@google.com> wrote:
>> > I hope those interested have had time to test this out. I have sent out
>> > https://github.com/apache/beam/pull/4696 to switch to using this fast
>> > runner as the default DirectRunner for local execution. Let me know if
>> > there are any concerns.
>> >
>> > On Tue, Feb 13, 2018 at 12:17 PM Charles Chen <c...@google.com> wrote:
>> >>
>> >> This is now checked into master. You can use it by setting
>> >> --runner=SwitchingDirectRunner. Please let us know if you run into any
>> >> issues.
>> >>
>> >> On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau
>> >> <rmannibu...@gmail.com> wrote:
>> >>>
>> >>> Very interesting! Sounds like a sane way for Beam's future and I'm very
>> >>> happy it is consistent with the current Java experience: no need to
>> >>> interlace runners at the end, it makes design, code and user
>> >>> experience way better than trying to put everything in the direct
>> >>> runner :).
>> >>>
>> >>> On Feb 8, 2018 at 19:20, "María García Herrero" <mari...@google.com>
>> >>> wrote:
>> >>>>
>> >>>> Amazing improvement, Charles.
>> >>>> Thanks for the effort!
>> >>>>
>> >>>> On Thu, Feb 8, 2018 at 10:14 AM Eugene Kirpichov
>> >>>> <kirpic...@google.com> wrote:
>> >>>>>
>> >>>>> Sounds awesome, congratulations and thanks for making this happen!
>> >>>>>
>> >>>>> On Thu, Feb 8, 2018 at 10:07 AM Raghu Angadi <rang...@google.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> This is terrific news! Thanks Charles.
>> >>>>>>
>> >>>>>> On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen <c...@google.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> Local execution of Beam pipelines on the Python DirectRunner
>> >>>>>>> currently suffers from performance issues, which makes it hard for
>> >>>>>>> pipeline authors to iterate, especially on medium to large size
>> >>>>>>> datasets. We would like to optimize and make this a better
>> >>>>>>> experience for Beam users.
>> >>>>>>>
>> >>>>>>> The FnApiRunner was written as a way of leveraging the portability
>> >>>>>>> framework execution code path for local portability development.
>> >>>>>>> We've found it also provides great speedups in batch execution with
>> >>>>>>> no user changes required, so we propose to switch to use this
>> >>>>>>> runner by default in batch pipelines. For example, WordCount on the
>> >>>>>>> Shakespeare dataset with a single CPU core now takes 50 seconds to
>> >>>>>>> run, compared to 12 minutes before; this is a 15x performance
>> >>>>>>> improvement that users can get for free, with no user pipeline
>> >>>>>>> changes.
>> >>>>>>>
>> >>>>>>> The JIRA for this change is here
>> >>>>>>> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
>> >>>>>>> patch is available here (https://github.com/apache/beam/pull/4634).
>> >>>>>>> I have been working over the last month on making this an automatic
>> >>>>>>> drop-in replacement for the current DirectRunner when applicable.
>> >>>>>>> Before it becomes the default, you can try this runner now by
>> >>>>>>> manually specifying
>> >>>>>>> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
>> >>>>>>> runner.
>> >>>>>>>
>> >>>>>>> Even with this change, local Python pipeline execution can only
>> >>>>>>> effectively use one core because of the Python GIL. A natural next
>> >>>>>>> step to further improve performance will be to refactor the
>> >>>>>>> FnApiRunner to allow for multi-process execution. This is being
>> >>>>>>> tracked here (https://issues.apache.org/jira/browse/BEAM-3645).
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> Charles
>> >>>>
>> >>>> --
>> >>>>
>> >>>> Impact is the effect that wouldn’t have happened if you hadn’t done
>> >>>> what you did.
>> >>>>
>> >