Re: Removing the PValueCache from the Beam Python DirectRunner

Charles Chen Thu, 25 Jan 2018 16:12:48 -0800

Yes, that is correct.  The scope of the attached fix is for in-process
runners.  For remote runners, we should think about how to make PCollection
contents available after pipeline execution.  We may also need to better
design eager / interactive execution for that use case, since our current
use of eager mode is geared towards testing transforms locally.


On Thu, Jan 25, 2018 at 4:07 PM Robert Bradshaw <[email protected]> wrote:

> Sounds good. I assume there will still need to be runner-specific
> support for any runner that chooses to implement this (e.g. writing to
> remote files then reading them in?)
>
> On Thu, Jan 25, 2018 at 3:25 PM, Charles Chen <[email protected]> wrote:
> > Currently, the Python SDK supports an eager execution mode.  For example, a
> > list can be directly passed into a PTransform to obtain its result:
> >
> > result = [1, 2, 3] | MyPTransform()
> >
> > To support this use, the Python DirectRunner has an option to cache its
> > intermediate results into a PValueCache.  The above line, when run,
> > implicitly creates an ephemeral pipeline and runs it with the DirectRunner.
> > This, however, adds a lot of complexity to the DirectRunner, and is not
> > generalizable to other in-process Python runners (like the in-process Python
> > FnApiRunner, which runs batch pipelines more efficiently than the current
> > Python DirectRunner).
> >
> > To improve this, I will be removing this DirectRunner-specific
> > implementation and adding functionality that allows all in-process Python
> > runners to be run in eager mode.
> >
> > Jira issue: https://issues.apache.org/jira/browse/BEAM-3537
> > Candidate fix: https://github.com/apache/beam/pull/4492
> >
> > Best,
> > Charles
>
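
For illustration, the eager-mode usage quoted above (a list piped directly
into a transform) can be modeled with a small toy sketch.  This is not Beam's
actual implementation: InProcessRunner, ToyPTransform and Double are
hypothetical stand-ins, and the sketch assumes only that the `|` operator is
handled through Python's reflected-operator hook (`__ror__`), so that the
transform itself decides which in-process runner executes the ephemeral
pipeline.

```python
# Toy model of eager execution; NOT Beam's real classes or API.
# InProcessRunner, ToyPTransform and Double are hypothetical names.


class InProcessRunner:
    """Stand-in for any in-process runner: applies a transform to a list."""

    def run(self, transform, elements):
        # A real runner would build and execute a pipeline graph; here we
        # simply materialize the input and apply the transform once.
        return transform.expand(list(elements))


class ToyPTransform:
    """Minimal transform base class supporting `iterable | transform`."""

    # Swapping this attribute changes the runner for every transform,
    # which is the runner-agnostic property the thread is asking for.
    runner = InProcessRunner()

    def expand(self, elements):
        raise NotImplementedError

    def __ror__(self, left):
        # Python calls this for `left | self` because `list` does not
        # define `|` itself: run an ephemeral pipeline and return the result.
        return self.runner.run(self, left)


class Double(ToyPTransform):
    """Example transform: doubles every element."""

    def expand(self, elements):
        return [2 * x for x in elements]


result = [1, 2, 3] | Double()
print(result)  # [2, 4, 6]
```

Because the result is returned directly from the runner call, this toy needs
no per-runner cache of intermediate values, which is the simplification the
proposal is after.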
