+1 for removing apply_*

For the Java SDK, removing specialized intercepts was an important first
step towards the portability framework. I wonder if there is a way for the
Python SDK to leapfrog, taking advantage of some of the lessons that Java
learned a bit more painfully. Most pertinent, I think, is that if an SDK's
role is to construct a pipeline and ship the proto to a runner (service),
then overrides apply to a post-deserialization pipeline. The Java
DirectRunner does a proto round-trip to avoid accidentally depending on
things that are not really part of the pipeline. I would think this crisp
abstraction enforcement would add even more value to Python.
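
To make the distinction concrete, here is a toy sketch of overrides applied
to an already-constructed graph. It is not Beam code: Node, Override, and
apply_overrides are hypothetical names invented for illustration, and the
graph is a plain tree rather than a real pipeline proto.

```python
# Toy sketch only: Node, Override, and apply_overrides are hypothetical
# names for illustration, not part of the Beam SDK.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One transform in a fully constructed, unspecialized graph."""
    kind: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Override:
    """A matcher plus a replacement, applied after graph construction."""
    matches: Callable[[Node], bool]
    replacement: Callable[[Node], Node]


def apply_overrides(root: Node, overrides: List[Override]) -> Node:
    """Return a runner-specialized copy; the input graph is not mutated."""
    for override in overrides:
        if override.matches(root):
            return override.replacement(root)
    return Node(root.kind, [apply_overrides(c, overrides) for c in root.children])


# The portable graph, as an SDK would hand it to any runner.
portable = Node("Pipeline", [Node("Read"), Node("GroupByKey"), Node("Write")])

# A runner-side override that swaps in a specialized GroupByKey.
direct_gbk = Override(
    matches=lambda n: n.kind == "GroupByKey",
    replacement=lambda n: Node("DirectGroupByKey"),
)

specialized = apply_overrides(portable, [direct_gbk])
```

The point of the sketch is that the portable graph survives unchanged, so
the same graph could still be shipped to a different runner, while the
specialization lives entirely on the runner side.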

Kenn

On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen <[email protected]> wrote:

> In the Python DirectRunner, we currently use apply_* overrides to override
> the operation of the default .expand() operation for certain transforms.
> For example, GroupByKey has a special implementation in the DirectRunner,
> so we use an apply_* override hook to replace the implementation of
> GroupByKey.expand().
>
> However, this strategy has drawbacks. Because this override operation
> happens eagerly during graph construction, the pipeline graph is
> specialized and modified before a specific runner is bound to the
> pipeline's execution. This makes the pipeline graph non-portable and blocks
> full migration to using the Runner API pipeline representation in the
> DirectRunner.
>
> By contrast, the SDK's PTransformOverride mechanism allows the expression
> of matchers that operate on the unspecialized graph, replacing PTransforms
> as necessary to produce a DirectRunner-specialized pipeline graph for
> execution.
>
> I therefore propose to replace these eager apply_* overrides with
> PTransformOverrides that operate on the completely constructed graph.
>
> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and
> I've prepared a candidate patch at
> https://github.com/apache/incubator-beam/pull/4529.
>
> Best,
> Charles
>
