+1 to this change. Thank you Charles for improving the DirectRunner, sharing your progress and seeking feedback. This change would allow us to migrate to a faster DirectRunner for Python. A long time requested feature and an important part of the first use experience for new users trying out Beam.
Ahmet On Fri, Feb 2, 2018 at 11:00 AM, Charles Chen <c...@google.com> wrote: > Thanks Kenn. We already do the Runner API roundtripping (I believe Robert > implemented this). With this change, we would start doing exactly what > you're suggesting, where we apply overrides to a post-deserialization > pipeline. > > On Thu, Feb 1, 2018 at 6:45 PM Kenneth Knowles <k...@google.com> wrote: > >> +1 for removing apply_* >> >> For the Java SDK, removing specialized intercepts was an important first >> step towards the portability framework. I wonder if there is a way for the >> Python SDK to leapfrog, taking advantage of some of the lessons that Java >> learned a bit more painfully. Most pertinent I think is that if an SDK's >> role is to construct a pipeline and ship the proto to a runner (service) >> then overrides apply to a post-deserialization pipeline. The Java >> DirectRunner does a proto round-trip to avoid accidentally depending on >> things that are not really part of the pipeline. I would this crisp >> abstraction enforcement would add even more value to Python. >> >> Kenn >> >> On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen <c...@google.com> wrote: >> >>> In the Python DirectRunner, we currently use apply_* overrides to >>> override the operation of the default .expand() operation for certain >>> transforms. For example, GroupByKey has a special implementation in the >>> DirectRunner, so we use an apply_* override hook to replace the >>> implementation of GroupByKey.expand(). >>> >>> However, this strategy has drawbacks. Because this override operation >>> happens eagerly during graph construction, the pipeline graph is >>> specialized and modified before a specific runner is bound to the >>> pipeline's execution. This makes the pipeline graph non-portable and blocks >>> full migration to using the Runner API pipeline representation in the >>> DirectRunner. >>> >>> By contrast, the SDK's PTransformOverride mechanism allows the >>> expression of matchers that operate on the unspecialized graph, replacing >>> PTransforms as necessary to produce a DirectRunner-specialized pipeline >>> graph for execution. >>> >>> I therefore propose to replace these eager apply_* overrides with >>> PTransformOverrides that operate on the completely constructed graph. >>> >>> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and >>> I've prepared a candidate patch at https://github.com/apache/ >>> incubator-beam/pull/4529. >>> >>> Best, >>> Charles >>> >> >>