+1 to this change.
Thank you Charles for improving the DirectRunner, sharing your progress and
seeking feedback. This change would allow us to migrate to a faster
DirectRunner for Python. A long time requested feature and an important
part of the first use experience for new users trying out Beam.
On Fri, Feb 2, 2018 at 11:00 AM, Charles Chen <c...@google.com> wrote:
> Thanks Kenn. We already do the Runner API roundtripping (I believe Robert
> implemented this). With this change, we would start doing exactly what
> you're suggesting, where we apply overrides to a post-deserialization
> On Thu, Feb 1, 2018 at 6:45 PM Kenneth Knowles <k...@google.com> wrote:
>> +1 for removing apply_*
>> For the Java SDK, removing specialized intercepts was an important first
>> step towards the portability framework. I wonder if there is a way for the
>> Python SDK to leapfrog, taking advantage of some of the lessons that Java
>> learned a bit more painfully. Most pertinent I think is that if an SDK's
>> role is to construct a pipeline and ship the proto to a runner (service)
>> then overrides apply to a post-deserialization pipeline. The Java
>> DirectRunner does a proto round-trip to avoid accidentally depending on
>> things that are not really part of the pipeline. I would this crisp
>> abstraction enforcement would add even more value to Python.
>> On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen <c...@google.com> wrote:
>>> In the Python DirectRunner, we currently use apply_* overrides to
>>> override the operation of the default .expand() operation for certain
>>> transforms. For example, GroupByKey has a special implementation in the
>>> DirectRunner, so we use an apply_* override hook to replace the
>>> implementation of GroupByKey.expand().
>>> However, this strategy has drawbacks. Because this override operation
>>> happens eagerly during graph construction, the pipeline graph is
>>> specialized and modified before a specific runner is bound to the
>>> pipeline's execution. This makes the pipeline graph non-portable and blocks
>>> full migration to using the Runner API pipeline representation in the
>>> By contrast, the SDK's PTransformOverride mechanism allows the
>>> expression of matchers that operate on the unspecialized graph, replacing
>>> PTransforms as necessary to produce a DirectRunner-specialized pipeline
>>> graph for execution.
>>> I therefore propose to replace these eager apply_* overrides with
>>> PTransformOverrides that operate on the completely constructed graph.
>>> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and
>>> I've prepared a candidate patch at https://github.com/apache/