Re: PTransforms and Fusion

Henning Rohde Tue, 11 Sep 2018 10:01:53 -0700

Empty pipelines have neither subtransforms or a spec, which is what I don't
think is useful -- especially given the only usecase (which is really
"nop") would create non-timer loops in the representations. I'd rather have
a well-known nop primitive instead. Even now, for the A example, I don't
think it's unreasonable to add a (well-known) identity transform inside a
normal composite to retain the extrema at either end. It could be ignored
at runtime at no cost.


To clarify my support for A1, native transforms would have a spec and would
be passed through in the shared code even through they're not primitives.


On Tue, Sep 11, 2018 at 12:56 AM Robert Bradshaw <[email protected]>
wrote:

> For (A), it really boils down to the question of what is a legal pipeline.
> A1 takes the position that all empty transforms must be on a whitelist
> (which implies B1, unless we make the whitelist extensible, which starts to
> sound a lot like B3). Presumably if we want to support B2, we cannot remove
> all empty unknown transforms, just those whose outputs are a subset of the
> inputs.
>
> The reason I strongly support A3 is that empty PTransforms are not just
> noise, they are expressions of user intent, and the pipeline graph should
> reflect that as faithfully as possible. This is the whole point of
> composite transforms--one should not be required to expose what is inside
> (even whether it's empty). Consider, for example, an A, B -> C transform
> that mixes A and B in proportions to produce C. In the degenerate case
> where we want 100% for A or 100% from B, it's reasonable to implement this
> by just returning A or B directly. But when, say, visualizing the pipeline
> graph, I don't think it's desirable to have the discontinuity of the
> composite transform suddenly disappearing when the mixing parameter is at
> either extreme.
>
> If a runner cannot handle these empty pipelines (as is the case for those
> relying on the current Java libraries) it is an easy matter for it to drop
> them, but that doesn't mean we should withhold this information (by making
> it illegal and dropping it in every SDK) from a runner (or any other tool)
> that would want to see this information.
>
> - Robert
>
>
> On Tue, Sep 11, 2018 at 4:20 AM Henning Rohde <[email protected]> wrote:
>
>> For A, I am in favor of A1 and A2 as well. It is then up to each SDK to
>> not generate "empty" transforms in the proto representation as we avoid
>> noise as mentioned. The shared Java libraries are also optional and we
>> should not assume every runner will use them. I'm not convinced empty
>> transforms would have value for pipeline structure over what can be
>> accomplished with normal composites. I suspect empty transforms, such as A,
>> B -> B, B, will just be confusion generators.
>>
>> For B, I favor B2 for the reasons Thomas mentions. I also agree with the
>> -1 for B1.
>>
>> On Mon, Sep 10, 2018 at 2:51 PM Thomas Weise <[email protected]> wrote:
>>
>>> For B, note the prior discussion [1].
>>>
>>> B1 and B2 cannot be supported at the same time.
>>>
>>> Native transforms will almost always be customizations. Users do not
>>> create customizations without reason. They would start with what is there
>>> and dig deeper only when needed. Right now there are no streaming
>>> connectors in the Python SDK - should the user not use the SDK? Or is it
>>> better (now and in general) to have the option of a custom connector, even
>>> when it is not portable?
>>>
>>> Regarding portability, IMO it should be up to the user to decide how
>>> much of it is necessary/important. The IO requirements are normally
>>> dictated by the infrastructure. If it has Kafka, Kinesis or any other
>>> source (including those that Beam might never have a connector for), the
>>> user needs to be able to integrate it.
>>>
>>> Overall extensibility is important and will help users adopt Beam. This
>>> has come up in a few other areas (think Docker environments). I think we
>>> need to provide the flexibility and enable, not prevent alternatives and
>>> therefore -1 for B1 (unsurprisingly :).
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> A) I think it's a bug to not handle empty PTransforms (which are useful
>>>> at pipeline construction, and may still have meaning in terms of pipeline
>>>> structure, e.g. for visualization). Note that such transforms, if truly
>>>> composite, can't output any PCollections that do not appear in their inputs
>>>> (which is how we distinguish them from primitives in Python). Thus I'm in
>>>> favor of A3, and as a stopgap we can drop these transforms as part of/just
>>>> before decoding in the Java libraries (rather than in the SDKs during
>>>> encoding as in A2).
>>>>
>>>> B) I'm also for B1 or B2.
>>>>
>>>>
>>>> On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels <[email protected]>
>>>> wrote:
>>>>
>>>>> > A) What should we do with these "empty" PTransforms?
>>>>>
>>>>> We can't translate them, so dropping them seems the most reasonable
>>>>> choice. Should we throw an error/warning to make the user aware of
>>>>> this?
>>>>> Otherwise might be unexpected for the user.
>>>>>
>>>>> >> A3) Handle the "empty" PTransform case within all of the shared
>>>>> libraries.
>>>>>
>>>>> What can we do at this point other than dropping them?
>>>>>
>>>>> > B) What should we do with "native" PTransforms?
>>>>>
>>>>> I support B1 and B2 as well. Non-portable PTransforms should be
>>>>> discouraged in the long run. However, the available PTransforms are
>>>>> not
>>>>> even consistent across the different SDKs yet (e.g. no streaming
>>>>> connectors in Python), so we should continue to provide a way to
>>>>> utilize
>>>>> the "native" transforms of a Runner.
>>>>>
>>>>> -Max
>>>>>
>>>>> On 07.09.18 19:15, Lukasz Cwik wrote:
>>>>> > A primitive transform is a PTransform that has been chosen to have
>>>>> no
>>>>> > default implementation in terms of other PTransforms. A primitive
>>>>> > transform therefore must be implemented directly by a pipeline
>>>>> runner in
>>>>> > terms of pipeline-runner-specific concepts. An initial list of
>>>>> primitive
>>>>> > PTransforms were defined in [2] and has since been updated in [3].
>>>>> >
>>>>> > As part of the portability effort, libraries that are intended to be
>>>>> > shared across multiple runners are being developed to support their
>>>>> > migration to a portable execution model. One of these is responsible
>>>>> for
>>>>> > fusing multiple primitive PTransforms together into a pipeline
>>>>> runner
>>>>> > specific concept. This library made the choice that a primitive
>>>>> > PTransform is a PTransform that doesn't contain any other
>>>>> PTransforms.
>>>>> >
>>>>> > Unfortunately, while Ryan was attempting to enable testing of
>>>>> validates
>>>>> > runner tests for Flink using the new portability libraries, he ran
>>>>> into
>>>>> > an issue where the Apache Beam Java SDK allows for a person to
>>>>> construct
>>>>> > a PTransform that has zero sub PTransforms and also isn't one of the
>>>>> > defined Apache Beam primitives. In this case the PTransform was
>>>>> trivial
>>>>> > as it was not applying any additional transforms to input
>>>>> PCollection
>>>>> > and just returning it. This caused an issue within the portability
>>>>> > libraries since they couldn't handle this structure.
>>>>> >
>>>>> > To solve this issue, I had proposed that we modify the portability
>>>>> > library that does fusion to use a whitelist of primitives preventing
>>>>> the
>>>>> > issue from happening. This solved the problem but caused an issue
>>>>> for
>>>>> > Thomas as he was relying on this behaviour of PTransforms with zero
>>>>> sub
>>>>> > transforms being primitives. Thomas has a use-case where he wants to
>>>>> > expose the internal Flink Kafka and Kinesis connectors and to build
>>>>> > Apache Beam pipelines that use the Flink native sources/sinks. I'll
>>>>> call
>>>>> > these "native" PTransforms, since they aren't part of the Apache
>>>>> Beam
>>>>> > model and are runner specific.
>>>>> >
>>>>> > This brings up two topics:
>>>>> > A) What should we do with these "empty" PTransforms?
>>>>> > B) What should we do with "native" PTransforms?
>>>>> >
>>>>> > The typical flow of a pipeline representation for a portable
>>>>> pipeline is:
>>>>> > language specific representation -> proto representation -> job
>>>>> service
>>>>> > -> shared libraries that simplify/replace the proto representation
>>>>> with
>>>>> > a simplified version (e.g. fusion) -> runner specific conversion to
>>>>> > native runner concepts (e.g. GBK -> runner implementation of GBK)
>>>>> >
>>>>> > ------------------
>>>>> >
>>>>> > A) What should we do with these "empty" PTransforms?
>>>>> >
>>>>> > To give a little more detail, these transforms typically can happen
>>>>> if
>>>>> > people have conditional logic such as loops that would perform an
>>>>> > expansion but do nothing if the condition is immediately
>>>>> unsatisfied. So
>>>>> > allowing for PTransforms that are empty is useful when building a
>>>>> pipeline.
>>>>> >
>>>>> > What should we do:
>>>>> > A1) Stick with the whitelist of primitive PTransforms.
>>>>> > A2) When converting the pipeline from language specific
>>>>> representation
>>>>> > into the proto representation, drop any "empty" PTransforms. This
>>>>> means
>>>>> > that the pipeline representation that is sent to the runner doesn't
>>>>> > contain the offending type of PTransform and the shared libraries
>>>>> > wouldn't have to change.
>>>>> > A3) Handle the "empty" PTransform case within all of the shared
>>>>> libraries.
>>>>> >
>>>>> > I like doing both A1 and A2. A1 since it helps simplify the shared
>>>>> > libraries since we know the whole list of primitives we need to
>>>>> > understand and A2 because it removes noise within the pipeline shape
>>>>> > from its representation.
>>>>> >
>>>>> > ------------------
>>>>> >
>>>>> > B) What should we do with "native" PTransforms?
>>>>> >
>>>>> > Some approaches that we could take as a community:
>>>>> >
>>>>> > B1) Prevent the usage of "native" PTransforms within Apache Beam
>>>>> since
>>>>> > they hurt portability of pipelines across runners. This can be done
>>>>> by
>>>>> > specifically using whitelists of allowed primitive PTransforms in
>>>>> the
>>>>> > shared libraries and explicitly not allowing for shared libraries to
>>>>> > have extension points customizing this.
>>>>> >
>>>>> > B2) We embrace that closed systems internal to companies will want
>>>>> to
>>>>> > use their own extensions and enable support for "native" PTransforms
>>>>> but
>>>>> > actively discourage "native" PTransforms in the open ecosystem.
>>>>> >
>>>>> > B3) We embrace and allow for "native" PTransforms in the open
>>>>> ecosystem.
>>>>> >
>>>>> > "native" PTransforms are useful in closed systems since they allow
>>>>> > companies to solve certain scenarios which would not be practical to
>>>>> > expose the Apache Beam community. It does take more work for the
>>>>> > community to support these types of extensions. To my knowledge,
>>>>> Google
>>>>> > is likely to want to do something similar to handle internal use
>>>>> cases
>>>>> > similar to what Thomas is trying to do.
>>>>> >
>>>>> > I'm for either B1 or B2 since the risk of embracing and allowing for
>>>>> > "native" PTransforms in the open ecosystem is likely to fragment the
>>>>> > project and also is counter to what portability is really about.
>>>>> >
>>>>> > 1: https://github.com/apache/beam/pull/6328
>>>>> > 2:
>>>>> >
>>>>> https://docs.google.com/document/d/1bao-5B6uBuf-kwH1meenAuXXS0c9cBQ1B2J59I3FiyI/edit#heading=h.tt55lhd3k6by
>>>>> >
>>>>> > 3:
>>>>> >
>>>>> https://github.com/apache/beam/blob/6df2ef3ec9c835097e79b4441ce47ff09a458894/model/pipeline/src/main/proto/beam_runner_api.proto#L180
>>>>>
>>>>

Re: PTransforms and Fusion

Reply via email to