Hi,

Could this proposal be generalized to annotations of PCollections as well? Maybe that reduces to several types of annotations of a PTransform, e.g.:

 a) runtime annotations of a PTransform (these might be scheduling hints, e.g. schedule this task on nodes with GPUs, etc.)

 b) output annotations, i.e. annotations that actually apply to PCollections, as every PCollection has at most one producer (this is what has actually been discussed in the referenced mailing list threads)

It would be cool if this added the option to do PTransform expansions based on annotations of input PCollections. We tried to play with this in the Euphoria DSL, but it turned out it would fit best in the Beam SDK.

An example of an expansion that is sensitive to input annotations might be CoGBK: when one side is annotated with e.g. FitsInMemoryPerWindow (or SmallPerWindow, or whatever), CoGBK might be expanded using a broadcast instead of a full shuffle.
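
To make the idea concrete, here is a minimal sketch in plain Python. It is not Beam API; the function names and the strategy labels are hypothetical, and only the annotation name FitsInMemoryPerWindow comes from the text above. It picks an expansion from the input annotations and shows that both expansions give the same grouping for a single window:

```python
from collections import defaultdict

# Annotation name from the discussion; treating it as a plain string is an assumption.
FITS_IN_MEMORY_PER_WINDOW = "FitsInMemoryPerWindow"


def choose_cogbk_strategy(left_annotations, right_annotations):
    """Pick an expansion based only on the (optional) input annotations."""
    if FITS_IN_MEMORY_PER_WINDOW in right_annotations:
        return "broadcast-right"
    if FITS_IN_MEMORY_PER_WINDOW in left_annotations:
        return "broadcast-left"
    return "shuffle"


def co_group_by_key(left, right, left_annotations=frozenset(),
                    right_annotations=frozenset()):
    """Toy CoGBK over two lists of (key, value) pairs in one window.

    Returns (strategy, {key: (left_values, right_values)}).
    """
    strategy = choose_cogbk_strategy(left_annotations, right_annotations)
    result = defaultdict(lambda: ([], []))
    if strategy == "shuffle":
        # Full shuffle: both sides are regrouped by key.
        for side, pairs in enumerate((left, right)):
            for key, value in pairs:
                result[key][side].append(value)
    else:
        small_side = 1 if strategy == "broadcast-right" else 0
        small, large = (right, left) if small_side == 1 else (left, right)
        # Broadcast: materialize the small side as an in-memory table ...
        table = defaultdict(list)
        for key, value in small:
            table[key].append(value)
        # ... and stream the large side past it without regrouping the small one.
        for key, value in large:
            result[key][1 - small_side].append(value)
        for key, values in table.items():
            result[key][small_side].extend(values)
    return strategy, dict(result)
```

Calling co_group_by_key with and without the annotation on one side yields the same grouping, which is the point: the annotation only changes how the result is computed, so a runner or SDK that ignores it is still correct.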

I absolutely agree that none of this may have anything to do with semantics or correctness, so annotations can always be safely ignored. That might also answer @Reuven's last question: when there are conflicting annotations, it would be possible to simply drop them as a last resort.
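
A rough sketch of what "drop them as a last resort" could look like when annotations are resolved down a PTransform hierarchy. This is plain Python with a hypothetical dict-based representation of transforms, not anything that exists in Beam, and the annotation keys are made up for illustration:

```python
def resolve_annotations(inherited, own):
    """Merge annotations from enclosing transforms with a transform's own.

    If both define the same key with different values, the key is dropped.
    This is safe only because annotations are optional hints.
    """
    resolved = {}
    for key in inherited.keys() | own.keys():
        if key in inherited and key in own and inherited[key] != own[key]:
            continue  # conflict: drop the annotation entirely
        resolved[key] = own[key] if key in own else inherited[key]
    return resolved


def resolve_to_leaves(transform, inherited=None):
    """Resolve annotations for every leaf of a (hypothetical) transform tree.

    A transform is a dict with "name", optional "annotations" and
    optional "children". Returns {leaf_name: resolved_annotations}.
    """
    resolved = resolve_annotations(inherited or {}, transform.get("annotations", {}))
    children = transform.get("children", [])
    if not children:
        return {transform["name"]: resolved}
    leaves = {}
    for child in children:
        leaves.update(resolve_to_leaves(child, resolved))
    return leaves


# Made-up example: the composite asks for GPU, one leaf asks for CPU.
pipeline = {
    "name": "composite",
    "annotations": {"hardware": "gpu", "secure": True},
    "children": [
        {"name": "leaf1", "annotations": {"hardware": "cpu"}},
        {"name": "leaf2", "annotations": {"profile": True}},
    ],
}
leaf_annotations = resolve_to_leaves(pipeline)
```

Here leaf1 ends up with only the non-conflicting "secure" hint, while leaf2 inherits everything from the composite. Whether dropping, preferring the innermost value, or something else is the right policy would presumably be up to each annotation, as Robert notes below.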

Jan

On 11/16/20 8:13 PM, Robert Burke wrote:
I imagine it is up to each specific annotation to define that.

The runner notionally doesn't need to do anything with them, as they are optional, and not required for correctness.

On Mon, Nov 16, 2020, 10:56 AM Reuven Lax <re...@google.com <mailto:re...@google.com>> wrote:

    PTransforms are hierarchical - namely a PTransform contains other
    PTransforms, and so on. Is the runner expected to resolve all
    annotations down to leaf nodes? What happens if that results in
    conflicting annotations?

    On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <rob...@frantil.com
    <mailto:rob...@frantil.com>> wrote:

        That's a good question.

        I think the main difference is a matter of scope. Annotations
        would apply to a PTransform while an environment applies to
        sets of transforms. Another difference is the optional nature of
        annotations: they don't affect correctness. Runners don't need
        to do anything with them and still execute the pipeline
        correctly.

        Consider a privacy analysis on a pipeline graph. An annotation
        indicating that a transform provides a certain level of
        anonymization can be used in an analysis to determine if the
        downstream transforms are encountering raw data or not.

        From my understanding (which can be wrong) environments are
        rigid. Transforms in different environments can't be fused.
        "This is the python env", "this is the java env" can't be
        merged together. It's not clear to me that we have defined
        when environments are safely fuseable outside of equality.
        There's value in that simplicity.

        AFAICT, an environment has less to do with the machines a
        pipeline is executing on than with the kinds of SDK pipelines
        it understands and can execute.



        On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova
        <chad...@gmail.com <mailto:chad...@gmail.com>> wrote:


                Another example of an optional annotation is marking a
                transform to run on secure hardware, or to give hints
                to profiling/dynamic analysis tools.


            There seems to be a lot of overlap between this idea and
            Environments.  Can you talk about how you feel they may be
            different or related? For example, I could see annotations
            as a way of tagging transforms with an Environment, or I
            could see Environments becoming a specialized form of
            annotation.

            -chad
