Re: PTransform Annotations Proposal

Kenneth Knowles Mon, 23 Nov 2020 18:02:18 -0800

On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <[email protected]> wrote:


> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <[email protected]>
> wrote:
> >
> > Thanks everyone so much for their input and for the insightful
> discussion.
> >
> > Not being knowledgeable about Beam's internals, I have to say I am a bit
> lost on the PTransform vs. environment discussion.
> >
> > I do agree with Burke's notion that merge rules are very annotation
> dependent, I don't think we can find a one-size-fits-all solution for that.
> So this might be actually be an argument in favour of having annotations on
> PTransforms, since it avoids the conflation with environments.
> >
> > Also in general, I feel that having annotations per single transform
> (rather than composite) and on PTransforms could lead to a simpler design.
>
> If we want to use these for privacy, I don't see how attaching them to
> leaf transforms alone could work. (Even CombinePerKey is a composite.)
>
> > Seeing as there are valuable arguments in favour of both (PTransform and
> environments) with no clear(?) "best solution", I would propose moving
> forward with the initial (PTransform) design to ship the feature and
> unblock teams asking for it. If it turns out that there was indeed a need
> to have annotations in environments, we could always refactor it.
>
> I have yet to see any arguments that resource-level hints, such as
> memory or GPU, don't better belong on the environment. But moving
> forward on PTransform-level ones for logical statements like privacy
> declarations makes sense.
>

Exactly this. Properties of transforms make sense. The properties may hold
only of the whole subgraph. Even something as simple as "preserves keys".
This is analogous (but converse) to requirements like "requires sorted
input" which were explicitly excluded from the scope, which was about
hardware environment for execution.

The proposed scope and the proposed solution do not match and need to be
reconciled.

Kenn

> On 2020/11/17 19:07:22, Robert Bradshaw <[email protected]> wrote:
> > > So far we have two distinct usecases for annotations: resource hints
> > > and privacy directives, and I've been trying to figure out how to
> > > reconcile them, but they seem to have very different characteristics.
> > > (It would be nice to come up with other uses as well to see if we're
> > > really coming up with a generally useful mode--I think display data
> > > could fit into this as a new kind of annotation rather than being a
> > > top-level property, and it could make sense on both leaf and composite
> > > transforms.)
> > >
> > > To me, resource hints like GPU are inextricably tied to the
> > > environment. A transform tagged with GPU should reference a Fn that
> > > invokes GPU-accelerated code that lives in a particular environment.
> > > Something like high-mem is a bit squishier. Some DoFns take a lot of
> > > memory, but on the other hand one could imagine labeling a CoGBK as
> > > high-mem due to knowing that, in this particular usage, there will be
> > > lots of values with the same key. Ideally runners would be intelligent
> > > enough to automatically learn memory usage, but even in this case it
> > > may be a good hint to try and learn the requirements for DoFn A and
> > > DoFn B separately (which is difficult if they are always colocated,
> > > but valuable if, e.g. A takes a huge amount of memory and B takes a
> > > huge amount of wall time).
> > >
> > > Note that tying things to the environment does not preclude using them
> > > in non-portable runners as they'll still have an SDK-level
> > > representation (though I don't think we should have an explicit goal
> > > of feature parity for non-portable runners, e.g. multi-language isn't
> > > happening, and hope that non-portable runners go away soon anyway).
> > >
> > > Now let's consider privacy annotations. To make things very concrete,
> > > imagine a transform AverageSpendPerZipCode which takes as input (user,
> > > zip, spend), all users unique, and returns (zip, avg(spend)). In
> > > Python, this is GroupBy('zip').aggregate_field('spend',
> > > MeanCombineFn()). This is not very privacy preserving to those users
> > > who are the only (or one of a few) in a zip code. So we could define a
> > > transform PrivacyPreservingAverageSpendPerZipCode as
> > >
> > > @ptransform_fn
> > > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
> > >     counts_per_zip = spend_per_user |
> > > GroupBy('zip').aggregate_field('user', CountCombineFn())
> > >     spend_per_zip = spend_per_user |
> > > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> > >     filtered = spend_per_zip | beam.Filter(
> > >         lambda x, counts: counts[x.zip] > threshold,
> > > counts=AsMap(counts_per_zip))
> > >     return filtered
> > >
> > > We now have a composite that has privacy preserving properties (i.e.
> > > the input may be quite sensitive, but the output is not, depending on
> > > the value of threshold). What is interesting here is that it is only
> > > the composite that has this property--no individual sub-transform is
> > > itself privacy preserving. Furthermore, an optimizer may notice we're
> > > doing aggregation on the same key twice and rewrite this using
> > > (logically)
> > >
> > >     GroupBy('zip').aggregate_field('user',
> > > CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> > >
> > > and then applying the filter, which is semantically equivalent and
> > > satisfies the privacy annotations (and notably that does not even
> > > require the optimizer to interpret the annotations, just pass them
> > > on). To me, this implies that these annotations belong on the
> > > composites, and *not* on the leaf nodes (where they would be
> > > incorrect).
> > >
> > > I'll leave aside most questions of API until we figure out the model
> > > semantics, but wanted to throw one possible idea out (though I am
> > > ambivalent about it). Instead of attaching things to transforms, we
> > > can just wrap transforms in composites that have no role other than
> > > declaring information about their contents. E.g. we could have a
> > > composite transform whose payload is simply an assertion of the
> > > privacy (or resource?) properties of its inner structure. This would
> > > be just as expressive as adding new properties to transforms
> > > themselves (but would add an extra level of nesting, and make
> > > respecting the precice nesting more important).
> > >
> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <[email protected]>
> wrote:
> > > >
> > > > +1 to discussing PCollection annotations on a separate thread. It
> would be confusing to mix them up.
> > > >
> > > > -----------
> > > >
> > > > The question around conflicts is interesting, but confusing to me. I
> don't think they exist in general. I keep coming back around to that it
> depends on the annotation and the purpose of composites. Optionality saves
> us here too.
> > > >
> > > > Composites are nothing without their internal hypergraph structure.
> Eventually it comes down to executing the leaf nodes. The alternative to
> executing the leaf nodes is when the composite represents a known transform
> and is replaced by the runner on submission time.  Lets look at each.
> > > >
> > > > If there's a property that only exists on the leaf nodes, then it's
> not possible to bubble up that property to the composite in all cases.
> Afterall, it's not necessarily the case that a privacy preserving transform
> maintains the property for all output edges as not all such edges pass
> through the preserving transform.
> > > >
> > > > On the other hand, with memory or gpu recommendations, that might
> set a low bar on the composite level.
> > > >
> > > > But, composites (any transform really) can be runner replaced. I
> think it's fair to say that a runner replaced composite is not beholden to
> the annotations of the original leaf transforms, especially around physical
> requirements. The implementations are different. If a known composite at
> the composite level requires GPUs and it's known replacement doesn't, I'd
> posit that replacement was a choice the runner made since it can't
> provision machines with GPUs.
> > > >
> > > > But, crucially around privacy annotated transforms, a runner likely
> shouldn't replace a given subgraph that contains a privacy annotationed
> transform unless the replacements provide the same level of privacy.
> However, such replacements only happens with well known transforms with
> known properties anyway, so this can serve as an additional layer of
> validation for a runner aware of the properties.
> > > >
> > > > This brings me back to my position: that the notion of conflicts is
> very annotation dependant, and that defining them as optional is the most
> important feature to avoid issues. Conflicts don't exist as an inherent
> property of annotations on ptransform of the hypergraph structure. Am i
> wrong? No one has come up with an actual example of a conflict as far as i
> understand the thread.
> > > >
> > > > Even Reuven's original question is more about whether the runner is
> forced to look at leaf bodes rather than only looking at the composite.
> Assuming the composite isn't replaced, the runner needs to look at the leaf
> nodes regardless. And as discussed above there's no generalized semantics
> that fit for all kinds of annotations, once replacements are also
> considered.
> > > >
> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <[email protected]>
> wrote:
> > > >>
> > > >> +1 Nice to see there is finally interest on this. Annotations for
> > > >> PTransforms make total sense!
> > > >>
> > > >> The semantics should be strictly optional for runners and correct
> > > >> execution should not be affected by lack of support of any
> annotation.
> > > >> We should however keep the set of annotations small.
> > > >>
> > > >> > PTransforms are hierarchical - namely a PTransform contains other
> PTransforms, and so on. Is the runner expected to resolve all annotations
> down to leaf nodes? What happens if that results in conflicting annotations?
> > > >>
> > > >> +1 to this question, This needs to be detailed.
> > > >>
> > > >> I am curious about how the end user APIs of this will look maybe in
> > > >> Java or Python, just an extra method to inject a Map or via Java
> > > >> annotations/Python decorators?
> > > >>
> > > >> We might prefer not to mix the concepts of annotations and
> > > >> environments because this will limit the scope of annotations.
> > > >> Annotations are different from environments because they serve a
> more
> > > >> general idea: to express an intention and it is up to the runner to
> > > >> choose the strategy to accomplish this, for example in the GPU
> > > >> assignation case it could be to rewrite resource allocation via
> > > >> Environments but it could also just delegate this to a resource
> > > >> manager which is what we could do for example for GPU (or data
> > > >> locality) cases on the Spark/Flink classic runners. If we tie this
> to
> > > >> environments we will leave classic runners out of the loop for no
> > > >> major reason and also not cover use cases not related to resource
> > > >> allocation.
> > > >>
> > > >> I do not understand the use case to justify PCollection annotations
> > > >> but to not mix this thread with them, would you be interested to
> > > >> elaborate more about them in a different thread Jan?
> > > >>
> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <
> [email protected]> wrote:
> > > >> >
> > > >> > I agree things like GPU, high-mem, etc. belong to the
> environment. If
> > > >> > annotations are truly advisory, one can imagine merging
> environments
> > > >> > by taking the union of annotations and still producing a correct
> > > >> > pipeline. (This would mean that annotations would have to be a
> > > >> > multi-map...)
> > > >> >
> > > >> > On the other hand, this doesn't seem to handle the case of privacy
> > > >> > analysis, which could apply to composites without applying to any
> > > >> > individual component, and don't really make sense as part of a
> > > >> > fusion/execution story.
> > > >> >
> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <[email protected]>
> wrote:
> > > >> > >
> > > >> > > That's good historical context.
> > > >> > >
> > > >> > > But then we'd still need to codify the annotation would need to
> be optional, and not affect correctness.
> > > >> > >
> > > >> > > Conflicts become easier to manage, (as environments with
> conflicting annotations simply don't get merged, and stay as distinct
> environments) but are still notionally annotation dependant. Do most
> runners handle environments so individually though?
> > > >> > >
> > > >> > > Reuven's question is a good one though. For the Go SDK, and the
> proposed implementation i saw, they only applied to leaf nodes. This is an
> artifact of how the Go SDK handles composites. Nothing stops it from being
> implemented on the composites Go has, but it didn't make sense otherwise.
> AFAICT Composites are generally for organizational convenience and not for
> functional aspects. Is this wrong? Afterall, does it make sense for
> environments to be on leaves and composites either? It's the same issue
> there.
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <[email protected]>
> wrote:
> > > >> > >>
> > > >> > >> I am +1 to the proposal but believe it should be moved to the
> Environment. I could be convinced otherwise, but would want to really
> understand the details.
> > > >> > >>
> > > >> > >> I think we haven't done a great job communicating the purpose
> of the Environment proto. It was explicitly created for this purpose.
> > > >> > >>
> > > >> > >> 1. It tells the runner things it needs to know to interpret
> the DoFn (or other UDF). So these are the existing proto fields like docker
> image (in the payload) and required artifacts that were staged.
> > > >> > >> 2. It is also the place for additional requirements or hints
> like "high mem" or "GPU" etc.
> > > >> > >>
> > > >> > >> Every user function has an associated environment, and
> environments exist only for the purpose of executing user functions. In
> fact, Environment originated as inline requirements/attributes for a user
> function proto message and was separated just to make the proto smaller.
> > > >> > >>
> > > >> > >> A PTransform is an abstract concept for organizing the graph,
> not an executable thing. So a hint/capability/requirement on a PTransform
> only really makes sense as a scoping mechanism for applying a hint to a
> bunch of functions within a subgraph. This seems like a user interface
> concern and the SDK should own propagating the hints. If the hint truly
> applies to the whole PTransform and *not* the parts, then I am interested
> in learning about that.
> > > >> > >>
> > > >> > >> Kenn
> > > >> > >>
> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <
> [email protected]> wrote:
> > > >> > >>>
> > > >> > >>> That's a good question.
> > > >> > >>>
> > > >> > >>> I think the main difference is a matter of scope. Annotations
> would apply to a PTransform while an environment applies to sets of
> transforms. A difference is the optional nature of the annotations they
> don't affect correctness. Runners don't need to do anything with them and
> still execute the pipeline correctly.
> > > >> > >>>
> > > >> > >>> Consider a privacy analysis on a pipeline graph. An
> annotation indicating that a transform provides a certain level of
> anonymization can be used in an analysis to determine if the downstream
> transforms are encountering raw data or not.
> > > >> > >>>
> > > >> > >>> From my understanding (which can be wrong) environments are
> rigid. Transforms in different environments can't be fused. "This is the
> python env", "this is the java env" can't be merged together. It's not
> clear to me that we have defined when environments are safely fuseable
> outside of equality. There's value in that simplicity.
> > > >> > >>>
> > > >> > >>> AFIACT environment has less to do with the machines a
> pipeline is executing on than it does about the kinds of SDK pipelines it
> understands and can execute.
> > > >> > >>>
> > > >> > >>>
> > > >> > >>>
> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <
> [email protected]> wrote:
> > > >> > >>>>>
> > > >> > >>>>>
> > > >> > >>>>> Another example of an optional annotation is marking a
> transform to run on secure hardware, or to give hints to profiling/dynamic
> analysis tools.
> > > >> > >>>>
> > > >> > >>>>
> > > >> > >>>> There seems to be a lot of overlap between this idea and
> Environments.  Can you talk about how you feel they may be different or
> related?  For example, I could see annotations as a way of tagging
> transforms with an Environment, or I could see Environments becoming a
> specialized form of annotation.
> > > >> > >>>>
> > > >> > >>>> -chad
> > > >> > >>>>
> > >
>

Re: PTransform Annotations Proposal

Reply via email to