I took a stab at implementing Python API for expressing resource hints[1] in line with the discussion in this thread and added a few clarifications to the design [2].
Looking forward to having resource hints in Beam. Thanks! [1] https://github.com/apache/beam/pull/14390 [2] https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing On Tue, Feb 2, 2021 at 9:01 AM Sunil Pedapudi <s...@google.com> wrote: > Thanks very much, Mirac. I added a few comments as well, and the proposal > looks good to me. Thanks for all the discussion. > > On 2021/01/14 20:24:19, Robert Bradshaw <rober...@google.com> wrote: > > Thanks. I added some comments on the proposal. > > > > On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran <mir...@google.com> > > wrote: > > > > > Wrote a design draft for resource-related annotations. Please have a > look: > > > > https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing > > > > > > Cheers, > > > Mirac > > > On 2020/11/26 20:20:09, Mirac Vuslat Basaran <mir...@google.com> > wrote: > > > > Created a PR without unit tests at > > > https://github.com/apache/beam/pull/13434. Please have a look. > > > > > > > > Thanks, > > > > Mirac > > > > > > > > > > > > On 2020/11/25 18:50:19, Robert Burke <rob...@frantil.com> wrote: > > > > > Hmmm. Fair. I'm mostly concerned about the pathological case where > we > > > end > > > > > up with a distinct Environment per transform, but there are likely > > > > > practical cases where that's reasonable (High mem to GPU to TPU, to > > > ARM....) > > > > > > > > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw < > rober...@google.com> > > > wrote: > > > > > > > > > > > I'd like to continue the discussion *and* see an implementation > for > > > > > > the part we've settled on. I was asking why not have "every > distinct > > > > > > physical concern means a distinct environment?" > > > > > > > > > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke < > rob...@frantil.com> > > > wrote: > > > > > > > > > > > > > > Mostly because perfect is the enemy of good enough. We have a > > > proposal, > > > > > > we have clear boundaries for it. It's fine if the discussion > > > continues, but > > > > > > I see no evidence of concerns that should prevent starting an > > > > > > implementation, because it seems we'll need both anyway. > > > > > > > > > > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw < > > > rober...@google.com> > > > > > > wrote: > > > > > > >> > > > > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke < > rob...@frantil.com > > > > > > > > > > wrote: > > > > > > >> > > > > > > > >> > It sounds like we've come to the position that > non-correctness > > > > > > affecting Ptransform Annotations are valuable at both leaf and > > > composite > > > > > > levels, and don't remove the potential need for a similar > feature on > > > > > > Environments, to handle physical concerns equirements for worker > > > processes > > > > > > to have (such as Ram, CPU, or GPU requirements.) > > > > > > >> > > > > > > > >> > Kenn, it's not clear what part of the solution (an > annotation > > > field > > > > > > on the Ptransform proto message) would need to change to satisfy > > > your scope > > > > > > concern, beyond documenting unambiguously that these may not be > used > > > for > > > > > > physical concerns or things that affect correctness. > > > > > > >> > > > > > > >> I'll let Kenn answer as well, but from my point of view, > > > explicitly > > > > > > >> having somewhere better to put these things would help. > > > > > > >> > > > > > > >> > I'm also unclear your scope concern not matching, given the > > > above. > > > > > > Your first paragraph reads very supportive of logical > annotations on > > > > > > Ptransforms, and that matches 1-1 with the current proposed > > > solution. Can > > > > > > you clarify your concern? > > > > > > >> > > > > > > > >> > I don't wish to scope creep on the physical requirements > issue > > > at > > > > > > this time. It seems we are agreed they should end up on > > > environments, but > > > > > > I'm not seeing proposals on the right way to execute them at this > > > time.They > > > > > > seem to be a fruitful topic of discussion, in particular > > > > > > unifying/consolidating them for efficient use of resources. I > don't > > > think > > > > > > we want to end up in a state where every distinct physical > concern > > > means a > > > > > > distinct environment. > > > > > > >> > > > > > > >> Why not? Assuming, of course, that runners are free to merge > > > > > > >> environments (merging those resource hints they understand > and are > > > > > > >> otherwise compatible, and discarding those they don't) for > > > efficient > > > > > > >> execution. > > > > > > >> > > > > > > >> > I for one am ready to see a PR. > > > > > > >> > > > > > > >> +1 > > > > > > >> > > > > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles < > k...@apache.org> > > > > > > wrote: > > > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw < > > > rober...@google.com> > > > > > > wrote: > > > > > > >> >>> > > > > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran < > > > > > > mir...@google.com> wrote: > > > > > > >> >>> > > > > > > > >> >>> > Thanks everyone so much for their input and for the > > > insightful > > > > > > discussion. > > > > > > >> >>> > > > > > > > >> >>> > Not being knowledgeable about Beam's internals, I have > to > > > say I > > > > > > am a bit lost on the PTransform vs. environment discussion. > > > > > > >> >>> > > > > > > > >> >>> > I do agree with Burke's notion that merge rules are very > > > > > > annotation dependent, I don't think we can find a > one-size-fits-all > > > > > > solution for that. So this might be actually be an argument in > > > favour of > > > > > > having annotations on PTransforms, since it avoids the conflation > > > with > > > > > > environments. > > > > > > >> >>> > > > > > > > >> >>> > Also in general, I feel that having annotations per > single > > > > > > transform (rather than composite) and on PTransforms could lead > to a > > > > > > simpler design. > > > > > > >> >>> > > > > > > >> >>> If we want to use these for privacy, I don't see how > > > attaching them > > > > > > to > > > > > > >> >>> leaf transforms alone could work. (Even CombinePerKey is a > > > > > > composite.) > > > > > > >> >>> > > > > > > >> >>> > Seeing as there are valuable arguments in favour of both > > > > > > (PTransform and environments) with no clear(?) "best solution", I > > > would > > > > > > propose moving forward with the initial (PTransform) design to > ship > > > the > > > > > > feature and unblock teams asking for it. If it turns out that > there > > > was > > > > > > indeed a need to have annotations in environments, we could > always > > > refactor > > > > > > it. > > > > > > >> >>> > > > > > > >> >>> I have yet to see any arguments that resource-level hints, > > > such as > > > > > > >> >>> memory or GPU, don't better belong on the environment. But > > > moving > > > > > > >> >>> forward on PTransform-level ones for logical statements > like > > > privacy > > > > > > >> >>> declarations makes sense. > > > > > > >> >> > > > > > > >> >> > > > > > > >> >> Exactly this. Properties of transforms make sense. The > > > properties > > > > > > may hold only of the whole subgraph. Even something as simple as > > > "preserves > > > > > > keys". This is analogous (but converse) to requirements like > > > "requires > > > > > > sorted input" which were explicitly excluded from the scope, > which > > > was > > > > > > about hardware environment for execution. > > > > > > >> >> > > > > > > >> >> The proposed scope and the proposed solution do not match > and > > > need > > > > > > to be reconciled. > > > > > > >> >> > > > > > > >> >> Kenn > > > > > > >> >> > > > > > > >> >>> > On 2020/11/17 19:07:22, Robert Bradshaw < > > > rober...@google.com> > > > > > > wrote: > > > > > > >> >>> > > So far we have two distinct usecases for annotations: > > > resource > > > > > > hints > > > > > > >> >>> > > and privacy directives, and I've been trying to figure > > > out how > > > > > > to > > > > > > >> >>> > > reconcile them, but they seem to have very different > > > > > > characteristics. > > > > > > >> >>> > > (It would be nice to come up with other uses as well > to > > > see if > > > > > > we're > > > > > > >> >>> > > really coming up with a generally useful mode--I think > > > display > > > > > > data > > > > > > >> >>> > > could fit into this as a new kind of annotation rather > > > than > > > > > > being a > > > > > > >> >>> > > top-level property, and it could make sense on both > leaf > > > and > > > > > > composite > > > > > > >> >>> > > transforms.) > > > > > > >> >>> > > > > > > > > >> >>> > > To me, resource hints like GPU are inextricably tied > to > > > the > > > > > > >> >>> > > environment. A transform tagged with GPU should > reference > > > a Fn > > > > > > that > > > > > > >> >>> > > invokes GPU-accelerated code that lives in a > particular > > > > > > environment. > > > > > > >> >>> > > Something like high-mem is a bit squishier. Some DoFns > > > take a > > > > > > lot of > > > > > > >> >>> > > memory, but on the other hand one could imagine > labeling a > > > > > > CoGBK as > > > > > > >> >>> > > high-mem due to knowing that, in this particular > usage, > > > there > > > > > > will be > > > > > > >> >>> > > lots of values with the same key. Ideally runners > would be > > > > > > intelligent > > > > > > >> >>> > > enough to automatically learn memory usage, but even > in > > > this > > > > > > case it > > > > > > >> >>> > > may be a good hint to try and learn the requirements > for > > > DoFn A > > > > > > and > > > > > > >> >>> > > DoFn B separately (which is difficult if they are > always > > > > > > colocated, > > > > > > >> >>> > > but valuable if, e.g. A takes a huge amount of memory > and > > > B > > > > > > takes a > > > > > > >> >>> > > huge amount of wall time). > > > > > > >> >>> > > > > > > > > >> >>> > > Note that tying things to the environment does not > > > preclude > > > > > > using them > > > > > > >> >>> > > in non-portable runners as they'll still have an > SDK-level > > > > > > >> >>> > > representation (though I don't think we should have an > > > explicit > > > > > > goal > > > > > > >> >>> > > of feature parity for non-portable runners, e.g. > > > multi-language > > > > > > isn't > > > > > > >> >>> > > happening, and hope that non-portable runners go away > soon > > > > > > anyway). > > > > > > >> >>> > > > > > > > > >> >>> > > Now let's consider privacy annotations. To make things > > > very > > > > > > concrete, > > > > > > >> >>> > > imagine a transform AverageSpendPerZipCode which > takes as > > > input > > > > > > (user, > > > > > > >> >>> > > zip, spend), all users unique, and returns (zip, > > > avg(spend)). In > > > > > > >> >>> > > Python, this is > GroupBy('zip').aggregate_field('spend', > > > > > > >> >>> > > MeanCombineFn()). This is not very privacy preserving > to > > > those > > > > > > users > > > > > > >> >>> > > who are the only (or one of a few) in a zip code. So > we > > > could > > > > > > define a > > > > > > >> >>> > > transform PrivacyPreservingAverageSpendPerZipCode as > > > > > > >> >>> > > > > > > > > >> >>> > > @ptransform_fn > > > > > > >> >>> > > def > > > PrivacyPreservingAverageSpendPerZipCode(spend_per_user, > > > > > > threshold) > > > > > > >> >>> > > counts_per_zip = spend_per_user | > > > > > > >> >>> > > GroupBy('zip').aggregate_field('user', > CountCombineFn()) > > > > > > >> >>> > > spend_per_zip = spend_per_user | > > > > > > >> >>> > > GroupBy('zip').aggregate_field('spend', > MeanCombineFn()) > > > > > > >> >>> > > filtered = spend_per_zip | beam.Filter( > > > > > > >> >>> > > lambda x, counts: counts[x.zip] > threshold, > > > > > > >> >>> > > counts=AsMap(counts_per_zip)) > > > > > > >> >>> > > return filtered > > > > > > >> >>> > > > > > > > > >> >>> > > We now have a composite that has privacy preserving > > > properties > > > > > > (i.e. > > > > > > >> >>> > > the input may be quite sensitive, but the output is > not, > > > > > > depending on > > > > > > >> >>> > > the value of threshold). What is interesting here is > that > > > it is > > > > > > only > > > > > > >> >>> > > the composite that has this property--no individual > > > > > > sub-transform is > > > > > > >> >>> > > itself privacy preserving. Furthermore, an optimizer > may > > > notice > > > > > > we're > > > > > > >> >>> > > doing aggregation on the same key twice and rewrite > this > > > using > > > > > > >> >>> > > (logically) > > > > > > >> >>> > > > > > > > > >> >>> > > GroupBy('zip').aggregate_field('user', > > > > > > >> >>> > > CountCombineFn()).aggregate_field('spend', > > > MeanCombineFn()) > > > > > > >> >>> > > > > > > > > >> >>> > > and then applying the filter, which is semantically > > > equivalent > > > > > > and > > > > > > >> >>> > > satisfies the privacy annotations (and notably that > does > > > not > > > > > > even > > > > > > >> >>> > > require the optimizer to interpret the annotations, > just > > > pass > > > > > > them > > > > > > >> >>> > > on). To me, this implies that these annotations > belong on > > > the > > > > > > >> >>> > > composites, and *not* on the leaf nodes (where they > would > > > be > > > > > > >> >>> > > incorrect). > > > > > > >> >>> > > > > > > > > >> >>> > > I'll leave aside most questions of API until we figure > > > out the > > > > > > model > > > > > > >> >>> > > semantics, but wanted to throw one possible idea out > > > (though I > > > > > > am > > > > > > >> >>> > > ambivalent about it). Instead of attaching things to > > > > > > transforms, we > > > > > > >> >>> > > can just wrap transforms in composites that have no > role > > > other > > > > > > than > > > > > > >> >>> > > declaring information about their contents. E.g. we > could > > > have a > > > > > > >> >>> > > composite transform whose payload is simply an > assertion > > > of the > > > > > > >> >>> > > privacy (or resource?) properties of its inner > structure. > > > This > > > > > > would > > > > > > >> >>> > > be just as expressive as adding new properties to > > > transforms > > > > > > >> >>> > > themselves (but would add an extra level of nesting, > and > > > make > > > > > > >> >>> > > respecting the precice nesting more important). > > > > > > >> >>> > > > > > > > > >> >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke < > > > > > > rob...@frantil.com> wrote: > > > > > > >> >>> > > > > > > > > > >> >>> > > > +1 to discussing PCollection annotations on a > separate > > > > > > thread. It would be confusing to mix them up. > > > > > > >> >>> > > > > > > > > > >> >>> > > > ----------- > > > > > > >> >>> > > > > > > > > > >> >>> > > > The question around conflicts is interesting, but > > > confusing > > > > > > to me. I don't think they exist in general. I keep coming back > > > around to > > > > > > that it depends on the annotation and the purpose of composites. > > > > > > Optionality saves us here too. > > > > > > >> >>> > > > > > > > > > >> >>> > > > Composites are nothing without their internal > hypergraph > > > > > > structure. Eventually it comes down to executing the leaf nodes. > The > > > > > > alternative to executing the leaf nodes is when the composite > > > represents a > > > > > > known transform and is replaced by the runner on submission time. > > > Lets > > > > > > look at each. > > > > > > >> >>> > > > > > > > > > >> >>> > > > If there's a property that only exists on the leaf > > > nodes, > > > > > > then it's not possible to bubble up that property to the > composite > > > in all > > > > > > cases. Afterall, it's not necessarily the case that a privacy > > > preserving > > > > > > transform maintains the property for all output edges as not all > > > such edges > > > > > > pass through the preserving transform. > > > > > > >> >>> > > > > > > > > > >> >>> > > > On the other hand, with memory or gpu > recommendations, > > > that > > > > > > might set a low bar on the composite level. > > > > > > >> >>> > > > > > > > > > >> >>> > > > But, composites (any transform really) can be runner > > > > > > replaced. I think it's fair to say that a runner replaced > composite > > > is not > > > > > > beholden to the annotations of the original leaf transforms, > > > especially > > > > > > around physical requirements. The implementations are different. > If > > > a known > > > > > > composite at the composite level requires GPUs and it's known > > > replacement > > > > > > doesn't, I'd posit that replacement was a choice the runner made > > > since it > > > > > > can't provision machines with GPUs. > > > > > > >> >>> > > > > > > > > > >> >>> > > > But, crucially around privacy annotated transforms, > a > > > runner > > > > > > likely shouldn't replace a given subgraph that contains a privacy > > > > > > annotationed transform unless the replacements provide the same > > > level of > > > > > > privacy. However, such replacements only happens with well known > > > transforms > > > > > > with known properties anyway, so this can serve as an additional > > > layer of > > > > > > validation for a runner aware of the properties. > > > > > > >> >>> > > > > > > > > > >> >>> > > > This brings me back to my position: that the notion > of > > > > > > conflicts is very annotation dependant, and that defining them as > > > optional > > > > > > is the most important feature to avoid issues. Conflicts don't > exist > > > as an > > > > > > inherent property of annotations on ptransform of the hypergraph > > > structure. > > > > > > Am i wrong? No one has come up with an actual example of a > conflict > > > as far > > > > > > as i understand the thread. > > > > > > >> >>> > > > > > > > > > >> >>> > > > Even Reuven's original question is more about > whether > > > the > > > > > > runner is forced to look at leaf bodes rather than only looking > at > > > the > > > > > > composite. Assuming the composite isn't replaced, the runner > needs > > > to look > > > > > > at the leaf nodes regardless. And as discussed above there's no > > > generalized > > > > > > semantics that fit for all kinds of annotations, once > replacements > > > are also > > > > > > considered. > > > > > > >> >>> > > > > > > > > > >> >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía < > > > ieme...@gmail.com> > > > > > > wrote: > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> +1 Nice to see there is finally interest on this. > > > > > > Annotations for > > > > > > >> >>> > > >> PTransforms make total sense! > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> The semantics should be strictly optional for > runners > > > and > > > > > > correct > > > > > > >> >>> > > >> execution should not be affected by lack of > support of > > > any > > > > > > annotation. > > > > > > >> >>> > > >> We should however keep the set of annotations > small. > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> > PTransforms are hierarchical - namely a > PTransform > > > > > > contains other PTransforms, and so on. Is the runner expected to > > > resolve > > > > > > all annotations down to leaf nodes? What happens if that results > in > > > > > > conflicting annotations? > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> +1 to this question, This needs to be detailed. > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> I am curious about how the end user APIs of this > will > > > look > > > > > > maybe in > > > > > > >> >>> > > >> Java or Python, just an extra method to inject a > Map > > > or via > > > > > > Java > > > > > > >> >>> > > >> annotations/Python decorators? > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> We might prefer not to mix the concepts of > annotations > > > and > > > > > > >> >>> > > >> environments because this will limit the scope of > > > > > > annotations. > > > > > > >> >>> > > >> Annotations are different from environments because > > > they > > > > > > serve a more > > > > > > >> >>> > > >> general idea: to express an intention and it is up > to > > > the > > > > > > runner to > > > > > > >> >>> > > >> choose the strategy to accomplish this, for > example in > > > the > > > > > > GPU > > > > > > >> >>> > > >> assignation case it could be to rewrite resource > > > allocation > > > > > > via > > > > > > >> >>> > > >> Environments but it could also just delegate this > to a > > > > > > resource > > > > > > >> >>> > > >> manager which is what we could do for example for > GPU > > > (or > > > > > > data > > > > > > >> >>> > > >> locality) cases on the Spark/Flink classic > runners. If > > > we > > > > > > tie this to > > > > > > >> >>> > > >> environments we will leave classic runners out of > the > > > loop > > > > > > for no > > > > > > >> >>> > > >> major reason and also not cover use cases not > related > > > to > > > > > > resource > > > > > > >> >>> > > >> allocation. > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> I do not understand the use case to justify > PCollection > > > > > > annotations > > > > > > >> >>> > > >> but to not mix this thread with them, would you be > > > > > > interested to > > > > > > >> >>> > > >> elaborate more about them in a different thread > Jan? > > > > > > >> >>> > > >> > > > > > > >> >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw < > > > > > > rober...@google.com> wrote: > > > > > > >> >>> > > >> > > > > > > > >> >>> > > >> > I agree things like GPU, high-mem, etc. belong > to the > > > > > > environment. If > > > > > > >> >>> > > >> > annotations are truly advisory, one can imagine > > > merging > > > > > > environments > > > > > > >> >>> > > >> > by taking the union of annotations and still > > > producing a > > > > > > correct > > > > > > >> >>> > > >> > pipeline. (This would mean that annotations would > > > have to > > > > > > be a > > > > > > >> >>> > > >> > multi-map...) > > > > > > >> >>> > > >> > > > > > > > >> >>> > > >> > On the other hand, this doesn't seem to handle > the > > > case of > > > > > > privacy > > > > > > >> >>> > > >> > analysis, which could apply to composites without > > > applying > > > > > > to any > > > > > > >> >>> > > >> > individual component, and don't really make > sense as > > > part > > > > > > of a > > > > > > >> >>> > > >> > fusion/execution story. > > > > > > >> >>> > > >> > > > > > > > >> >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke < > > > > > > rob...@frantil.com> wrote: > > > > > > >> >>> > > >> > > > > > > > > >> >>> > > >> > > That's good historical context. > > > > > > >> >>> > > >> > > > > > > > > >> >>> > > >> > > But then we'd still need to codify the > annotation > > > would > > > > > > need to be optional, and not affect correctness. > > > > > > >> >>> > > >> > > > > > > > > >> >>> > > >> > > Conflicts become easier to manage, (as > > > environments with > > > > > > conflicting annotations simply don't get merged, and stay as > distinct > > > > > > environments) but are still notionally annotation dependant. Do > most > > > > > > runners handle environments so individually though? > > > > > > >> >>> > > >> > > > > > > > > >> >>> > > >> > > Reuven's question is a good one though. For > the Go > > > SDK, > > > > > > and the proposed implementation i saw, they only applied to leaf > > > nodes. > > > > > > This is an artifact of how the Go SDK handles composites. Nothing > > > stops it > > > > > > from being implemented on the composites Go has, but it didn't > make > > > sense > > > > > > otherwise. AFAICT Composites are generally for organizational > > > convenience > > > > > > and not for functional aspects. Is this wrong? Afterall, does it > > > make sense > > > > > > for environments to be on leaves and composites either? It's the > > > same issue > > > > > > there. > > > > > > >> >>> > > >> > > > > > > > > >> >>> > > >> > > > > > > > > >> >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles < > > > > > > k...@apache.org> wrote: > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> I am +1 to the proposal but believe it should > be > > > moved > > > > > > to the Environment. I could be convinced otherwise, but would > want to > > > > > > really understand the details. > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> I think we haven't done a great job > communicating > > > the > > > > > > purpose of the Environment proto. It was explicitly created for > this > > > > > > purpose. > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> 1. It tells the runner things it needs to > know to > > > > > > interpret the DoFn (or other UDF). So these are the existing > proto > > > fields > > > > > > like docker image (in the payload) and required artifacts that > were > > > staged. > > > > > > >> >>> > > >> > >> 2. It is also the place for additional > > > requirements or > > > > > > hints like "high mem" or "GPU" etc. > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> Every user function has an associated > > > environment, and > > > > > > environments exist only for the purpose of executing user > functions. > > > In > > > > > > fact, Environment originated as inline requirements/attributes > for a > > > user > > > > > > function proto message and was separated just to make the proto > > > smaller. > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> A PTransform is an abstract concept for > > > organizing the > > > > > > graph, not an executable thing. So a hint/capability/requirement > on a > > > > > > PTransform only really makes sense as a scoping mechanism for > > > applying a > > > > > > hint to a bunch of functions within a subgraph. This seems like a > > > user > > > > > > interface concern and the SDK should own propagating the hints. > If > > > the hint > > > > > > truly applies to the whole PTransform and *not* the parts, then > I am > > > > > > interested in learning about that. > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> Kenn > > > > > > >> >>> > > >> > >> > > > > > > >> >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke > < > > > > > > rob...@frantil.com> wrote: > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> That's a good question. > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> I think the main difference is a matter of > scope. > > > > > > Annotations would apply to a PTransform while an environment > applies > > > to > > > > > > sets of transforms. A difference is the optional nature of the > > > annotations > > > > > > they don't affect correctness. Runners don't need to do anything > > > with them > > > > > > and still execute the pipeline correctly. > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> Consider a privacy analysis on a pipeline > graph. > > > An > > > > > > annotation indicating that a transform provides a certain level > of > > > > > > anonymization can be used in an analysis to determine if the > > > downstream > > > > > > transforms are encountering raw data or not. > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> From my understanding (which can be wrong) > > > > > > environments are rigid. Transforms in different environments > can't be > > > > > > fused. "This is the python env", "this is the java env" can't be > > > merged > > > > > > together. It's not clear to me that we have defined when > > > environments are > > > > > > safely fuseable outside of equality. There's value in that > > > simplicity. > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> AFIACT environment has less to do with the > > > machines a > > > > > > pipeline is executing on than it does about the kinds of SDK > > > pipelines it > > > > > > understands and can execute. > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> > > > > > > >> >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova > < > > > > > > chad...@gmail.com> wrote: > > > > > > >> >>> > > >> > >>>>> > > > > > > >> >>> > > >> > >>>>> > > > > > > >> >>> > > >> > >>>>> Another example of an optional annotation > is > > > marking > > > > > > a transform to run on secure hardware, or to give hints to > > > > > > profiling/dynamic analysis tools. > > > > > > >> >>> > > >> > >>>> > > > > > > >> >>> > > >> > >>>> > > > > > > >> >>> > > >> > >>>> There seems to be a lot of overlap between > this > > > idea > > > > > > and Environments. Can you talk about how you feel they may be > > > different or > > > > > > related? For example, I could see annotations as a way of > tagging > > > > > > transforms with an Environment, or I could see Environments > becoming > > > a > > > > > > specialized form of annotation. > > > > > > >> >>> > > >> > >>>> > > > > > > >> >>> > > >> > >>>> -chad > > > > > > >> >>> > > >> > >>>> > > > > > > >> >>> > > > > > > > > > > > > > > > > > > > > > > >