Re: PTransform Annotations Proposal

2021-04-15 Thread Valentyn Tymofieiev
I took a stab at implementing Python API for expressing resource hints[1]
in line with the discussion in this thread and added a few clarifications
to the design [2].

Looking forward to having resource hints in Beam.

Thanks!

[1] https://github.com/apache/beam/pull/14390
[2]
https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing




On Tue, Feb 2, 2021 at 9:01 AM Sunil Pedapudi  wrote:

> Thanks very much, Mirac. I added a few comments as well, and the proposal
> looks good to me. Thanks for all the discussion.
>
> On 2021/01/14 20:24:19, Robert Bradshaw  wrote:
> > Thanks. I added some comments on the proposal.
> >
> > On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran 
> > wrote:
> >
> > > Wrote a design draft for resource-related annotations. Please have a
> look:
> > >
> https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing
> > >
> > > Cheers,
> > > Mirac
> > > On 2020/11/26 20:20:09, Mirac Vuslat Basaran 
> wrote:
> > > > Created a PR without unit tests at
> > > https://github.com/apache/beam/pull/13434. Please have a look.
> > > >
> > > > Thanks,
> > > > Mirac
> > > >
> > > >
> > > > On 2020/11/25 18:50:19, Robert Burke  wrote:
> > > > > Hmmm. Fair. I'm mostly concerned about the pathological case where
> we
> > > end
> > > > > up with a distinct Environment per transform, but there are likely
> > > > > practical cases where that's reasonable (High mem to GPU to TPU, to
> > > ARM)
> > > > >
> > > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw <
> rober...@google.com>
> > > wrote:
> > > > >
> > > > > > I'd like to continue the discussion *and* see an implementation
> for
> > > > > > the part we've settled on. I was asking why not have "every
> distinct
> > > > > > physical concern means a distinct environment?"
> > > > > >
> > > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke <
> rob...@frantil.com>
> > > wrote:
> > > > > > >
> > > > > > > Mostly because perfect is the enemy of good enough. We have a
> > > proposal,
> > > > > > we have clear boundaries for it. It's fine if the discussion
> > > continues, but
> > > > > > I see no evidence of concerns that should prevent starting an
> > > > > > implementation, because it seems we'll need both anyway.
> > > > > > >
> > > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <
> > > rober...@google.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <
> rob...@frantil.com
> > > >
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> > It sounds like we've come to the position that
> non-correctness
> > > > > > affecting Ptransform Annotations are valuable at both leaf and
> > > composite
> > > > > > levels, and don't remove the potential need for a similar
> feature on
> > > > > > Environments, to handle physical concerns equirements for worker
> > > processes
> > > > > > to have (such as Ram, CPU, or GPU requirements.)
> > > > > > >> >
> > > > > > >> > Kenn, it's not clear what part of the solution (an
> annotation
> > > field
> > > > > > on the Ptransform proto message) would need to change to satisfy
> > > your scope
> > > > > > concern, beyond documenting unambiguously that these may not be
> used
> > > for
> > > > > > physical concerns or things that affect correctness.
> > > > > > >>
> > > > > > >> I'll let Kenn answer as well, but from my point of view,
> > > explicitly
> > > > > > >> having somewhere better to put these things would help.
> > > > > > >>
> > > > > > >> > I'm also unclear your scope concern not matching, given the
> > > above.
> > > > > > Your first paragraph reads very supportive of logical
> annotations on
> > > > > > Ptransforms, and that matches 1-1 with the current proposed
> > > solution. Can
> > > > > > you clarify your concern?
> > > > > > >> >
> > > > > > >> > I don't wish to scope creep on the physical requirements
> issue
> > > at
> > > > > > this time. It seems we are agreed they should end up on
> > > environments, but
> > > > > > I'm not seeing proposals on the right way to execute them at this
> > > time.They
> > > > > > seem to be a fruitful topic of discussion, in particular
> > > > > > unifying/consolidating them for efficient use of resources. I
> don't
> > > think
> > > > > > we want to end up in a state where every distinct physical
> concern
> > > means a
> > > > > > distinct environment.
> > > > > > >>
> > > > > > >> Why not? Assuming, of course, that runners are free to merge
> > > > > > >> environments (merging those resource hints they understand
> and are
> > > > > > >> otherwise compatible, and discarding those they don't) for
> > > efficient
> > > > > > >> execution.
> > > > > > >>
> > > > > > >> > I for one am ready to see a PR.
> > > > > > >>
> > > > > > >> +1
> > > > > > >>
> > > > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <
> k...@apache.org>
> > > > > > wrote:
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> On Mon, 

Re: PTransform Annotations Proposal

2021-02-02 Thread Sunil Pedapudi
Thanks very much, Mirac. I added a few comments as well, and the proposal looks 
good to me. Thanks for all the discussion.

On 2021/01/14 20:24:19, Robert Bradshaw  wrote: 
> Thanks. I added some comments on the proposal.
> 
> On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran 
> wrote:
> 
> > Wrote a design draft for resource-related annotations. Please have a look:
> > https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing
> >
> > Cheers,
> > Mirac
> > On 2020/11/26 20:20:09, Mirac Vuslat Basaran  wrote:
> > > Created a PR without unit tests at
> > https://github.com/apache/beam/pull/13434. Please have a look.
> > >
> > > Thanks,
> > > Mirac
> > >
> > >
> > > On 2020/11/25 18:50:19, Robert Burke  wrote:
> > > > Hmmm. Fair. I'm mostly concerned about the pathological case where we
> > end
> > > > up with a distinct Environment per transform, but there are likely
> > > > practical cases where that's reasonable (High mem to GPU to TPU, to
> > ARM)
> > > >
> > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw 
> > wrote:
> > > >
> > > > > I'd like to continue the discussion *and* see an implementation for
> > > > > the part we've settled on. I was asking why not have "every distinct
> > > > > physical concern means a distinct environment?"
> > > > >
> > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke 
> > wrote:
> > > > > >
> > > > > > Mostly because perfect is the enemy of good enough. We have a
> > proposal,
> > > > > we have clear boundaries for it. It's fine if the discussion
> > continues, but
> > > > > I see no evidence of concerns that should prevent starting an
> > > > > implementation, because it seems we'll need both anyway.
> > > > > >
> > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <
> > rober...@google.com>
> > > > > wrote:
> > > > > >>
> > > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke  > >
> > > > > wrote:
> > > > > >> >
> > > > > >> > It sounds like we've come to the position that non-correctness
> > > > > affecting Ptransform Annotations are valuable at both leaf and
> > composite
> > > > > levels, and don't remove the potential need for a similar feature on
> > > > > Environments, to handle physical concerns equirements for worker
> > processes
> > > > > to have (such as Ram, CPU, or GPU requirements.)
> > > > > >> >
> > > > > >> > Kenn, it's not clear what part of the solution (an annotation
> > field
> > > > > on the Ptransform proto message) would need to change to satisfy
> > your scope
> > > > > concern, beyond documenting unambiguously that these may not be used
> > for
> > > > > physical concerns or things that affect correctness.
> > > > > >>
> > > > > >> I'll let Kenn answer as well, but from my point of view,
> > explicitly
> > > > > >> having somewhere better to put these things would help.
> > > > > >>
> > > > > >> > I'm also unclear your scope concern not matching, given the
> > above.
> > > > > Your first paragraph reads very supportive of logical annotations on
> > > > > Ptransforms, and that matches 1-1 with the current proposed
> > solution. Can
> > > > > you clarify your concern?
> > > > > >> >
> > > > > >> > I don't wish to scope creep on the physical requirements issue
> > at
> > > > > this time. It seems we are agreed they should end up on
> > environments, but
> > > > > I'm not seeing proposals on the right way to execute them at this
> > time.They
> > > > > seem to be a fruitful topic of discussion, in particular
> > > > > unifying/consolidating them for efficient use of resources. I don't
> > think
> > > > > we want to end up in a state where every distinct physical concern
> > means a
> > > > > distinct environment.
> > > > > >>
> > > > > >> Why not? Assuming, of course, that runners are free to merge
> > > > > >> environments (merging those resource hints they understand and are
> > > > > >> otherwise compatible, and discarding those they don't) for
> > efficient
> > > > > >> execution.
> > > > > >>
> > > > > >> > I for one am ready to see a PR.
> > > > > >>
> > > > > >> +1
> > > > > >>
> > > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> > > > > wrote:
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <
> > rober...@google.com>
> > > > > wrote:
> > > > > >> >>>
> > > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > > > mir...@google.com> wrote:
> > > > > >> >>> >
> > > > > >> >>> > Thanks everyone so much for their input and for the
> > insightful
> > > > > discussion.
> > > > > >> >>> >
> > > > > >> >>> > Not being knowledgeable about Beam's internals, I have to
> > say I
> > > > > am a bit lost on the PTransform vs. environment discussion.
> > > > > >> >>> >
> > > > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > > > annotation dependent, I don't think we can find a one-size-fits-all
> > > > > solution for that. So this might be actually be an argument in
> > favour of

Re: PTransform Annotations Proposal

2021-01-14 Thread Robert Bradshaw
Thanks. I added some comments on the proposal.

On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran 
wrote:

> Wrote a design draft for resource-related annotations. Please have a look:
> https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing
>
> Cheers,
> Mirac
> On 2020/11/26 20:20:09, Mirac Vuslat Basaran  wrote:
> > Created a PR without unit tests at
> https://github.com/apache/beam/pull/13434. Please have a look.
> >
> > Thanks,
> > Mirac
> >
> >
> > On 2020/11/25 18:50:19, Robert Burke  wrote:
> > > Hmmm. Fair. I'm mostly concerned about the pathological case where we
> end
> > > up with a distinct Environment per transform, but there are likely
> > > practical cases where that's reasonable (High mem to GPU to TPU, to
> ARM)
> > >
> > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw 
> wrote:
> > >
> > > > I'd like to continue the discussion *and* see an implementation for
> > > > the part we've settled on. I was asking why not have "every distinct
> > > > physical concern means a distinct environment?"
> > > >
> > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke 
> wrote:
> > > > >
> > > > > Mostly because perfect is the enemy of good enough. We have a
> proposal,
> > > > we have clear boundaries for it. It's fine if the discussion
> continues, but
> > > > I see no evidence of concerns that should prevent starting an
> > > > implementation, because it seems we'll need both anyway.
> > > > >
> > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <
> rober...@google.com>
> > > > wrote:
> > > > >>
> > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke  >
> > > > wrote:
> > > > >> >
> > > > >> > It sounds like we've come to the position that non-correctness
> > > > affecting Ptransform Annotations are valuable at both leaf and
> composite
> > > > levels, and don't remove the potential need for a similar feature on
> > > > Environments, to handle physical concerns equirements for worker
> processes
> > > > to have (such as Ram, CPU, or GPU requirements.)
> > > > >> >
> > > > >> > Kenn, it's not clear what part of the solution (an annotation
> field
> > > > on the Ptransform proto message) would need to change to satisfy
> your scope
> > > > concern, beyond documenting unambiguously that these may not be used
> for
> > > > physical concerns or things that affect correctness.
> > > > >>
> > > > >> I'll let Kenn answer as well, but from my point of view,
> explicitly
> > > > >> having somewhere better to put these things would help.
> > > > >>
> > > > >> > I'm also unclear your scope concern not matching, given the
> above.
> > > > Your first paragraph reads very supportive of logical annotations on
> > > > Ptransforms, and that matches 1-1 with the current proposed
> solution. Can
> > > > you clarify your concern?
> > > > >> >
> > > > >> > I don't wish to scope creep on the physical requirements issue
> at
> > > > this time. It seems we are agreed they should end up on
> environments, but
> > > > I'm not seeing proposals on the right way to execute them at this
> time.They
> > > > seem to be a fruitful topic of discussion, in particular
> > > > unifying/consolidating them for efficient use of resources. I don't
> think
> > > > we want to end up in a state where every distinct physical concern
> means a
> > > > distinct environment.
> > > > >>
> > > > >> Why not? Assuming, of course, that runners are free to merge
> > > > >> environments (merging those resource hints they understand and are
> > > > >> otherwise compatible, and discarding those they don't) for
> efficient
> > > > >> execution.
> > > > >>
> > > > >> > I for one am ready to see a PR.
> > > > >>
> > > > >> +1
> > > > >>
> > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> > > > wrote:
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <
> rober...@google.com>
> > > > wrote:
> > > > >> >>>
> > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > > mir...@google.com> wrote:
> > > > >> >>> >
> > > > >> >>> > Thanks everyone so much for their input and for the
> insightful
> > > > discussion.
> > > > >> >>> >
> > > > >> >>> > Not being knowledgeable about Beam's internals, I have to
> say I
> > > > am a bit lost on the PTransform vs. environment discussion.
> > > > >> >>> >
> > > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > > annotation dependent, I don't think we can find a one-size-fits-all
> > > > solution for that. So this might be actually be an argument in
> favour of
> > > > having annotations on PTransforms, since it avoids the conflation
> with
> > > > environments.
> > > > >> >>> >
> > > > >> >>> > Also in general, I feel that having annotations per single
> > > > transform (rather than composite) and on PTransforms could lead to a
> > > > simpler design.
> > > > >> >>>
> > > > >> >>> If we want to use these for privacy, I don't see how
> attaching them
> > > > to
> > > > >> 

Re: PTransform Annotations Proposal

2021-01-13 Thread Mirac Vuslat Basaran
Wrote a design draft for resource-related annotations. Please have a look: 
https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing

Cheers,
Mirac
On 2020/11/26 20:20:09, Mirac Vuslat Basaran  wrote: 
> Created a PR without unit tests at https://github.com/apache/beam/pull/13434. 
> Please have a look.
> 
> Thanks,
> Mirac
> 
> 
> On 2020/11/25 18:50:19, Robert Burke  wrote: 
> > Hmmm. Fair. I'm mostly concerned about the pathological case where we end
> > up with a distinct Environment per transform, but there are likely
> > practical cases where that's reasonable (High mem to GPU to TPU, to ARM)
> > 
> > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw  wrote:
> > 
> > > I'd like to continue the discussion *and* see an implementation for
> > > the part we've settled on. I was asking why not have "every distinct
> > > physical concern means a distinct environment?"
> > >
> > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke  wrote:
> > > >
> > > > Mostly because perfect is the enemy of good enough. We have a proposal,
> > > we have clear boundaries for it. It's fine if the discussion continues, 
> > > but
> > > I see no evidence of concerns that should prevent starting an
> > > implementation, because it seems we'll need both anyway.
> > > >
> > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw 
> > > wrote:
> > > >>
> > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke 
> > > wrote:
> > > >> >
> > > >> > It sounds like we've come to the position that non-correctness
> > > affecting Ptransform Annotations are valuable at both leaf and composite
> > > levels, and don't remove the potential need for a similar feature on
> > > Environments, to handle physical concerns equirements for worker processes
> > > to have (such as Ram, CPU, or GPU requirements.)
> > > >> >
> > > >> > Kenn, it's not clear what part of the solution (an annotation field
> > > on the Ptransform proto message) would need to change to satisfy your 
> > > scope
> > > concern, beyond documenting unambiguously that these may not be used for
> > > physical concerns or things that affect correctness.
> > > >>
> > > >> I'll let Kenn answer as well, but from my point of view, explicitly
> > > >> having somewhere better to put these things would help.
> > > >>
> > > >> > I'm also unclear your scope concern not matching, given the above.
> > > Your first paragraph reads very supportive of logical annotations on
> > > Ptransforms, and that matches 1-1 with the current proposed solution. Can
> > > you clarify your concern?
> > > >> >
> > > >> > I don't wish to scope creep on the physical requirements issue at
> > > this time. It seems we are agreed they should end up on environments, but
> > > I'm not seeing proposals on the right way to execute them at this 
> > > time.They
> > > seem to be a fruitful topic of discussion, in particular
> > > unifying/consolidating them for efficient use of resources. I don't think
> > > we want to end up in a state where every distinct physical concern means a
> > > distinct environment.
> > > >>
> > > >> Why not? Assuming, of course, that runners are free to merge
> > > >> environments (merging those resource hints they understand and are
> > > >> otherwise compatible, and discarding those they don't) for efficient
> > > >> execution.
> > > >>
> > > >> > I for one am ready to see a PR.
> > > >>
> > > >> +1
> > > >>
> > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> > > wrote:
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> > > >> >> 
> > > wrote:
> > > >> >>>
> > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > mir...@google.com> wrote:
> > > >> >>> >
> > > >> >>> > Thanks everyone so much for their input and for the insightful
> > > discussion.
> > > >> >>> >
> > > >> >>> > Not being knowledgeable about Beam's internals, I have to say I
> > > am a bit lost on the PTransform vs. environment discussion.
> > > >> >>> >
> > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > annotation dependent, I don't think we can find a one-size-fits-all
> > > solution for that. So this might be actually be an argument in favour of
> > > having annotations on PTransforms, since it avoids the conflation with
> > > environments.
> > > >> >>> >
> > > >> >>> > Also in general, I feel that having annotations per single
> > > transform (rather than composite) and on PTransforms could lead to a
> > > simpler design.
> > > >> >>>
> > > >> >>> If we want to use these for privacy, I don't see how attaching them
> > > to
> > > >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> > > composite.)
> > > >> >>>
> > > >> >>> > Seeing as there are valuable arguments in favour of both
> > > (PTransform and environments) with no clear(?) "best solution", I would
> > > propose moving forward with the initial (PTransform) design to ship the
> > > feature and unblock teams asking for it. If 

Re: PTransform Annotations Proposal

2020-11-30 Thread Kenneth Knowles
On Wed, Nov 25, 2020 at 10:25 AM Robert Bradshaw 
wrote:

> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke  wrote:
> >
> > It sounds like we've come to the position that non-correctness affecting
> Ptransform Annotations are valuable at both leaf and composite levels, and
> don't remove the potential need for a similar feature on Environments, to
> handle physical concerns equirements for worker processes to have (such as
> Ram, CPU, or GPU requirements.)
> >
> > Kenn, it's not clear what part of the solution (an annotation field on
> the Ptransform proto message) would need to change to satisfy your scope
> concern, beyond documenting unambiguously that these may not be used for
> physical concerns or things that affect correctness.
>
> I'll let Kenn answer as well, but from my point of view, explicitly
> having somewhere better to put these things would help.
>

I mean that the primary type of use case for PTransform-level annotations
(properties of a PTransform) was explicitly excluded from scope in the
initial email, while the primary type of use case for Environment (resource
hints / how to execute a DoFn) was used in the motivating examples. In
subsequent discussion, privacy properties have been used as a motivating
example. I honestly don't know which of these is hypothetical and which is
actually what Mirac is trying to get done.

I'm +1 generally on adding something to the model to indicate logical
properties of a PTransform/subgraph. I would be disappointed to see this
used for resource hints.

I do think that "annotation" being so generic makes it a tempting way to do
things that you should do a different way, or shouldn't do at all, while
also limiting the ability to analyze their structure. For example,
properties of a PTransform can be covariant ("output is stable
PCollection") or contravariant ("requires stable input") or
preservation/parametric (various forms of "privacy preserving"). This
structure is not reflected in the feature, precluding designs that take
advantage of it (IIRC Luke had a clever proposal for safely having
runner-opaque contravariant properties). That's OK. Sometimes it is a good
way forward to just add a generic escape hatch. It is analogous to the
difference between a CLI offering --foobizzle=3 versus a CLI offering
--options='{"foobizzle":"3"}'. The former is almost always preferable, once
you have sorted out some clear and stable idea, but starting with the
latter is also a fine way to discover the structure you couldn't identify
(or didn't seem that important) up front.


>
> > I'm also unclear your scope concern not matching, given the above. Your
> first paragraph reads very supportive of logical annotations on
> Ptransforms, and that matches 1-1 with the current proposed solution. Can
> you clarify your concern?
> >
> > I don't wish to scope creep on the physical requirements issue at this
> time. It seems we are agreed they should end up on environments, but I'm
> not seeing proposals on the right way to execute them at this time.They
> seem to be a fruitful topic of discussion, in particular
> unifying/consolidating them for efficient use of resources. I don't think
> we want to end up in a state where every distinct physical concern means a
> distinct environment.
>
> Why not? Assuming, of course, that runners are free to merge
> environments (merging those resource hints they understand and are
> otherwise compatible, and discarding those they don't) for efficient
> execution.
>

+1 the Environment proto is intended as a description of requirements. A
single VM+container configuration can meet many sets of requirements. Some
will need to be unioned (specialized hardware / available libraries) while
others may be additive (RAM, etc) and yet others may be simulated (using
SDK harness vs directly calling a function in process if you can). That's
one reason why we have a structured representation.

Kenn

> I for one am ready to see a PR.
>
> +1
>
> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles  wrote:
> >>
> >>
> >>
> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> wrote:
> >>>
> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> mir...@google.com> wrote:
> >>> >
> >>> > Thanks everyone so much for their input and for the insightful
> discussion.
> >>> >
> >>> > Not being knowledgeable about Beam's internals, I have to say I am a
> bit lost on the PTransform vs. environment discussion.
> >>> >
> >>> > I do agree with Burke's notion that merge rules are very annotation
> dependent, I don't think we can find a one-size-fits-all solution for that.
> So this might be actually be an argument in favour of having annotations on
> PTransforms, since it avoids the conflation with environments.
> >>> >
> >>> > Also in general, I feel that having annotations per single transform
> (rather than composite) and on PTransforms could lead to a simpler design.
> >>>
> >>> If we want to use these for privacy, I don't see how attaching them to
> >>> leaf 

Re: PTransform Annotations Proposal

2020-11-30 Thread Mirac Vuslat Basaran
Created a PR without unit tests at https://github.com/apache/beam/pull/13434. 
Please have a look.

Thanks,
Mirac


On 2020/11/25 18:50:19, Robert Burke  wrote: 
> Hmmm. Fair. I'm mostly concerned about the pathological case where we end
> up with a distinct Environment per transform, but there are likely
> practical cases where that's reasonable (High mem to GPU to TPU, to ARM)
> 
> On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw  wrote:
> 
> > I'd like to continue the discussion *and* see an implementation for
> > the part we've settled on. I was asking why not have "every distinct
> > physical concern means a distinct environment?"
> >
> > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke  wrote:
> > >
> > > Mostly because perfect is the enemy of good enough. We have a proposal,
> > we have clear boundaries for it. It's fine if the discussion continues, but
> > I see no evidence of concerns that should prevent starting an
> > implementation, because it seems we'll need both anyway.
> > >
> > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw 
> > wrote:
> > >>
> > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke 
> > wrote:
> > >> >
> > >> > It sounds like we've come to the position that non-correctness
> > affecting Ptransform Annotations are valuable at both leaf and composite
> > levels, and don't remove the potential need for a similar feature on
> > Environments, to handle physical concerns equirements for worker processes
> > to have (such as Ram, CPU, or GPU requirements.)
> > >> >
> > >> > Kenn, it's not clear what part of the solution (an annotation field
> > on the Ptransform proto message) would need to change to satisfy your scope
> > concern, beyond documenting unambiguously that these may not be used for
> > physical concerns or things that affect correctness.
> > >>
> > >> I'll let Kenn answer as well, but from my point of view, explicitly
> > >> having somewhere better to put these things would help.
> > >>
> > >> > I'm also unclear your scope concern not matching, given the above.
> > Your first paragraph reads very supportive of logical annotations on
> > Ptransforms, and that matches 1-1 with the current proposed solution. Can
> > you clarify your concern?
> > >> >
> > >> > I don't wish to scope creep on the physical requirements issue at
> > this time. It seems we are agreed they should end up on environments, but
> > I'm not seeing proposals on the right way to execute them at this time.They
> > seem to be a fruitful topic of discussion, in particular
> > unifying/consolidating them for efficient use of resources. I don't think
> > we want to end up in a state where every distinct physical concern means a
> > distinct environment.
> > >>
> > >> Why not? Assuming, of course, that runners are free to merge
> > >> environments (merging those resource hints they understand and are
> > >> otherwise compatible, and discarding those they don't) for efficient
> > >> execution.
> > >>
> > >> > I for one am ready to see a PR.
> > >>
> > >> +1
> > >>
> > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> > wrote:
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> > wrote:
> > >> >>>
> > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > mir...@google.com> wrote:
> > >> >>> >
> > >> >>> > Thanks everyone so much for their input and for the insightful
> > discussion.
> > >> >>> >
> > >> >>> > Not being knowledgeable about Beam's internals, I have to say I
> > am a bit lost on the PTransform vs. environment discussion.
> > >> >>> >
> > >> >>> > I do agree with Burke's notion that merge rules are very
> > annotation dependent, I don't think we can find a one-size-fits-all
> > solution for that. So this might be actually be an argument in favour of
> > having annotations on PTransforms, since it avoids the conflation with
> > environments.
> > >> >>> >
> > >> >>> > Also in general, I feel that having annotations per single
> > transform (rather than composite) and on PTransforms could lead to a
> > simpler design.
> > >> >>>
> > >> >>> If we want to use these for privacy, I don't see how attaching them
> > to
> > >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> > composite.)
> > >> >>>
> > >> >>> > Seeing as there are valuable arguments in favour of both
> > (PTransform and environments) with no clear(?) "best solution", I would
> > propose moving forward with the initial (PTransform) design to ship the
> > feature and unblock teams asking for it. If it turns out that there was
> > indeed a need to have annotations in environments, we could always refactor
> > it.
> > >> >>>
> > >> >>> I have yet to see any arguments that resource-level hints, such as
> > >> >>> memory or GPU, don't better belong on the environment. But moving
> > >> >>> forward on PTransform-level ones for logical statements like privacy
> > >> >>> declarations makes sense.
> > >> >>
> > >> >>
> > >> >> Exactly this. Properties of transforms make sense. 

Re: PTransform Annotations Proposal

2020-11-25 Thread Robert Burke
Hmmm. Fair. I'm mostly concerned about the pathological case where we end
up with a distinct Environment per transform, but there are likely
practical cases where that's reasonable (High mem to GPU to TPU, to ARM)

On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw  wrote:

> I'd like to continue the discussion *and* see an implementation for
> the part we've settled on. I was asking why not have "every distinct
> physical concern means a distinct environment?"
>
> On Wed, Nov 25, 2020 at 10:38 AM Robert Burke  wrote:
> >
> > Mostly because perfect is the enemy of good enough. We have a proposal,
> we have clear boundaries for it. It's fine if the discussion continues, but
> I see no evidence of concerns that should prevent starting an
> implementation, because it seems we'll need both anyway.
> >
> > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw 
> wrote:
> >>
> >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke 
> wrote:
> >> >
> >> > It sounds like we've come to the position that non-correctness
> affecting Ptransform Annotations are valuable at both leaf and composite
> levels, and don't remove the potential need for a similar feature on
> Environments, to handle physical concerns equirements for worker processes
> to have (such as Ram, CPU, or GPU requirements.)
> >> >
> >> > Kenn, it's not clear what part of the solution (an annotation field
> on the Ptransform proto message) would need to change to satisfy your scope
> concern, beyond documenting unambiguously that these may not be used for
> physical concerns or things that affect correctness.
> >>
> >> I'll let Kenn answer as well, but from my point of view, explicitly
> >> having somewhere better to put these things would help.
> >>
> >> > I'm also unclear your scope concern not matching, given the above.
> Your first paragraph reads very supportive of logical annotations on
> Ptransforms, and that matches 1-1 with the current proposed solution. Can
> you clarify your concern?
> >> >
> >> > I don't wish to scope creep on the physical requirements issue at
> this time. It seems we are agreed they should end up on environments, but
> I'm not seeing proposals on the right way to execute them at this time.They
> seem to be a fruitful topic of discussion, in particular
> unifying/consolidating them for efficient use of resources. I don't think
> we want to end up in a state where every distinct physical concern means a
> distinct environment.
> >>
> >> Why not? Assuming, of course, that runners are free to merge
> >> environments (merging those resource hints they understand and are
> >> otherwise compatible, and discarding those they don't) for efficient
> >> execution.
> >>
> >> > I for one am ready to see a PR.
> >>
> >> +1
> >>
> >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> wrote:
> >> >>>
> >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> mir...@google.com> wrote:
> >> >>> >
> >> >>> > Thanks everyone so much for their input and for the insightful
> discussion.
> >> >>> >
> >> >>> > Not being knowledgeable about Beam's internals, I have to say I
> am a bit lost on the PTransform vs. environment discussion.
> >> >>> >
> >> >>> > I do agree with Burke's notion that merge rules are very
> annotation dependent, I don't think we can find a one-size-fits-all
> solution for that. So this might be actually be an argument in favour of
> having annotations on PTransforms, since it avoids the conflation with
> environments.
> >> >>> >
> >> >>> > Also in general, I feel that having annotations per single
> transform (rather than composite) and on PTransforms could lead to a
> simpler design.
> >> >>>
> >> >>> If we want to use these for privacy, I don't see how attaching them
> to
> >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> composite.)
> >> >>>
> >> >>> > Seeing as there are valuable arguments in favour of both
> (PTransform and environments) with no clear(?) "best solution", I would
> propose moving forward with the initial (PTransform) design to ship the
> feature and unblock teams asking for it. If it turns out that there was
> indeed a need to have annotations in environments, we could always refactor
> it.
> >> >>>
> >> >>> I have yet to see any arguments that resource-level hints, such as
> >> >>> memory or GPU, don't better belong on the environment. But moving
> >> >>> forward on PTransform-level ones for logical statements like privacy
> >> >>> declarations makes sense.
> >> >>
> >> >>
> >> >> Exactly this. Properties of transforms make sense. The properties
> may hold only of the whole subgraph. Even something as simple as "preserves
> keys". This is analogous (but converse) to requirements like "requires
> sorted input" which were explicitly excluded from the scope, which was
> about hardware environment for execution.
> >> >>
> >> >> The proposed scope and the proposed solution do not match and need
> to be 

Re: PTransform Annotations Proposal

2020-11-25 Thread Robert Bradshaw
I'd like to continue the discussion *and* see an implementation for
the part we've settled on. I was asking why not have "every distinct
physical concern means a distinct environment?"

On Wed, Nov 25, 2020 at 10:38 AM Robert Burke  wrote:
>
> Mostly because perfect is the enemy of good enough. We have a proposal, we 
> have clear boundaries for it. It's fine if the discussion continues, but I 
> see no evidence of concerns that should prevent starting an implementation, 
> because it seems we'll need both anyway.
>
> On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw  wrote:
>>
>> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke  wrote:
>> >
>> > It sounds like we've come to the position that non-correctness affecting 
>> > Ptransform Annotations are valuable at both leaf and composite levels, and 
>> > don't remove the potential need for a similar feature on Environments, to 
>> > handle physical concerns equirements for worker processes to have (such as 
>> > Ram, CPU, or GPU requirements.)
>> >
>> > Kenn, it's not clear what part of the solution (an annotation field on the 
>> > Ptransform proto message) would need to change to satisfy your scope 
>> > concern, beyond documenting unambiguously that these may not be used for 
>> > physical concerns or things that affect correctness.
>>
>> I'll let Kenn answer as well, but from my point of view, explicitly
>> having somewhere better to put these things would help.
>>
>> > I'm also unclear your scope concern not matching, given the above. Your 
>> > first paragraph reads very supportive of logical annotations on 
>> > Ptransforms, and that matches 1-1 with the current proposed solution. Can 
>> > you clarify your concern?
>> >
>> > I don't wish to scope creep on the physical requirements issue at this 
>> > time. It seems we are agreed they should end up on environments, but I'm 
>> > not seeing proposals on the right way to execute them at this time.They 
>> > seem to be a fruitful topic of discussion, in particular 
>> > unifying/consolidating them for efficient use of resources. I don't think 
>> > we want to end up in a state where every distinct physical concern means a 
>> > distinct environment.
>>
>> Why not? Assuming, of course, that runners are free to merge
>> environments (merging those resource hints they understand and are
>> otherwise compatible, and discarding those they don't) for efficient
>> execution.
>>
>> > I for one am ready to see a PR.
>>
>> +1
>>
>> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles  wrote:
>> >>
>> >>
>> >>
>> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw  
>> >> wrote:
>> >>>
>> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran 
>> >>>  wrote:
>> >>> >
>> >>> > Thanks everyone so much for their input and for the insightful 
>> >>> > discussion.
>> >>> >
>> >>> > Not being knowledgeable about Beam's internals, I have to say I am a 
>> >>> > bit lost on the PTransform vs. environment discussion.
>> >>> >
>> >>> > I do agree with Burke's notion that merge rules are very annotation 
>> >>> > dependent, I don't think we can find a one-size-fits-all solution for 
>> >>> > that. So this might be actually be an argument in favour of having 
>> >>> > annotations on PTransforms, since it avoids the conflation with 
>> >>> > environments.
>> >>> >
>> >>> > Also in general, I feel that having annotations per single transform 
>> >>> > (rather than composite) and on PTransforms could lead to a simpler 
>> >>> > design.
>> >>>
>> >>> If we want to use these for privacy, I don't see how attaching them to
>> >>> leaf transforms alone could work. (Even CombinePerKey is a composite.)
>> >>>
>> >>> > Seeing as there are valuable arguments in favour of both (PTransform 
>> >>> > and environments) with no clear(?) "best solution", I would propose 
>> >>> > moving forward with the initial (PTransform) design to ship the 
>> >>> > feature and unblock teams asking for it. If it turns out that there 
>> >>> > was indeed a need to have annotations in environments, we could always 
>> >>> > refactor it.
>> >>>
>> >>> I have yet to see any arguments that resource-level hints, such as
>> >>> memory or GPU, don't better belong on the environment. But moving
>> >>> forward on PTransform-level ones for logical statements like privacy
>> >>> declarations makes sense.
>> >>
>> >>
>> >> Exactly this. Properties of transforms make sense. The properties may 
>> >> hold only of the whole subgraph. Even something as simple as "preserves 
>> >> keys". This is analogous (but converse) to requirements like "requires 
>> >> sorted input" which were explicitly excluded from the scope, which was 
>> >> about hardware environment for execution.
>> >>
>> >> The proposed scope and the proposed solution do not match and need to be 
>> >> reconciled.
>> >>
>> >> Kenn
>> >>
>> >>> > On 2020/11/17 19:07:22, Robert Bradshaw  wrote:
>> >>> > > So far we have two distinct usecases for annotations: resource hints
>> >>> > > and privacy directives, and I've been 

Re: PTransform Annotations Proposal

2020-11-25 Thread Robert Burke
Mostly because perfect is the enemy of good enough. We have a proposal, we
have clear boundaries for it. It's fine if the discussion continues, but I
see no evidence of concerns that should prevent starting an implementation,
because it seems we'll need both anyway.

On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw  wrote:

> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke  wrote:
> >
> > It sounds like we've come to the position that non-correctness affecting
> Ptransform Annotations are valuable at both leaf and composite levels, and
> don't remove the potential need for a similar feature on Environments, to
> handle physical concerns equirements for worker processes to have (such as
> Ram, CPU, or GPU requirements.)
> >
> > Kenn, it's not clear what part of the solution (an annotation field on
> the Ptransform proto message) would need to change to satisfy your scope
> concern, beyond documenting unambiguously that these may not be used for
> physical concerns or things that affect correctness.
>
> I'll let Kenn answer as well, but from my point of view, explicitly
> having somewhere better to put these things would help.
>
> > I'm also unclear your scope concern not matching, given the above. Your
> first paragraph reads very supportive of logical annotations on
> Ptransforms, and that matches 1-1 with the current proposed solution. Can
> you clarify your concern?
> >
> > I don't wish to scope creep on the physical requirements issue at this
> time. It seems we are agreed they should end up on environments, but I'm
> not seeing proposals on the right way to execute them at this time.They
> seem to be a fruitful topic of discussion, in particular
> unifying/consolidating them for efficient use of resources. I don't think
> we want to end up in a state where every distinct physical concern means a
> distinct environment.
>
> Why not? Assuming, of course, that runners are free to merge
> environments (merging those resource hints they understand and are
> otherwise compatible, and discarding those they don't) for efficient
> execution.
>
> > I for one am ready to see a PR.
>
> +1
>
> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles  wrote:
> >>
> >>
> >>
> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> wrote:
> >>>
> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> mir...@google.com> wrote:
> >>> >
> >>> > Thanks everyone so much for their input and for the insightful
> discussion.
> >>> >
> >>> > Not being knowledgeable about Beam's internals, I have to say I am a
> bit lost on the PTransform vs. environment discussion.
> >>> >
> >>> > I do agree with Burke's notion that merge rules are very annotation
> dependent, I don't think we can find a one-size-fits-all solution for that.
> So this might be actually be an argument in favour of having annotations on
> PTransforms, since it avoids the conflation with environments.
> >>> >
> >>> > Also in general, I feel that having annotations per single transform
> (rather than composite) and on PTransforms could lead to a simpler design.
> >>>
> >>> If we want to use these for privacy, I don't see how attaching them to
> >>> leaf transforms alone could work. (Even CombinePerKey is a composite.)
> >>>
> >>> > Seeing as there are valuable arguments in favour of both (PTransform
> and environments) with no clear(?) "best solution", I would propose moving
> forward with the initial (PTransform) design to ship the feature and
> unblock teams asking for it. If it turns out that there was indeed a need
> to have annotations in environments, we could always refactor it.
> >>>
> >>> I have yet to see any arguments that resource-level hints, such as
> >>> memory or GPU, don't better belong on the environment. But moving
> >>> forward on PTransform-level ones for logical statements like privacy
> >>> declarations makes sense.
> >>
> >>
> >> Exactly this. Properties of transforms make sense. The properties may
> hold only of the whole subgraph. Even something as simple as "preserves
> keys". This is analogous (but converse) to requirements like "requires
> sorted input" which were explicitly excluded from the scope, which was
> about hardware environment for execution.
> >>
> >> The proposed scope and the proposed solution do not match and need to
> be reconciled.
> >>
> >> Kenn
> >>
> >>> > On 2020/11/17 19:07:22, Robert Bradshaw  wrote:
> >>> > > So far we have two distinct usecases for annotations: resource
> hints
> >>> > > and privacy directives, and I've been trying to figure out how to
> >>> > > reconcile them, but they seem to have very different
> characteristics.
> >>> > > (It would be nice to come up with other uses as well to see if
> we're
> >>> > > really coming up with a generally useful mode--I think display data
> >>> > > could fit into this as a new kind of annotation rather than being a
> >>> > > top-level property, and it could make sense on both leaf and
> composite
> >>> > > transforms.)
> >>> > >
> >>> > > To me, resource hints like GPU 

Re: PTransform Annotations Proposal

2020-11-25 Thread Robert Bradshaw
On Wed, Nov 25, 2020 at 10:15 AM Robert Burke  wrote:
>
> It sounds like we've come to the position that non-correctness affecting 
> Ptransform Annotations are valuable at both leaf and composite levels, and 
> don't remove the potential need for a similar feature on Environments, to 
> handle physical concerns equirements for worker processes to have (such as 
> Ram, CPU, or GPU requirements.)
>
> Kenn, it's not clear what part of the solution (an annotation field on the 
> Ptransform proto message) would need to change to satisfy your scope concern, 
> beyond documenting unambiguously that these may not be used for physical 
> concerns or things that affect correctness.

I'll let Kenn answer as well, but from my point of view, explicitly
having somewhere better to put these things would help.

> I'm also unclear your scope concern not matching, given the above. Your first 
> paragraph reads very supportive of logical annotations on Ptransforms, and 
> that matches 1-1 with the current proposed solution. Can you clarify your 
> concern?
>
> I don't wish to scope creep on the physical requirements issue at this time. 
> It seems we are agreed they should end up on environments, but I'm not seeing 
> proposals on the right way to execute them at this time.They seem to be a 
> fruitful topic of discussion, in particular unifying/consolidating them for 
> efficient use of resources. I don't think we want to end up in a state where 
> every distinct physical concern means a distinct environment.

Why not? Assuming, of course, that runners are free to merge
environments (merging those resource hints they understand and are
otherwise compatible, and discarding those they don't) for efficient
execution.

> I for one am ready to see a PR.

+1

> On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles  wrote:
>>
>>
>>
>> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw  wrote:
>>>
>>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran  
>>> wrote:
>>> >
>>> > Thanks everyone so much for their input and for the insightful discussion.
>>> >
>>> > Not being knowledgeable about Beam's internals, I have to say I am a bit 
>>> > lost on the PTransform vs. environment discussion.
>>> >
>>> > I do agree with Burke's notion that merge rules are very annotation 
>>> > dependent, I don't think we can find a one-size-fits-all solution for 
>>> > that. So this might be actually be an argument in favour of having 
>>> > annotations on PTransforms, since it avoids the conflation with 
>>> > environments.
>>> >
>>> > Also in general, I feel that having annotations per single transform 
>>> > (rather than composite) and on PTransforms could lead to a simpler design.
>>>
>>> If we want to use these for privacy, I don't see how attaching them to
>>> leaf transforms alone could work. (Even CombinePerKey is a composite.)
>>>
>>> > Seeing as there are valuable arguments in favour of both (PTransform and 
>>> > environments) with no clear(?) "best solution", I would propose moving 
>>> > forward with the initial (PTransform) design to ship the feature and 
>>> > unblock teams asking for it. If it turns out that there was indeed a need 
>>> > to have annotations in environments, we could always refactor it.
>>>
>>> I have yet to see any arguments that resource-level hints, such as
>>> memory or GPU, don't better belong on the environment. But moving
>>> forward on PTransform-level ones for logical statements like privacy
>>> declarations makes sense.
>>
>>
>> Exactly this. Properties of transforms make sense. The properties may hold 
>> only of the whole subgraph. Even something as simple as "preserves keys". 
>> This is analogous (but converse) to requirements like "requires sorted 
>> input" which were explicitly excluded from the scope, which was about 
>> hardware environment for execution.
>>
>> The proposed scope and the proposed solution do not match and need to be 
>> reconciled.
>>
>> Kenn
>>
>>> > On 2020/11/17 19:07:22, Robert Bradshaw  wrote:
>>> > > So far we have two distinct usecases for annotations: resource hints
>>> > > and privacy directives, and I've been trying to figure out how to
>>> > > reconcile them, but they seem to have very different characteristics.
>>> > > (It would be nice to come up with other uses as well to see if we're
>>> > > really coming up with a generally useful mode--I think display data
>>> > > could fit into this as a new kind of annotation rather than being a
>>> > > top-level property, and it could make sense on both leaf and composite
>>> > > transforms.)
>>> > >
>>> > > To me, resource hints like GPU are inextricably tied to the
>>> > > environment. A transform tagged with GPU should reference a Fn that
>>> > > invokes GPU-accelerated code that lives in a particular environment.
>>> > > Something like high-mem is a bit squishier. Some DoFns take a lot of
>>> > > memory, but on the other hand one could imagine labeling a CoGBK as
>>> > > high-mem due to knowing that, in this particular usage, there 

Re: PTransform Annotations Proposal

2020-11-25 Thread Robert Burke
It sounds like we've come to the position that non-correctness affecting
Ptransform Annotations are valuable at both leaf and composite levels, and
don't remove the potential need for a similar feature on Environments, to
handle physical concerns equirements for worker processes to have (such as
Ram, CPU, or GPU requirements.)

Kenn, it's not clear what part of the solution (an annotation field on the
Ptransform proto message) would need to change to satisfy your scope
concern, beyond documenting unambiguously that these may not be used for
physical concerns or things that affect correctness. I'm also unclear your
scope concern not matching, given the above. Your first paragraph reads
very supportive of logical annotations on Ptransforms, and that matches 1-1
with the current proposed solution. Can you clarify your concern?

I don't wish to scope creep on the physical requirements issue at this
time. It seems we are agreed they should end up on environments, but I'm
not seeing proposals on the right way to execute them at this time.They
seem to be a fruitful topic of discussion, in particular
unifying/consolidating them for efficient use of resources. I don't think
we want to end up in a state where every distinct physical concern means a
distinct environment.

I for one am ready to see a PR.


On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles  wrote:

>
>
> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran 
>> wrote:
>> >
>> > Thanks everyone so much for their input and for the insightful
>> discussion.
>> >
>> > Not being knowledgeable about Beam's internals, I have to say I am a
>> bit lost on the PTransform vs. environment discussion.
>> >
>> > I do agree with Burke's notion that merge rules are very annotation
>> dependent, I don't think we can find a one-size-fits-all solution for that.
>> So this might be actually be an argument in favour of having annotations on
>> PTransforms, since it avoids the conflation with environments.
>> >
>> > Also in general, I feel that having annotations per single transform
>> (rather than composite) and on PTransforms could lead to a simpler design.
>>
>> If we want to use these for privacy, I don't see how attaching them to
>> leaf transforms alone could work. (Even CombinePerKey is a composite.)
>>
>> > Seeing as there are valuable arguments in favour of both (PTransform
>> and environments) with no clear(?) "best solution", I would propose moving
>> forward with the initial (PTransform) design to ship the feature and
>> unblock teams asking for it. If it turns out that there was indeed a need
>> to have annotations in environments, we could always refactor it.
>>
>> I have yet to see any arguments that resource-level hints, such as
>> memory or GPU, don't better belong on the environment. But moving
>> forward on PTransform-level ones for logical statements like privacy
>> declarations makes sense.
>>
>
> Exactly this. Properties of transforms make sense. The properties may hold
> only of the whole subgraph. Even something as simple as "preserves keys".
> This is analogous (but converse) to requirements like "requires sorted
> input" which were explicitly excluded from the scope, which was about
> hardware environment for execution.
>
> The proposed scope and the proposed solution do not match and need to be
> reconciled.
>
> Kenn
>
> > On 2020/11/17 19:07:22, Robert Bradshaw  wrote:
>> > > So far we have two distinct usecases for annotations: resource hints
>> > > and privacy directives, and I've been trying to figure out how to
>> > > reconcile them, but they seem to have very different characteristics.
>> > > (It would be nice to come up with other uses as well to see if we're
>> > > really coming up with a generally useful mode--I think display data
>> > > could fit into this as a new kind of annotation rather than being a
>> > > top-level property, and it could make sense on both leaf and composite
>> > > transforms.)
>> > >
>> > > To me, resource hints like GPU are inextricably tied to the
>> > > environment. A transform tagged with GPU should reference a Fn that
>> > > invokes GPU-accelerated code that lives in a particular environment.
>> > > Something like high-mem is a bit squishier. Some DoFns take a lot of
>> > > memory, but on the other hand one could imagine labeling a CoGBK as
>> > > high-mem due to knowing that, in this particular usage, there will be
>> > > lots of values with the same key. Ideally runners would be intelligent
>> > > enough to automatically learn memory usage, but even in this case it
>> > > may be a good hint to try and learn the requirements for DoFn A and
>> > > DoFn B separately (which is difficult if they are always colocated,
>> > > but valuable if, e.g. A takes a huge amount of memory and B takes a
>> > > huge amount of wall time).
>> > >
>> > > Note that tying things to the environment does not preclude using them
>> > > in non-portable runners as they'll 

Re: PTransform Annotations Proposal

2020-11-23 Thread Kenneth Knowles
On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw  wrote:

> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran 
> wrote:
> >
> > Thanks everyone so much for their input and for the insightful
> discussion.
> >
> > Not being knowledgeable about Beam's internals, I have to say I am a bit
> lost on the PTransform vs. environment discussion.
> >
> > I do agree with Burke's notion that merge rules are very annotation
> dependent, I don't think we can find a one-size-fits-all solution for that.
> So this might be actually be an argument in favour of having annotations on
> PTransforms, since it avoids the conflation with environments.
> >
> > Also in general, I feel that having annotations per single transform
> (rather than composite) and on PTransforms could lead to a simpler design.
>
> If we want to use these for privacy, I don't see how attaching them to
> leaf transforms alone could work. (Even CombinePerKey is a composite.)
>
> > Seeing as there are valuable arguments in favour of both (PTransform and
> environments) with no clear(?) "best solution", I would propose moving
> forward with the initial (PTransform) design to ship the feature and
> unblock teams asking for it. If it turns out that there was indeed a need
> to have annotations in environments, we could always refactor it.
>
> I have yet to see any arguments that resource-level hints, such as
> memory or GPU, don't better belong on the environment. But moving
> forward on PTransform-level ones for logical statements like privacy
> declarations makes sense.
>

Exactly this. Properties of transforms make sense. The properties may hold
only of the whole subgraph. Even something as simple as "preserves keys".
This is analogous (but converse) to requirements like "requires sorted
input" which were explicitly excluded from the scope, which was about
hardware environment for execution.

The proposed scope and the proposed solution do not match and need to be
reconciled.

Kenn

> On 2020/11/17 19:07:22, Robert Bradshaw  wrote:
> > > So far we have two distinct usecases for annotations: resource hints
> > > and privacy directives, and I've been trying to figure out how to
> > > reconcile them, but they seem to have very different characteristics.
> > > (It would be nice to come up with other uses as well to see if we're
> > > really coming up with a generally useful mode--I think display data
> > > could fit into this as a new kind of annotation rather than being a
> > > top-level property, and it could make sense on both leaf and composite
> > > transforms.)
> > >
> > > To me, resource hints like GPU are inextricably tied to the
> > > environment. A transform tagged with GPU should reference a Fn that
> > > invokes GPU-accelerated code that lives in a particular environment.
> > > Something like high-mem is a bit squishier. Some DoFns take a lot of
> > > memory, but on the other hand one could imagine labeling a CoGBK as
> > > high-mem due to knowing that, in this particular usage, there will be
> > > lots of values with the same key. Ideally runners would be intelligent
> > > enough to automatically learn memory usage, but even in this case it
> > > may be a good hint to try and learn the requirements for DoFn A and
> > > DoFn B separately (which is difficult if they are always colocated,
> > > but valuable if, e.g. A takes a huge amount of memory and B takes a
> > > huge amount of wall time).
> > >
> > > Note that tying things to the environment does not preclude using them
> > > in non-portable runners as they'll still have an SDK-level
> > > representation (though I don't think we should have an explicit goal
> > > of feature parity for non-portable runners, e.g. multi-language isn't
> > > happening, and hope that non-portable runners go away soon anyway).
> > >
> > > Now let's consider privacy annotations. To make things very concrete,
> > > imagine a transform AverageSpendPerZipCode which takes as input (user,
> > > zip, spend), all users unique, and returns (zip, avg(spend)). In
> > > Python, this is GroupBy('zip').aggregate_field('spend',
> > > MeanCombineFn()). This is not very privacy preserving to those users
> > > who are the only (or one of a few) in a zip code. So we could define a
> > > transform PrivacyPreservingAverageSpendPerZipCode as
> > >
> > > @ptransform_fn
> > > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
> > > counts_per_zip = spend_per_user |
> > > GroupBy('zip').aggregate_field('user', CountCombineFn())
> > > spend_per_zip = spend_per_user |
> > > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> > > filtered = spend_per_zip | beam.Filter(
> > > lambda x, counts: counts[x.zip] > threshold,
> > > counts=AsMap(counts_per_zip))
> > > return filtered
> > >
> > > We now have a composite that has privacy preserving properties (i.e.
> > > the input may be quite sensitive, but the output is not, depending on
> > > the value of threshold). What is interesting here is that 

Re: PTransform Annotations Proposal

2020-11-23 Thread Robert Bradshaw
On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran  wrote:
>
> Thanks everyone so much for their input and for the insightful discussion.
>
> Not being knowledgeable about Beam's internals, I have to say I am a bit lost 
> on the PTransform vs. environment discussion.
>
> I do agree with Burke's notion that merge rules are very annotation 
> dependent, I don't think we can find a one-size-fits-all solution for that. 
> So this might be actually be an argument in favour of having annotations on 
> PTransforms, since it avoids the conflation with environments.
>
> Also in general, I feel that having annotations per single transform (rather 
> than composite) and on PTransforms could lead to a simpler design.

If we want to use these for privacy, I don't see how attaching them to
leaf transforms alone could work. (Even CombinePerKey is a composite.)

> Seeing as there are valuable arguments in favour of both (PTransform and 
> environments) with no clear(?) "best solution", I would propose moving 
> forward with the initial (PTransform) design to ship the feature and unblock 
> teams asking for it. If it turns out that there was indeed a need to have 
> annotations in environments, we could always refactor it.

I have yet to see any arguments that resource-level hints, such as
memory or GPU, don't better belong on the environment. But moving
forward on PTransform-level ones for logical statements like privacy
declarations makes sense.

> On 2020/11/17 19:07:22, Robert Bradshaw  wrote:
> > So far we have two distinct usecases for annotations: resource hints
> > and privacy directives, and I've been trying to figure out how to
> > reconcile them, but they seem to have very different characteristics.
> > (It would be nice to come up with other uses as well to see if we're
> > really coming up with a generally useful mode--I think display data
> > could fit into this as a new kind of annotation rather than being a
> > top-level property, and it could make sense on both leaf and composite
> > transforms.)
> >
> > To me, resource hints like GPU are inextricably tied to the
> > environment. A transform tagged with GPU should reference a Fn that
> > invokes GPU-accelerated code that lives in a particular environment.
> > Something like high-mem is a bit squishier. Some DoFns take a lot of
> > memory, but on the other hand one could imagine labeling a CoGBK as
> > high-mem due to knowing that, in this particular usage, there will be
> > lots of values with the same key. Ideally runners would be intelligent
> > enough to automatically learn memory usage, but even in this case it
> > may be a good hint to try and learn the requirements for DoFn A and
> > DoFn B separately (which is difficult if they are always colocated,
> > but valuable if, e.g. A takes a huge amount of memory and B takes a
> > huge amount of wall time).
> >
> > Note that tying things to the environment does not preclude using them
> > in non-portable runners as they'll still have an SDK-level
> > representation (though I don't think we should have an explicit goal
> > of feature parity for non-portable runners, e.g. multi-language isn't
> > happening, and hope that non-portable runners go away soon anyway).
> >
> > Now let's consider privacy annotations. To make things very concrete,
> > imagine a transform AverageSpendPerZipCode which takes as input (user,
> > zip, spend), all users unique, and returns (zip, avg(spend)). In
> > Python, this is GroupBy('zip').aggregate_field('spend',
> > MeanCombineFn()). This is not very privacy preserving to those users
> > who are the only (or one of a few) in a zip code. So we could define a
> > transform PrivacyPreservingAverageSpendPerZipCode as
> >
> > @ptransform_fn
> > def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
> > counts_per_zip = spend_per_user |
> > GroupBy('zip').aggregate_field('user', CountCombineFn())
> > spend_per_zip = spend_per_user |
> > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> > filtered = spend_per_zip | beam.Filter(
> > lambda x, counts: counts[x.zip] > threshold,
> > counts=AsMap(counts_per_zip))
> > return filtered
> >
> > We now have a composite that has privacy preserving properties (i.e.
> > the input may be quite sensitive, but the output is not, depending on
> > the value of threshold). What is interesting here is that it is only
> > the composite that has this property--no individual sub-transform is
> > itself privacy preserving. Furthermore, an optimizer may notice we're
> > doing aggregation on the same key twice and rewrite this using
> > (logically)
> >
> > GroupBy('zip').aggregate_field('user',
> > CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> >
> > and then applying the filter, which is semantically equivalent and
> > satisfies the privacy annotations (and notably that does not even
> > require the optimizer to interpret the annotations, just pass them
> > on). To me, this implies that 

Re: PTransform Annotations Proposal

2020-11-20 Thread Mirac Vuslat Basaran
Thanks everyone so much for their input and for the insightful discussion.

Not being knowledgeable about Beam's internals, I have to say I am a bit lost 
on the PTransform vs. environment discussion.

I do agree with Burke's notion that merge rules are very annotation dependent, 
I don't think we can find a one-size-fits-all solution for that. So this might 
be actually be an argument in favour of having annotations on PTransforms, 
since it avoids the conflation with environments.

Also in general, I feel that having annotations per single transform (rather 
than composite) and on PTransforms could lead to a simpler design.

Seeing as there are valuable arguments in favour of both (PTransform and 
environments) with no clear(?) "best solution", I would propose moving forward 
with the initial (PTransform) design to ship the feature and unblock teams 
asking for it. If it turns out that there was indeed a need to have annotations 
in environments, we could always refactor it.

On 2020/11/17 19:07:22, Robert Bradshaw  wrote: 
> So far we have two distinct usecases for annotations: resource hints
> and privacy directives, and I've been trying to figure out how to
> reconcile them, but they seem to have very different characteristics.
> (It would be nice to come up with other uses as well to see if we're
> really coming up with a generally useful mode--I think display data
> could fit into this as a new kind of annotation rather than being a
> top-level property, and it could make sense on both leaf and composite
> transforms.)
> 
> To me, resource hints like GPU are inextricably tied to the
> environment. A transform tagged with GPU should reference a Fn that
> invokes GPU-accelerated code that lives in a particular environment.
> Something like high-mem is a bit squishier. Some DoFns take a lot of
> memory, but on the other hand one could imagine labeling a CoGBK as
> high-mem due to knowing that, in this particular usage, there will be
> lots of values with the same key. Ideally runners would be intelligent
> enough to automatically learn memory usage, but even in this case it
> may be a good hint to try and learn the requirements for DoFn A and
> DoFn B separately (which is difficult if they are always colocated,
> but valuable if, e.g. A takes a huge amount of memory and B takes a
> huge amount of wall time).
> 
> Note that tying things to the environment does not preclude using them
> in non-portable runners as they'll still have an SDK-level
> representation (though I don't think we should have an explicit goal
> of feature parity for non-portable runners, e.g. multi-language isn't
> happening, and hope that non-portable runners go away soon anyway).
> 
> Now let's consider privacy annotations. To make things very concrete,
> imagine a transform AverageSpendPerZipCode which takes as input (user,
> zip, spend), all users unique, and returns (zip, avg(spend)). In
> Python, this is GroupBy('zip').aggregate_field('spend',
> MeanCombineFn()). This is not very privacy preserving to those users
> who are the only (or one of a few) in a zip code. So we could define a
> transform PrivacyPreservingAverageSpendPerZipCode as
> 
> @ptransform_fn
> def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
> counts_per_zip = spend_per_user |
> GroupBy('zip').aggregate_field('user', CountCombineFn())
> spend_per_zip = spend_per_user |
> GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> filtered = spend_per_zip | beam.Filter(
> lambda x, counts: counts[x.zip] > threshold,
> counts=AsMap(counts_per_zip))
> return filtered
> 
> We now have a composite that has privacy preserving properties (i.e.
> the input may be quite sensitive, but the output is not, depending on
> the value of threshold). What is interesting here is that it is only
> the composite that has this property--no individual sub-transform is
> itself privacy preserving. Furthermore, an optimizer may notice we're
> doing aggregation on the same key twice and rewrite this using
> (logically)
> 
> GroupBy('zip').aggregate_field('user',
> CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> 
> and then applying the filter, which is semantically equivalent and
> satisfies the privacy annotations (and notably that does not even
> require the optimizer to interpret the annotations, just pass them
> on). To me, this implies that these annotations belong on the
> composites, and *not* on the leaf nodes (where they would be
> incorrect).
> 
> I'll leave aside most questions of API until we figure out the model
> semantics, but wanted to throw one possible idea out (though I am
> ambivalent about it). Instead of attaching things to transforms, we
> can just wrap transforms in composites that have no role other than
> declaring information about their contents. E.g. we could have a
> composite transform whose payload is simply an assertion of the
> privacy (or resource?) properties of its inner structure. 

Re: PTransform Annotations Proposal

2020-11-17 Thread Robert Bradshaw
So far we have two distinct usecases for annotations: resource hints
and privacy directives, and I've been trying to figure out how to
reconcile them, but they seem to have very different characteristics.
(It would be nice to come up with other uses as well to see if we're
really coming up with a generally useful mode--I think display data
could fit into this as a new kind of annotation rather than being a
top-level property, and it could make sense on both leaf and composite
transforms.)

To me, resource hints like GPU are inextricably tied to the
environment. A transform tagged with GPU should reference a Fn that
invokes GPU-accelerated code that lives in a particular environment.
Something like high-mem is a bit squishier. Some DoFns take a lot of
memory, but on the other hand one could imagine labeling a CoGBK as
high-mem due to knowing that, in this particular usage, there will be
lots of values with the same key. Ideally runners would be intelligent
enough to automatically learn memory usage, but even in this case it
may be a good hint to try and learn the requirements for DoFn A and
DoFn B separately (which is difficult if they are always colocated,
but valuable if, e.g. A takes a huge amount of memory and B takes a
huge amount of wall time).

Note that tying things to the environment does not preclude using them
in non-portable runners as they'll still have an SDK-level
representation (though I don't think we should have an explicit goal
of feature parity for non-portable runners, e.g. multi-language isn't
happening, and hope that non-portable runners go away soon anyway).

Now let's consider privacy annotations. To make things very concrete,
imagine a transform AverageSpendPerZipCode which takes as input (user,
zip, spend), all users unique, and returns (zip, avg(spend)). In
Python, this is GroupBy('zip').aggregate_field('spend',
MeanCombineFn()). This is not very privacy preserving to those users
who are the only (or one of a few) in a zip code. So we could define a
transform PrivacyPreservingAverageSpendPerZipCode as

@ptransform_fn
def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
counts_per_zip = spend_per_user |
GroupBy('zip').aggregate_field('user', CountCombineFn())
spend_per_zip = spend_per_user |
GroupBy('zip').aggregate_field('spend', MeanCombineFn())
filtered = spend_per_zip | beam.Filter(
lambda x, counts: counts[x.zip] > threshold,
counts=AsMap(counts_per_zip))
return filtered

We now have a composite that has privacy preserving properties (i.e.
the input may be quite sensitive, but the output is not, depending on
the value of threshold). What is interesting here is that it is only
the composite that has this property--no individual sub-transform is
itself privacy preserving. Furthermore, an optimizer may notice we're
doing aggregation on the same key twice and rewrite this using
(logically)

GroupBy('zip').aggregate_field('user',
CountCombineFn()).aggregate_field('spend', MeanCombineFn())

and then applying the filter, which is semantically equivalent and
satisfies the privacy annotations (and notably that does not even
require the optimizer to interpret the annotations, just pass them
on). To me, this implies that these annotations belong on the
composites, and *not* on the leaf nodes (where they would be
incorrect).

I'll leave aside most questions of API until we figure out the model
semantics, but wanted to throw one possible idea out (though I am
ambivalent about it). Instead of attaching things to transforms, we
can just wrap transforms in composites that have no role other than
declaring information about their contents. E.g. we could have a
composite transform whose payload is simply an assertion of the
privacy (or resource?) properties of its inner structure. This would
be just as expressive as adding new properties to transforms
themselves (but would add an extra level of nesting, and make
respecting the precice nesting more important).

On Tue, Nov 17, 2020 at 8:12 AM Robert Burke  wrote:
>
> +1 to discussing PCollection annotations on a separate thread. It would be 
> confusing to mix them up.
>
> ---
>
> The question around conflicts is interesting, but confusing to me. I don't 
> think they exist in general. I keep coming back around to that it depends on 
> the annotation and the purpose of composites. Optionality saves us here too.
>
> Composites are nothing without their internal hypergraph structure. 
> Eventually it comes down to executing the leaf nodes. The alternative to 
> executing the leaf nodes is when the composite represents a known transform 
> and is replaced by the runner on submission time.  Lets look at each.
>
> If there's a property that only exists on the leaf nodes, then it's not 
> possible to bubble up that property to the composite in all cases. Afterall, 
> it's not necessarily the case that a privacy preserving transform maintains 
> the property for all output edges as not all such 

Re: PTransform Annotations Proposal

2020-11-17 Thread Robert Burke
+1 to discussing PCollection annotations on a separate thread. It would be
confusing to mix them up.

---

The question around conflicts is interesting, but confusing to me. I don't
think they exist in general. I keep coming back around to that it depends
on the annotation and the purpose of composites. Optionality saves us here
too.

Composites are nothing without their internal hypergraph structure.
Eventually it comes down to executing the leaf nodes. The alternative to
executing the leaf nodes is when the composite represents a known transform
and is replaced by the runner on submission time.  Lets look at each.

If there's a property that only exists on the leaf nodes, then it's not
possible to bubble up that property to the composite in all cases.
Afterall, it's not necessarily the case that a privacy preserving transform
maintains the property for all output edges as not all such edges pass
through the preserving transform.

On the other hand, with memory or gpu recommendations, that might set a low
bar on the composite level.

But, composites (any transform really) can be runner replaced. I think it's
fair to say that a runner replaced composite is not beholden to the
annotations of the original leaf transforms, especially around physical
requirements. The implementations are different. If a known composite at
the composite level requires GPUs and it's known replacement doesn't, I'd
posit that replacement was a choice the runner made since it can't
provision machines with GPUs.

But, crucially around privacy annotated transforms, a runner likely
shouldn't replace a given subgraph that contains a privacy annotationed
transform unless the replacements provide the same level of privacy.
However, such replacements only happens with well known transforms with
known properties anyway, so this can serve as an additional layer of
validation for a runner aware of the properties.

This brings me back to my position: that the notion of conflicts is very
annotation dependant, and that defining them as optional is the most
important feature to avoid issues. Conflicts don't exist as an inherent
property of annotations on ptransform of the hypergraph structure. Am i
wrong? No one has come up with an actual example of a conflict as far as i
understand the thread.

Even Reuven's original question is more about whether the runner is forced
to look at leaf bodes rather than only looking at the composite. Assuming
the composite isn't replaced, the runner needs to look at the leaf nodes
regardless. And as discussed above there's no generalized semantics that
fit for all kinds of annotations, once replacements are also considered.

On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía  wrote:

> +1 Nice to see there is finally interest on this. Annotations for
> PTransforms make total sense!
>
> The semantics should be strictly optional for runners and correct
> execution should not be affected by lack of support of any annotation.
> We should however keep the set of annotations small.
>
> > PTransforms are hierarchical - namely a PTransform contains other
> PTransforms, and so on. Is the runner expected to resolve all annotations
> down to leaf nodes? What happens if that results in conflicting annotations?
>
> +1 to this question, This needs to be detailed.
>
> I am curious about how the end user APIs of this will look maybe in
> Java or Python, just an extra method to inject a Map or via Java
> annotations/Python decorators?
>
> We might prefer not to mix the concepts of annotations and
> environments because this will limit the scope of annotations.
> Annotations are different from environments because they serve a more
> general idea: to express an intention and it is up to the runner to
> choose the strategy to accomplish this, for example in the GPU
> assignation case it could be to rewrite resource allocation via
> Environments but it could also just delegate this to a resource
> manager which is what we could do for example for GPU (or data
> locality) cases on the Spark/Flink classic runners. If we tie this to
> environments we will leave classic runners out of the loop for no
> major reason and also not cover use cases not related to resource
> allocation.
>
> I do not understand the use case to justify PCollection annotations
> but to not mix this thread with them, would you be interested to
> elaborate more about them in a different thread Jan?
>
> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw 
> wrote:
> >
> > I agree things like GPU, high-mem, etc. belong to the environment. If
> > annotations are truly advisory, one can imagine merging environments
> > by taking the union of annotations and still producing a correct
> > pipeline. (This would mean that annotations would have to be a
> > multi-map...)
> >
> > On the other hand, this doesn't seem to handle the case of privacy
> > analysis, which could apply to composites without applying to any
> > individual component, and don't really make sense as part 

Re: PTransform Annotations Proposal

2020-11-17 Thread Ismaël Mejía
+1 Nice to see there is finally interest on this. Annotations for
PTransforms make total sense!

The semantics should be strictly optional for runners and correct
execution should not be affected by lack of support of any annotation.
We should however keep the set of annotations small.

> PTransforms are hierarchical - namely a PTransform contains other 
> PTransforms, and so on. Is the runner expected to resolve all annotations 
> down to leaf nodes? What happens if that results in conflicting annotations?

+1 to this question, This needs to be detailed.

I am curious about how the end user APIs of this will look maybe in
Java or Python, just an extra method to inject a Map or via Java
annotations/Python decorators?

We might prefer not to mix the concepts of annotations and
environments because this will limit the scope of annotations.
Annotations are different from environments because they serve a more
general idea: to express an intention and it is up to the runner to
choose the strategy to accomplish this, for example in the GPU
assignation case it could be to rewrite resource allocation via
Environments but it could also just delegate this to a resource
manager which is what we could do for example for GPU (or data
locality) cases on the Spark/Flink classic runners. If we tie this to
environments we will leave classic runners out of the loop for no
major reason and also not cover use cases not related to resource
allocation.

I do not understand the use case to justify PCollection annotations
but to not mix this thread with them, would you be interested to
elaborate more about them in a different thread Jan?

On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw  wrote:
>
> I agree things like GPU, high-mem, etc. belong to the environment. If
> annotations are truly advisory, one can imagine merging environments
> by taking the union of annotations and still producing a correct
> pipeline. (This would mean that annotations would have to be a
> multi-map...)
>
> On the other hand, this doesn't seem to handle the case of privacy
> analysis, which could apply to composites without applying to any
> individual component, and don't really make sense as part of a
> fusion/execution story.
>
> On Mon, Nov 16, 2020 at 4:00 PM Robert Burke  wrote:
> >
> > That's good historical context.
> >
> > But then we'd still need to codify the annotation would need to be 
> > optional, and not affect correctness.
> >
> > Conflicts become easier to manage, (as environments with conflicting 
> > annotations simply don't get merged, and stay as distinct environments) but 
> > are still notionally annotation dependant. Do most runners handle 
> > environments so individually though?
> >
> > Reuven's question is a good one though. For the Go SDK, and the proposed 
> > implementation i saw, they only applied to leaf nodes. This is an artifact 
> > of how the Go SDK handles composites. Nothing stops it from being 
> > implemented on the composites Go has, but it didn't make sense otherwise. 
> > AFAICT Composites are generally for organizational convenience and not for 
> > functional aspects. Is this wrong? Afterall, does it make sense for 
> > environments to be on leaves and composites either? It's the same issue 
> > there.
> >
> >
> > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles  wrote:
> >>
> >> I am +1 to the proposal but believe it should be moved to the Environment. 
> >> I could be convinced otherwise, but would want to really understand the 
> >> details.
> >>
> >> I think we haven't done a great job communicating the purpose of the 
> >> Environment proto. It was explicitly created for this purpose.
> >>
> >> 1. It tells the runner things it needs to know to interpret the DoFn (or 
> >> other UDF). So these are the existing proto fields like docker image (in 
> >> the payload) and required artifacts that were staged.
> >> 2. It is also the place for additional requirements or hints like "high 
> >> mem" or "GPU" etc.
> >>
> >> Every user function has an associated environment, and environments exist 
> >> only for the purpose of executing user functions. In fact, Environment 
> >> originated as inline requirements/attributes for a user function proto 
> >> message and was separated just to make the proto smaller.
> >>
> >> A PTransform is an abstract concept for organizing the graph, not an 
> >> executable thing. So a hint/capability/requirement on a PTransform only 
> >> really makes sense as a scoping mechanism for applying a hint to a bunch 
> >> of functions within a subgraph. This seems like a user interface concern 
> >> and the SDK should own propagating the hints. If the hint truly applies to 
> >> the whole PTransform and *not* the parts, then I am interested in learning 
> >> about that.
> >>
> >> Kenn
> >>
> >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke  wrote:
> >>>
> >>> That's a good question.
> >>>
> >>> I think the main difference is a matter of scope. Annotations would apply 
> >>> to a PTransform 

Re: PTransform Annotations Proposal

2020-11-16 Thread Robert Bradshaw
I agree things like GPU, high-mem, etc. belong to the environment. If
annotations are truly advisory, one can imagine merging environments
by taking the union of annotations and still producing a correct
pipeline. (This would mean that annotations would have to be a
multi-map...)

On the other hand, this doesn't seem to handle the case of privacy
analysis, which could apply to composites without applying to any
individual component, and don't really make sense as part of a
fusion/execution story.

On Mon, Nov 16, 2020 at 4:00 PM Robert Burke  wrote:
>
> That's good historical context.
>
> But then we'd still need to codify the annotation would need to be optional, 
> and not affect correctness.
>
> Conflicts become easier to manage, (as environments with conflicting 
> annotations simply don't get merged, and stay as distinct environments) but 
> are still notionally annotation dependant. Do most runners handle 
> environments so individually though?
>
> Reuven's question is a good one though. For the Go SDK, and the proposed 
> implementation i saw, they only applied to leaf nodes. This is an artifact of 
> how the Go SDK handles composites. Nothing stops it from being implemented on 
> the composites Go has, but it didn't make sense otherwise. AFAICT Composites 
> are generally for organizational convenience and not for functional aspects. 
> Is this wrong? Afterall, does it make sense for environments to be on leaves 
> and composites either? It's the same issue there.
>
>
> On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles  wrote:
>>
>> I am +1 to the proposal but believe it should be moved to the Environment. I 
>> could be convinced otherwise, but would want to really understand the 
>> details.
>>
>> I think we haven't done a great job communicating the purpose of the 
>> Environment proto. It was explicitly created for this purpose.
>>
>> 1. It tells the runner things it needs to know to interpret the DoFn (or 
>> other UDF). So these are the existing proto fields like docker image (in the 
>> payload) and required artifacts that were staged.
>> 2. It is also the place for additional requirements or hints like "high mem" 
>> or "GPU" etc.
>>
>> Every user function has an associated environment, and environments exist 
>> only for the purpose of executing user functions. In fact, Environment 
>> originated as inline requirements/attributes for a user function proto 
>> message and was separated just to make the proto smaller.
>>
>> A PTransform is an abstract concept for organizing the graph, not an 
>> executable thing. So a hint/capability/requirement on a PTransform only 
>> really makes sense as a scoping mechanism for applying a hint to a bunch of 
>> functions within a subgraph. This seems like a user interface concern and 
>> the SDK should own propagating the hints. If the hint truly applies to the 
>> whole PTransform and *not* the parts, then I am interested in learning about 
>> that.
>>
>> Kenn
>>
>> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke  wrote:
>>>
>>> That's a good question.
>>>
>>> I think the main difference is a matter of scope. Annotations would apply 
>>> to a PTransform while an environment applies to sets of transforms. A 
>>> difference is the optional nature of the annotations they don't affect 
>>> correctness. Runners don't need to do anything with them and still execute 
>>> the pipeline correctly.
>>>
>>> Consider a privacy analysis on a pipeline graph. An annotation indicating 
>>> that a transform provides a certain level of anonymization can be used in 
>>> an analysis to determine if the downstream transforms are encountering raw 
>>> data or not.
>>>
>>> From my understanding (which can be wrong) environments are rigid. 
>>> Transforms in different environments can't be fused. "This is the python 
>>> env", "this is the java env" can't be merged together. It's not clear to me 
>>> that we have defined when environments are safely fuseable outside of 
>>> equality. There's value in that simplicity.
>>>
>>> AFIACT environment has less to do with the machines a pipeline is executing 
>>> on than it does about the kinds of SDK pipelines it understands and can 
>>> execute.
>>>
>>>
>>>
>>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova  wrote:
>
>
> Another example of an optional annotation is marking a transform to run 
> on secure hardware, or to give hints to profiling/dynamic analysis tools.


 There seems to be a lot of overlap between this idea and Environments.  
 Can you talk about how you feel they may be different or related?  For 
 example, I could see annotations as a way of tagging transforms with an 
 Environment, or I could see Environments becoming a specialized form of 
 annotation.

 -chad



Re: PTransform Annotations Proposal

2020-11-16 Thread Robert Burke
That's good historical context.

But then we'd still need to codify the annotation would need to be
optional, and not affect correctness.

Conflicts become easier to manage, (as environments with conflicting
annotations simply don't get merged, and stay as distinct environments) but
are still notionally annotation dependant. Do most runners handle
environments so individually though?

Reuven's question is a good one though. For the Go SDK, and the proposed
implementation i saw, they only applied to leaf nodes. This is an artifact
of how the Go SDK handles composites. Nothing stops it from being
implemented on the composites Go has, but it didn't make sense otherwise.
AFAICT Composites are generally for organizational convenience and not for
functional aspects. Is this wrong? Afterall, does it make sense for
environments to be on leaves and composites either? It's the same issue
there.


On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles  wrote:

> I am +1 to the proposal but believe it should be moved to the Environment.
> I could be convinced otherwise, but would want to really understand the
> details.
>
> I think we haven't done a great job communicating the purpose of the
> Environment proto. It was explicitly created for this purpose.
>
> 1. It tells the runner things it needs to know to interpret the DoFn (or
> other UDF). So these are the existing proto fields like docker image (in
> the payload) and required artifacts that were staged.
> 2. It is also the place for additional requirements or hints like "high
> mem" or "GPU" etc.
>
> Every user function has an associated environment, and environments exist
> only for the purpose of executing user functions. In fact, Environment
> originated as inline requirements/attributes for a user function proto
> message and was separated just to make the proto smaller.
>
> A PTransform is an abstract concept for organizing the graph, not an
> executable thing. So a hint/capability/requirement on a PTransform only
> really makes sense as a scoping mechanism for applying a hint to a bunch of
> functions within a subgraph. This seems like a user interface concern and
> the SDK should own propagating the hints. If the hint truly applies to the
> whole PTransform and *not* the parts, then I am interested in learning
> about that.
>
> Kenn
>
> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke  wrote:
>
>> That's a good question.
>>
>> I think the main difference is a matter of scope. Annotations would apply
>> to a PTransform while an environment applies to sets of transforms. A
>> difference is the optional nature of the annotations they don't affect
>> correctness. Runners don't need to do anything with them and still execute
>> the pipeline correctly.
>>
>> Consider a privacy analysis on a pipeline graph. An annotation indicating
>> that a transform provides a certain level of anonymization can be used in
>> an analysis to determine if the downstream transforms are encountering raw
>> data or not.
>>
>> From my understanding (which can be wrong) environments are rigid.
>> Transforms in different environments can't be fused. "This is the python
>> env", "this is the java env" can't be merged together. It's not clear to me
>> that we have defined when environments are safely fuseable outside of
>> equality. There's value in that simplicity.
>>
>> AFIACT environment has less to do with the machines a pipeline is
>> executing on than it does about the kinds of SDK pipelines it understands
>> and can execute.
>>
>>
>>
>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova  wrote:
>>
>>>
 Another example of an optional annotation is marking a transform to run
 on secure hardware, or to give hints to profiling/dynamic analysis tools.

>>>
>>> There seems to be a lot of overlap between this idea and Environments.
>>> Can you talk about how you feel they may be different or related?  For
>>> example, I could see annotations as a way of tagging transforms with an
>>> Environment, or I could see Environments becoming a specialized form of
>>> annotation.
>>>
>>> -chad
>>>
>>>


Re: PTransform Annotations Proposal

2020-11-16 Thread Kenneth Knowles
I am +1 to the proposal but believe it should be moved to the Environment.
I could be convinced otherwise, but would want to really understand the
details.

I think we haven't done a great job communicating the purpose of the
Environment proto. It was explicitly created for this purpose.

1. It tells the runner things it needs to know to interpret the DoFn (or
other UDF). So these are the existing proto fields like docker image (in
the payload) and required artifacts that were staged.
2. It is also the place for additional requirements or hints like "high
mem" or "GPU" etc.

Every user function has an associated environment, and environments exist
only for the purpose of executing user functions. In fact, Environment
originated as inline requirements/attributes for a user function proto
message and was separated just to make the proto smaller.

A PTransform is an abstract concept for organizing the graph, not an
executable thing. So a hint/capability/requirement on a PTransform only
really makes sense as a scoping mechanism for applying a hint to a bunch of
functions within a subgraph. This seems like a user interface concern and
the SDK should own propagating the hints. If the hint truly applies to the
whole PTransform and *not* the parts, then I am interested in learning
about that.

Kenn

On Mon, Nov 16, 2020 at 10:54 AM Robert Burke  wrote:

> That's a good question.
>
> I think the main difference is a matter of scope. Annotations would apply
> to a PTransform while an environment applies to sets of transforms. A
> difference is the optional nature of the annotations they don't affect
> correctness. Runners don't need to do anything with them and still execute
> the pipeline correctly.
>
> Consider a privacy analysis on a pipeline graph. An annotation indicating
> that a transform provides a certain level of anonymization can be used in
> an analysis to determine if the downstream transforms are encountering raw
> data or not.
>
> From my understanding (which can be wrong) environments are rigid.
> Transforms in different environments can't be fused. "This is the python
> env", "this is the java env" can't be merged together. It's not clear to me
> that we have defined when environments are safely fuseable outside of
> equality. There's value in that simplicity.
>
> AFIACT environment has less to do with the machines a pipeline is
> executing on than it does about the kinds of SDK pipelines it understands
> and can execute.
>
>
>
> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova  wrote:
>
>>
>>> Another example of an optional annotation is marking a transform to run
>>> on secure hardware, or to give hints to profiling/dynamic analysis tools.
>>>
>>
>> There seems to be a lot of overlap between this idea and Environments.
>> Can you talk about how you feel they may be different or related?  For
>> example, I could see annotations as a way of tagging transforms with an
>> Environment, or I could see Environments becoming a specialized form of
>> annotation.
>>
>> -chad
>>
>>


Re: PTransform Annotations Proposal

2020-11-16 Thread Jan Lukavský
Minor correction, the CoGBK broadcast vs. full shuffle is probably not 
ideal example, because it still requires grouping the larger PCollection 
(if not already grouped). If we take Join PTransform that acts on 
cartesian product of these groups, then it works well.


Jan

On 11/16/20 8:39 PM, Jan Lukavský wrote:


Hi,

could this proposal be generalized to annotations of PCollections as 
well? Maybe that reduces to several types of annotations of a 
PTransform - e.g.


 a) runtime annotations of a PTransform (that might be scheduling 
hints - i.e. schedule this task to nodes with GPUs, etc.)


 b) output annotations - i.e. annotations that actually apply to 
PCollections, as every PCollection has at most one producer (this is 
what have been actually discussed in the referenced mailing list threads)


It would be cool, if this added option to do PTransform expansions 
based on annotations of input PCollections. We tried to play with this 
in Euphoria DSL, but it turned out it would be best fitted in Beam SDK.


Example of input annotation sensitive expansion might be CoGBK, when 
one side is annotated i.e. FitsInMemoryPerWindow (or SmallPerWindow, 
or whatever), then CoGBK might be expanded using broadcast instead of 
full shuffle.


Absolutely agree that all this must not have anything to do with 
semantics and correctness, thus might be safely ignored, and that 
might answer the last question of @Reuven, when there are conflicting 
annotations, it would be possible to simple drop them as a last resort.


Jan

On 11/16/20 8:13 PM, Robert Burke wrote:
I imagine it has everything to do with the specific annotation to 
define that.


The runner notionally doesn't need to do anything with them, as they 
are optional, and not required for correctness.


On Mon, Nov 16, 2020, 10:56 AM Reuven Lax > wrote:


PTransforms are hierarchical - namely a PTransform contains other
PTransforms, and so on. Is the runner expected to resolve all
annotations down to leaf nodes? What happens if that results in
conflicting annotations?

On Mon, Nov 16, 2020 at 10:54 AM Robert Burke mailto:rob...@frantil.com>> wrote:

That's a good question.

I think the main difference is a matter of scope. Annotations
would apply to a PTransform while an environment applies to
sets of transforms. A difference is the optional nature of
the annotations they don't affect correctness. Runners don't
need to do anything with them and still execute the pipeline
correctly.

Consider a privacy analysis on a pipeline graph. An
annotation indicating that a transform provides a certain
level of anonymization can be used in an analysis to
determine if the downstream transforms are encountering raw
data or not.

From my understanding (which can be wrong) environments are
rigid. Transforms in different environments can't be fused.
"This is the python env", "this is the java env" can't be
merged together. It's not clear to me that we have defined
when environments are safely fuseable outside of equality.
There's value in that simplicity.

AFIACT environment has less to do with the machines a
pipeline is executing on than it does about the kinds of SDK
pipelines it understands and can execute.



On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova
mailto:chad...@gmail.com>> wrote:


Another example of an optional annotation is marking
a transform to run on secure hardware, or to give
hints to profiling/dynamic analysis tools.


There seems to be a lot of overlap between this idea and
Environments.  Can you talk about how you feel they may
be different or related?  For example, I could see
annotations as a way of tagging transforms with an
Environment, or I could see Environments becoming a
specialized form of annotation.

-chad



Re: PTransform Annotations Proposal

2020-11-16 Thread Jan Lukavský

Hi,

could this proposal be generalized to annotations of PCollections as 
well? Maybe that reduces to several types of annotations of a PTransform 
- e.g.


 a) runtime annotations of a PTransform (that might be scheduling hints 
- i.e. schedule this task to nodes with GPUs, etc.)


 b) output annotations - i.e. annotations that actually apply to 
PCollections, as every PCollection has at most one producer (this is 
what have been actually discussed in the referenced mailing list threads)


It would be cool, if this added option to do PTransform expansions based 
on annotations of input PCollections. We tried to play with this in 
Euphoria DSL, but it turned out it would be best fitted in Beam SDK.


Example of input annotation sensitive expansion might be CoGBK, when one 
side is annotated i.e. FitsInMemoryPerWindow (or SmallPerWindow, or 
whatever), then CoGBK might be expanded using broadcast instead of full 
shuffle.


Absolutely agree that all this must not have anything to do with 
semantics and correctness, thus might be safely ignored, and that might 
answer the last question of @Reuven, when there are conflicting 
annotations, it would be possible to simple drop them as a last resort.


Jan

On 11/16/20 8:13 PM, Robert Burke wrote:
I imagine it has everything to do with the specific annotation to 
define that.


The runner notionally doesn't need to do anything with them, as they 
are optional, and not required for correctness.


On Mon, Nov 16, 2020, 10:56 AM Reuven Lax > wrote:


PTransforms are hierarchical - namely a PTransform contains other
PTransforms, and so on. Is the runner expected to resolve all
annotations down to leaf nodes? What happens if that results in
conflicting annotations?

On Mon, Nov 16, 2020 at 10:54 AM Robert Burke mailto:rob...@frantil.com>> wrote:

That's a good question.

I think the main difference is a matter of scope. Annotations
would apply to a PTransform while an environment applies to
sets of transforms. A difference is the optional nature of the
annotations they don't affect correctness. Runners don't need
to do anything with them and still execute the pipeline
correctly.

Consider a privacy analysis on a pipeline graph. An annotation
indicating that a transform provides a certain level of
anonymization can be used in an analysis to determine if the
downstream transforms are encountering raw data or not.

From my understanding (which can be wrong) environments are
rigid. Transforms in different environments can't be fused.
"This is the python env", "this is the java env" can't be
merged together. It's not clear to me that we have defined
when environments are safely fuseable outside of equality.
There's value in that simplicity.

AFIACT environment has less to do with the machines a pipeline
is executing on than it does about the kinds of SDK pipelines
it understands and can execute.



On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova
mailto:chad...@gmail.com>> wrote:


Another example of an optional annotation is marking a
transform to run on secure hardware, or to give hints
to profiling/dynamic analysis tools.


There seems to be a lot of overlap between this idea and
Environments.  Can you talk about how you feel they may be
different or related? For example, I could see annotations
as a way of tagging transforms with an Environment, or I
could see Environments becoming a specialized form of
annotation.

-chad



Re: PTransform Annotations Proposal

2020-11-16 Thread Robert Burke
I imagine it has everything to do with the specific annotation to define
that.

The runner notionally doesn't need to do anything with them, as they are
optional, and not required for correctness.

On Mon, Nov 16, 2020, 10:56 AM Reuven Lax  wrote:

> PTransforms are hierarchical - namely a PTransform contains other
> PTransforms, and so on. Is the runner expected to resolve all annotations
> down to leaf nodes? What happens if that results in conflicting annotations?
>
> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke  wrote:
>
>> That's a good question.
>>
>> I think the main difference is a matter of scope. Annotations would apply
>> to a PTransform while an environment applies to sets of transforms. A
>> difference is the optional nature of the annotations they don't affect
>> correctness. Runners don't need to do anything with them and still execute
>> the pipeline correctly.
>>
>> Consider a privacy analysis on a pipeline graph. An annotation indicating
>> that a transform provides a certain level of anonymization can be used in
>> an analysis to determine if the downstream transforms are encountering raw
>> data or not.
>>
>> From my understanding (which can be wrong) environments are rigid.
>> Transforms in different environments can't be fused. "This is the python
>> env", "this is the java env" can't be merged together. It's not clear to me
>> that we have defined when environments are safely fuseable outside of
>> equality. There's value in that simplicity.
>>
>> AFIACT environment has less to do with the machines a pipeline is
>> executing on than it does about the kinds of SDK pipelines it understands
>> and can execute.
>>
>>
>>
>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova  wrote:
>>
>>>
 Another example of an optional annotation is marking a transform to run
 on secure hardware, or to give hints to profiling/dynamic analysis tools.

>>>
>>> There seems to be a lot of overlap between this idea and Environments.
>>> Can you talk about how you feel they may be different or related?  For
>>> example, I could see annotations as a way of tagging transforms with an
>>> Environment, or I could see Environments becoming a specialized form of
>>> annotation.
>>>
>>> -chad
>>>
>>>


Re: PTransform Annotations Proposal

2020-11-16 Thread Reuven Lax
PTransforms are hierarchical - namely a PTransform contains other
PTransforms, and so on. Is the runner expected to resolve all annotations
down to leaf nodes? What happens if that results in conflicting annotations?

On Mon, Nov 16, 2020 at 10:54 AM Robert Burke  wrote:

> That's a good question.
>
> I think the main difference is a matter of scope. Annotations would apply
> to a PTransform while an environment applies to sets of transforms. A
> difference is the optional nature of the annotations they don't affect
> correctness. Runners don't need to do anything with them and still execute
> the pipeline correctly.
>
> Consider a privacy analysis on a pipeline graph. An annotation indicating
> that a transform provides a certain level of anonymization can be used in
> an analysis to determine if the downstream transforms are encountering raw
> data or not.
>
> From my understanding (which can be wrong) environments are rigid.
> Transforms in different environments can't be fused. "This is the python
> env", "this is the java env" can't be merged together. It's not clear to me
> that we have defined when environments are safely fuseable outside of
> equality. There's value in that simplicity.
>
> AFIACT environment has less to do with the machines a pipeline is
> executing on than it does about the kinds of SDK pipelines it understands
> and can execute.
>
>
>
> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova  wrote:
>
>>
>>> Another example of an optional annotation is marking a transform to run
>>> on secure hardware, or to give hints to profiling/dynamic analysis tools.
>>>
>>
>> There seems to be a lot of overlap between this idea and Environments.
>> Can you talk about how you feel they may be different or related?  For
>> example, I could see annotations as a way of tagging transforms with an
>> Environment, or I could see Environments becoming a specialized form of
>> annotation.
>>
>> -chad
>>
>>


Re: PTransform Annotations Proposal

2020-11-16 Thread Robert Burke
That's a good question.

I think the main difference is a matter of scope. Annotations would apply
to a PTransform while an environment applies to sets of transforms. A
difference is the optional nature of the annotations they don't affect
correctness. Runners don't need to do anything with them and still execute
the pipeline correctly.

Consider a privacy analysis on a pipeline graph. An annotation indicating
that a transform provides a certain level of anonymization can be used in
an analysis to determine if the downstream transforms are encountering raw
data or not.

>From my understanding (which can be wrong) environments are rigid.
Transforms in different environments can't be fused. "This is the python
env", "this is the java env" can't be merged together. It's not clear to me
that we have defined when environments are safely fuseable outside of
equality. There's value in that simplicity.

AFIACT environment has less to do with the machines a pipeline is executing
on than it does about the kinds of SDK pipelines it understands and can
execute.



On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova  wrote:

>
>> Another example of an optional annotation is marking a transform to run
>> on secure hardware, or to give hints to profiling/dynamic analysis tools.
>>
>
> There seems to be a lot of overlap between this idea and Environments.
> Can you talk about how you feel they may be different or related?  For
> example, I could see annotations as a way of tagging transforms with an
> Environment, or I could see Environments becoming a specialized form of
> annotation.
>
> -chad
>
>


Re: PTransform Annotations Proposal

2020-11-16 Thread Chad Dombrova
>
>
> Another example of an optional annotation is marking a transform to run on
> secure hardware, or to give hints to profiling/dynamic analysis tools.
>

There seems to be a lot of overlap between this idea and Environments.  Can
you talk about how you feel they may be different or related?  For example,
I could see annotations as a way of tagging transforms with an Environment,
or I could see Environments becoming a specialized form of annotation.

-chad


Re: PTransform Annotations Proposal

2020-11-16 Thread Reza Ardeshir Rokni
+1 having a NeedsRam(x) annotation would be incredibly helpful.

On Fri, 13 Nov 2020 at 05:57, Robert Burke  wrote:

> (Disclaimer, Mirac and their team did approach me about this beforehand as
> their interest is in the Go SDK.)
>
> +1 I think it's a good idea. As you've pointed out, there are many
> opportunities for optional pipeline analysis here as well.
>
> A strawman counter point would be to re-used the static DisplayData for
> this kind of thing, but I think that's not necessarily the same thing. It's
> very hard to get something that's purely intended for Human consumption to
> also be suitable for machine consumption, without various adapters and
> such, and it would be an awful hack. Having something specifically for
> Machines to understand is valuable in and of itself.
>
> I appreciate the versatility of simply using known URNs and their defined
> formats, and especially keeping the proposal to optional annotations that
> don't affect correctness. This will work well with most DoFns that need
> specialized hardware. They can usually be emulated on ordinary CPUs, which
> is good for testing, but can perform much better if the hardware is
> available. This also allows the runners to move execution of specific DoFns
> to the machines with the specialized hardware, for better scheduling of
> resources.
>
> I look forward to the PR, and before then, all the discussion the
> community has about this new field in the model proto.
>
>
>
>
>
> On Thu, 12 Nov 2020 at 09:41, Mirac Vuslat Basaran 
> wrote:
>
>> Hi all,
>>
>> We would like to propose adding functionality to add annotations to Beam
>> transforms. These annotations would be readable by the runner, and the
>> runner could then act on this information; for example by doing some
>> special resource allocation. There have been discussions around annotations
>> (or hints as they are sometimes called) in the past (
>> https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
>>
>> https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
>> This proposal aims to come up with an accepted lightweight solution with a
>> follow-up Pull Request to implement it in Go.
>>
>> By annotations, we refer to optional information / hints provided to the
>> runner. This proposal explicitly excludes “required” annotations that could
>> cause incorrect output. A runner that does not understand the annotations
>> and ignores them must still produce correct output, with perhaps a
>> degradation in performance or other nonfunctional requirements. Supporting
>> only “optional” annotations allows for compatibility with runners that do
>> not recognize those annotations.
>>
>> A good example of an optional annotation is marking a transform to be run
>> on GPU or TPU or that it needs a certain amount of RAM. If the runner knows
>> about this annotation, it can then allocate the requested resources for
>> that transform only to improve performance and avoid using these scarce
>> resources for other transforms.
>>
>> Another example of an optional annotation is marking a transform to run
>> on secure hardware, or to give hints to profiling/dynamic analysis tools.
>>
>> In all these cases, the runner can run the pipeline with or without the
>> annotation, and in both cases the same output would be produced. There
>> would be differences in nonfunctional requirements (performance, security,
>> ease of profiling), hence the optional part.
>>
>> A counter-example that this proposal explicitly excludes is marking a
>> transform as requiring sorted input. For example, on a transform that
>> expects time-sorted input in order to produce the correct output. If the
>> runner ignores this requirement, it would risk producing an incorrect
>> output. In order to avoid this, we exclude these required annotations.
>>
>> Implementation-wise, we propose to add a field:
>>  - map annotations = 8;
>> to PTransform proto (
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
>> The key would be a URN that uniquely identifies the type of annotation. The
>> value is an opaque byte array (e.g., a serialized protocol buffer) to allow
>> for maximum flexibility to the implementation of that specific type of
>> annotation.
>>
>> We have a specific interest in adding this to the Go SDK. In Go, the user
>> would specify the annotations to a structural ParDo as follows, by defining
>> a field:
>>  - Annotations map[string][]byte
>> and filling it out. For simplicity, we will only support structural doFns
>> in Go for the time being.
>>
>> The runners could then read the annotations from the PTransform proto and
>> support the annotations that they would like to in the way they want.
>>
>> Please let me know what you think, and what would be the best way to
>> proceed, e.g., we can share a small design doc or, in case 

Re: PTransform Annotations Proposal

2020-11-12 Thread Robert Burke
(Disclaimer, Mirac and their team did approach me about this beforehand as
their interest is in the Go SDK.)

+1 I think it's a good idea. As you've pointed out, there are many
opportunities for optional pipeline analysis here as well.

A strawman counter point would be to re-used the static DisplayData for
this kind of thing, but I think that's not necessarily the same thing. It's
very hard to get something that's purely intended for Human consumption to
also be suitable for machine consumption, without various adapters and
such, and it would be an awful hack. Having something specifically for
Machines to understand is valuable in and of itself.

I appreciate the versatility of simply using known URNs and their defined
formats, and especially keeping the proposal to optional annotations that
don't affect correctness. This will work well with most DoFns that need
specialized hardware. They can usually be emulated on ordinary CPUs, which
is good for testing, but can perform much better if the hardware is
available. This also allows the runners to move execution of specific DoFns
to the machines with the specialized hardware, for better scheduling of
resources.

I look forward to the PR, and before then, all the discussion the community
has about this new field in the model proto.





On Thu, 12 Nov 2020 at 09:41, Mirac Vuslat Basaran 
wrote:

> Hi all,
>
> We would like to propose adding functionality to add annotations to Beam
> transforms. These annotations would be readable by the runner, and the
> runner could then act on this information; for example by doing some
> special resource allocation. There have been discussions around annotations
> (or hints as they are sometimes called) in the past (
> https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
>
> https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
> This proposal aims to come up with an accepted lightweight solution with a
> follow-up Pull Request to implement it in Go.
>
> By annotations, we refer to optional information / hints provided to the
> runner. This proposal explicitly excludes “required” annotations that could
> cause incorrect output. A runner that does not understand the annotations
> and ignores them must still produce correct output, with perhaps a
> degradation in performance or other nonfunctional requirements. Supporting
> only “optional” annotations allows for compatibility with runners that do
> not recognize those annotations.
>
> A good example of an optional annotation is marking a transform to be run
> on GPU or TPU or that it needs a certain amount of RAM. If the runner knows
> about this annotation, it can then allocate the requested resources for
> that transform only to improve performance and avoid using these scarce
> resources for other transforms.
>
> Another example of an optional annotation is marking a transform to run on
> secure hardware, or to give hints to profiling/dynamic analysis tools.
>
> In all these cases, the runner can run the pipeline with or without the
> annotation, and in both cases the same output would be produced. There
> would be differences in nonfunctional requirements (performance, security,
> ease of profiling), hence the optional part.
>
> A counter-example that this proposal explicitly excludes is marking a
> transform as requiring sorted input. For example, on a transform that
> expects time-sorted input in order to produce the correct output. If the
> runner ignores this requirement, it would risk producing an incorrect
> output. In order to avoid this, we exclude these required annotations.
>
> Implementation-wise, we propose to add a field:
>  - map annotations = 8;
> to PTransform proto (
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
> The key would be a URN that uniquely identifies the type of annotation. The
> value is an opaque byte array (e.g., a serialized protocol buffer) to allow
> for maximum flexibility to the implementation of that specific type of
> annotation.
>
> We have a specific interest in adding this to the Go SDK. In Go, the user
> would specify the annotations to a structural ParDo as follows, by defining
> a field:
>  - Annotations map[string][]byte
> and filling it out. For simplicity, we will only support structural doFns
> in Go for the time being.
>
> The runners could then read the annotations from the PTransform proto and
> support the annotations that they would like to in the way they want.
>
> Please let me know what you think, and what would be the best way to
> proceed, e.g., we can share a small design doc or, in case there are no
> major objections, directly create a pull request for Go where we can
> discuss the implementation details.
>
> Best,
> Mirac and team
>


PTransform Annotations Proposal

2020-11-12 Thread Mirac Vuslat Basaran
Hi all,

We would like to propose adding functionality to add annotations to Beam
transforms. These annotations would be readable by the runner, and the
runner could then act on this information; for example by doing some
special resource allocation. There have been discussions around annotations
(or hints as they are sometimes called) in the past (
https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
This proposal aims to come up with an accepted lightweight solution with a
follow-up Pull Request to implement it in Go.

By annotations, we refer to optional information / hints provided to the
runner. This proposal explicitly excludes “required” annotations that could
cause incorrect output. A runner that does not understand the annotations
and ignores them must still produce correct output, with perhaps a
degradation in performance or other nonfunctional requirements. Supporting
only “optional” annotations allows for compatibility with runners that do
not recognize those annotations.

A good example of an optional annotation is marking a transform to be run
on GPU or TPU or that it needs a certain amount of RAM. If the runner knows
about this annotation, it can then allocate the requested resources for
that transform only to improve performance and avoid using these scarce
resources for other transforms.

Another example of an optional annotation is marking a transform to run on
secure hardware, or to give hints to profiling/dynamic analysis tools.

In all these cases, the runner can run the pipeline with or without the
annotation, and in both cases the same output would be produced. There
would be differences in nonfunctional requirements (performance, security,
ease of profiling), hence the optional part.

A counter-example that this proposal explicitly excludes is marking a
transform as requiring sorted input. For example, on a transform that
expects time-sorted input in order to produce the correct output. If the
runner ignores this requirement, it would risk producing an incorrect
output. In order to avoid this, we exclude these required annotations.

Implementation-wise, we propose to add a field:
 - map annotations = 8;
to PTransform proto (
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
The key would be a URN that uniquely identifies the type of annotation. The
value is an opaque byte array (e.g., a serialized protocol buffer) to allow
for maximum flexibility to the implementation of that specific type of
annotation.

We have a specific interest in adding this to the Go SDK. In Go, the user
would specify the annotations to a structural ParDo as follows, by defining
a field:
 - Annotations map[string][]byte
and filling it out. For simplicity, we will only support structural doFns
in Go for the time being.

The runners could then read the annotations from the PTransform proto and
support the annotations that they would like to in the way they want.

Please let me know what you think, and what would be the best way to
proceed, e.g., we can share a small design doc or, in case there are no
major objections, directly create a pull request for Go where we can
discuss the implementation details.

Best,
Mirac and team