Re: PTransform Annotations Proposal

2021-01-13 Thread Mirac Vuslat Basaran
Wrote a design draft for resource-related annotations. Please have a look: 
https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing

Cheers,
Mirac
On 2020/11/26 20:20:09, Mirac Vuslat Basaran  wrote: 
> Created a PR without unit tests at https://github.com/apache/beam/pull/13434. 
> Please have a look.
> 
> Thanks,
> Mirac
> 
> 
> On 2020/11/25 18:50:19, Robert Burke  wrote: 
> > Hmmm. Fair. I'm mostly concerned about the pathological case where we end
> > up with a distinct Environment per transform, but there are likely
> > practical cases where that's reasonable (High mem to GPU to TPU, to ARM)
> > 
> > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw  wrote:
> > 
> > > I'd like to continue the discussion *and* see an implementation for
> > > the part we've settled on. I was asking why not have "every distinct
> > > physical concern means a distinct environment?"
> > >
> > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke  wrote:
> > > >
> > > > Mostly because perfect is the enemy of good enough. We have a proposal,
> > > we have clear boundaries for it. It's fine if the discussion continues, 
> > > but
> > > I see no evidence of concerns that should prevent starting an
> > > implementation, because it seems we'll need both anyway.
> > > >
> > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw 
> > > wrote:
> > > >>
> > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke 
> > > wrote:
> > > >> >
> > > >> > It sounds like we've come to the position that non-correctness
> > > affecting Ptransform Annotations are valuable at both leaf and composite
> > > levels, and don't remove the potential need for a similar feature on
> > > Environments, to handle physical concerns equirements for worker processes
> > > to have (such as Ram, CPU, or GPU requirements.)
> > > >> >
> > > >> > Kenn, it's not clear what part of the solution (an annotation field
> > > on the Ptransform proto message) would need to change to satisfy your 
> > > scope
> > > concern, beyond documenting unambiguously that these may not be used for
> > > physical concerns or things that affect correctness.
> > > >>
> > > >> I'll let Kenn answer as well, but from my point of view, explicitly
> > > >> having somewhere better to put these things would help.
> > > >>
> > > >> > I'm also unclear your scope concern not matching, given the above.
> > > Your first paragraph reads very supportive of logical annotations on
> > > Ptransforms, and that matches 1-1 with the current proposed solution. Can
> > > you clarify your concern?
> > > >> >
> > > >> > I don't wish to scope creep on the physical requirements issue at
> > > this time. It seems we are agreed they should end up on environments, but
> > > I'm not seeing proposals on the right way to execute them at this 
> > > time.They
> > > seem to be a fruitful topic of discussion, in particular
> > > unifying/consolidating them for efficient use of resources. I don't think
> > > we want to end up in a state where every distinct physical concern means a
> > > distinct environment.
> > > >>
> > > >> Why not? Assuming, of course, that runners are free to merge
> > > >> environments (merging those resource hints they understand and are
> > > >> otherwise compatible, and discarding those they don't) for efficient
> > > >> execution.
> > > >>
> > > >> > I for one am ready to see a PR.
> > > >>
> > > >> +1
> > > >>
> > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> > > wrote:
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> > > >> >> 
> > > wrote:
> > > >> >>>
> > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > mir...@google.com> wrote:
> > > >> >>> >
> > > >> >>> > Thanks everyone so much for their input and for the insightful
> > > discussion.
> > > >> >>> >
> > > >> >>> > Not being knowledgeable about Beam's internals, I have t

Re: PTransform Annotations Proposal

2020-11-30 Thread Mirac Vuslat Basaran
Created a PR without unit tests at https://github.com/apache/beam/pull/13434. 
Please have a look.

Thanks,
Mirac


On 2020/11/25 18:50:19, Robert Burke  wrote: 
> Hmmm. Fair. I'm mostly concerned about the pathological case where we end
> up with a distinct Environment per transform, but there are likely
> practical cases where that's reasonable (High mem to GPU to TPU, to ARM)
> 
> On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw  wrote:
> 
> > I'd like to continue the discussion *and* see an implementation for
> > the part we've settled on. I was asking why not have "every distinct
> > physical concern means a distinct environment?"
> >
> > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke  wrote:
> > >
> > > Mostly because perfect is the enemy of good enough. We have a proposal,
> > we have clear boundaries for it. It's fine if the discussion continues, but
> > I see no evidence of concerns that should prevent starting an
> > implementation, because it seems we'll need both anyway.
> > >
> > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw 
> > wrote:
> > >>
> > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke 
> > wrote:
> > >> >
> > >> > It sounds like we've come to the position that non-correctness
> > affecting Ptransform Annotations are valuable at both leaf and composite
> > levels, and don't remove the potential need for a similar feature on
> > Environments, to handle physical concerns equirements for worker processes
> > to have (such as Ram, CPU, or GPU requirements.)
> > >> >
> > >> > Kenn, it's not clear what part of the solution (an annotation field
> > on the Ptransform proto message) would need to change to satisfy your scope
> > concern, beyond documenting unambiguously that these may not be used for
> > physical concerns or things that affect correctness.
> > >>
> > >> I'll let Kenn answer as well, but from my point of view, explicitly
> > >> having somewhere better to put these things would help.
> > >>
> > >> > I'm also unclear your scope concern not matching, given the above.
> > Your first paragraph reads very supportive of logical annotations on
> > Ptransforms, and that matches 1-1 with the current proposed solution. Can
> > you clarify your concern?
> > >> >
> > >> > I don't wish to scope creep on the physical requirements issue at
> > this time. It seems we are agreed they should end up on environments, but
> > I'm not seeing proposals on the right way to execute them at this time.They
> > seem to be a fruitful topic of discussion, in particular
> > unifying/consolidating them for efficient use of resources. I don't think
> > we want to end up in a state where every distinct physical concern means a
> > distinct environment.
> > >>
> > >> Why not? Assuming, of course, that runners are free to merge
> > >> environments (merging those resource hints they understand and are
> > >> otherwise compatible, and discarding those they don't) for efficient
> > >> execution.
> > >>
> > >> > I for one am ready to see a PR.
> > >>
> > >> +1
> > >>
> > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles 
> > wrote:
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw 
> > wrote:
> > >> >>>
> > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > mir...@google.com> wrote:
> > >> >>> >
> > >> >>> > Thanks everyone so much for their input and for the insightful
> > discussion.
> > >> >>> >
> > >> >>> > Not being knowledgeable about Beam's internals, I have to say I
> > am a bit lost on the PTransform vs. environment discussion.
> > >> >>> >
> > >> >>> > I do agree with Burke's notion that merge rules are very
> > annotation dependent, I don't think we can find a one-size-fits-all
> > solution for that. So this might be actually be an argument in favour of
> > having annotations on PTransforms, since it avoids the conflation with
> > environments.
> > >> >>> >
> > >> >>> > Also in general, I feel that having annotations per single
> > transform (rather than composite) and on PTransforms could lead to a
> > simpler design.
> > >> >>&g

Re: PTransform Annotations Proposal

2020-11-20 Thread Mirac Vuslat Basaran
Thanks everyone so much for their input and for the insightful discussion.

Not being knowledgeable about Beam's internals, I have to say I am a bit lost 
on the PTransform vs. environment discussion.

I do agree with Burke's notion that merge rules are very annotation dependent, 
I don't think we can find a one-size-fits-all solution for that. So this might 
be actually be an argument in favour of having annotations on PTransforms, 
since it avoids the conflation with environments.

Also in general, I feel that having annotations per single transform (rather 
than composite) and on PTransforms could lead to a simpler design.

Seeing as there are valuable arguments in favour of both (PTransform and 
environments) with no clear(?) "best solution", I would propose moving forward 
with the initial (PTransform) design to ship the feature and unblock teams 
asking for it. If it turns out that there was indeed a need to have annotations 
in environments, we could always refactor it.

On 2020/11/17 19:07:22, Robert Bradshaw  wrote: 
> So far we have two distinct usecases for annotations: resource hints
> and privacy directives, and I've been trying to figure out how to
> reconcile them, but they seem to have very different characteristics.
> (It would be nice to come up with other uses as well to see if we're
> really coming up with a generally useful mode--I think display data
> could fit into this as a new kind of annotation rather than being a
> top-level property, and it could make sense on both leaf and composite
> transforms.)
> 
> To me, resource hints like GPU are inextricably tied to the
> environment. A transform tagged with GPU should reference a Fn that
> invokes GPU-accelerated code that lives in a particular environment.
> Something like high-mem is a bit squishier. Some DoFns take a lot of
> memory, but on the other hand one could imagine labeling a CoGBK as
> high-mem due to knowing that, in this particular usage, there will be
> lots of values with the same key. Ideally runners would be intelligent
> enough to automatically learn memory usage, but even in this case it
> may be a good hint to try and learn the requirements for DoFn A and
> DoFn B separately (which is difficult if they are always colocated,
> but valuable if, e.g. A takes a huge amount of memory and B takes a
> huge amount of wall time).
> 
> Note that tying things to the environment does not preclude using them
> in non-portable runners as they'll still have an SDK-level
> representation (though I don't think we should have an explicit goal
> of feature parity for non-portable runners, e.g. multi-language isn't
> happening, and hope that non-portable runners go away soon anyway).
> 
> Now let's consider privacy annotations. To make things very concrete,
> imagine a transform AverageSpendPerZipCode which takes as input (user,
> zip, spend), all users unique, and returns (zip, avg(spend)). In
> Python, this is GroupBy('zip').aggregate_field('spend',
> MeanCombineFn()). This is not very privacy preserving to those users
> who are the only (or one of a few) in a zip code. So we could define a
> transform PrivacyPreservingAverageSpendPerZipCode as
> 
> @ptransform_fn
> def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold)
> counts_per_zip = spend_per_user |
> GroupBy('zip').aggregate_field('user', CountCombineFn())
> spend_per_zip = spend_per_user |
> GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> filtered = spend_per_zip | beam.Filter(
> lambda x, counts: counts[x.zip] > threshold,
> counts=AsMap(counts_per_zip))
> return filtered
> 
> We now have a composite that has privacy preserving properties (i.e.
> the input may be quite sensitive, but the output is not, depending on
> the value of threshold). What is interesting here is that it is only
> the composite that has this property--no individual sub-transform is
> itself privacy preserving. Furthermore, an optimizer may notice we're
> doing aggregation on the same key twice and rewrite this using
> (logically)
> 
> GroupBy('zip').aggregate_field('user',
> CountCombineFn()).aggregate_field('spend', MeanCombineFn())
> 
> and then applying the filter, which is semantically equivalent and
> satisfies the privacy annotations (and notably that does not even
> require the optimizer to interpret the annotations, just pass them
> on). To me, this implies that these annotations belong on the
> composites, and *not* on the leaf nodes (where they would be
> incorrect).
> 
> I'll leave aside most questions of API until we figure out the model
> semantics, but wanted to throw one possible idea out (though I am
> ambivalent about it). Instead of attaching things to transforms, we
> can just wrap transforms in composites that have no role other than
> declaring information about their contents. E.g. we could have a
> composite transform whose payload is simply an assertion of the
> privacy (or resource?) properties of its inner structure. 

PTransform Annotations Proposal

2020-11-12 Thread Mirac Vuslat Basaran
Hi all,

We would like to propose adding functionality to add annotations to Beam
transforms. These annotations would be readable by the runner, and the
runner could then act on this information; for example by doing some
special resource allocation. There have been discussions around annotations
(or hints as they are sometimes called) in the past (
https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E,
https://lists.apache.org/thread.html/fc090d8acd96c4cf2d23071b5d99f538165d3ff7fbe6f65297655309%40%3Cdev.beam.apache.org%3E).
This proposal aims to come up with an accepted lightweight solution with a
follow-up Pull Request to implement it in Go.

By annotations, we refer to optional information / hints provided to the
runner. This proposal explicitly excludes “required” annotations that could
cause incorrect output. A runner that does not understand the annotations
and ignores them must still produce correct output, with perhaps a
degradation in performance or other nonfunctional requirements. Supporting
only “optional” annotations allows for compatibility with runners that do
not recognize those annotations.

A good example of an optional annotation is marking a transform to be run
on GPU or TPU or that it needs a certain amount of RAM. If the runner knows
about this annotation, it can then allocate the requested resources for
that transform only to improve performance and avoid using these scarce
resources for other transforms.

Another example of an optional annotation is marking a transform to run on
secure hardware, or to give hints to profiling/dynamic analysis tools.

In all these cases, the runner can run the pipeline with or without the
annotation, and in both cases the same output would be produced. There
would be differences in nonfunctional requirements (performance, security,
ease of profiling), hence the optional part.

A counter-example that this proposal explicitly excludes is marking a
transform as requiring sorted input. For example, on a transform that
expects time-sorted input in order to produce the correct output. If the
runner ignores this requirement, it would risk producing an incorrect
output. In order to avoid this, we exclude these required annotations.

Implementation-wise, we propose to add a field:
 - map annotations = 8;
to PTransform proto (
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L127).
The key would be a URN that uniquely identifies the type of annotation. The
value is an opaque byte array (e.g., a serialized protocol buffer) to allow
for maximum flexibility to the implementation of that specific type of
annotation.

We have a specific interest in adding this to the Go SDK. In Go, the user
would specify the annotations to a structural ParDo as follows, by defining
a field:
 - Annotations map[string][]byte
and filling it out. For simplicity, we will only support structural doFns
in Go for the time being.

The runners could then read the annotations from the PTransform proto and
support the annotations that they would like to in the way they want.

Please let me know what you think, and what would be the best way to
proceed, e.g., we can share a small design doc or, in case there are no
major objections, directly create a pull request for Go where we can
discuss the implementation details.

Best,
Mirac and team