Re: [YAML] Aggregations

Byron Ellis via dev Thu, 19 Oct 2023 09:57:18 -0700

On Thu, Oct 19, 2023 at 9:28 AM Robert Bradshaw <rober...@google.com> wrote:


> On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> wrote:
> >
> > Rill is definitely SQL-oriented but I think that's going to be the most
> > common. Dataframes are explicitly modeled on the relational approach so
> > that's going to look a lot like SQL,
>
> I think pretty much any approach that fits here is going to be
> relational, meaning you choose a set of columns to group on, a set of
> columns to aggregate, and how to aggregate. The big open question is
> what syntax to use for the "how."
>
> Dataframe aggregation is probably a good example to look at. Here we
> have pandas and R in particular as concrete instances. It should also
> be easy to support different aggregations over different (or the same)
> columns. Pandas can take a list of (or mapping to) functions in its
> groupby().agg(). R doesn't seem to make this very easy...
>

In R it's pretty similar if you're using "the tidyverse" (a colloquial name
for the R packages on which so far as I know pandas is based). Using dplyr
it would be something along the lines of

table %>% group_by(ColumnName) %>% summarize(OtherColumnAvg = mean(OtherColumn))

(R is much more lazily evaluated than Python so the "mean(OtherColumn)" is
evaluated in the context of "table" by summarize. You are not wrong in
thinking "this is difficult to compile" :-)
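
For comparison, here is a rough plain-Python sketch of what that dplyr call
computes. The helper name summarize_by and the sample rows are made up for
illustration. The aggregation is passed as a (column, function) pair, similar
in spirit to pandas' named aggregation, so it is only applied once the rows of
each group are known.

```python
from collections import defaultdict
from statistics import mean


def summarize_by(rows, key, **aggregations):
    """Group rows (dicts) by `key`, then apply each (column, fn) pair per group.

    Hypothetical helper for illustration; not part of pandas, dplyr, or Beam.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = []
    for key_value, members in groups.items():
        out = {key: key_value}
        for out_name, (column, fn) in aggregations.items():
            # The aggregation is only evaluated here, against one group's values.
            out[out_name] = fn([member[column] for member in members])
        result.append(out)
    return result


# Made-up sample data using the column names from the dplyr example.
table = [
    {"ColumnName": "a", "OtherColumn": 1.0},
    {"ColumnName": "a", "OtherColumn": 3.0},
    {"ColumnName": "b", "OtherColumn": 10.0},
]
summary = summarize_by(table, "ColumnName", OtherColumnAvg=("OtherColumn", mean))
```

Adding a second keyword argument (say a median over another column) gives
several aggregations over the same grouping in one pass.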



>
> > which leaves us with S-style formulas (which I like but are pretty niche)
>
> I'm curious, what are these?
>

The other way of doing the above in R, which comes from S-PLUS, is to use
formula constructs. It would look something like this:

aggregate(OtherColumn ~ ColumnName, data=table, mean)

~ is the "formula operator" so it creates something of type formula.

There are even older pivot table approaches as well, but that's probably
the most common pattern outside of the tidyverse. It would be technically
feasible to also do aggregate(mean(OtherColumn) + median(YetAnotherColumn)
~ ColumnName,data=table), though I can't remember if there's a variant that
supports that construct.






>
> > and I guess pivot tables coming from the spreadsheet world. Does make
> > me wonder what Rails' ORM looks like these days (I last used v4), it had
> > some aggregation support and was pretty declarative...
> >
> > On Wed, Oct 18, 2023 at 6:06 PM Robert Bradshaw <rober...@google.com> wrote:
> >>
> >> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis <byronel...@google.com> wrote:
> >> >
> >> > Is it worth taking a look at similar prior art in the space?
> >>
> >> +1. Pointers welcome.
> >>
> >> > The first one that comes to mind is Transform, but with the dbt labs
> >> > acquisition that spec is a lot harder to find. Rill is pretty similar
> >> > though.
> >>
> >> Rill seems to be very SQL-based.
> >>
> >> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >> >>
> >> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> >> feature for even writing a WordCount is the ability to do Aggregations
> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> >> CombineValues), we're eschewing KVs in the notion of more schema'd
> >> >> data (which has some precedent in our other languages, see the links
> >> >> below). The key components the user needs to specify are (1) the key
> >> >> fields on which the grouping will take place, (2) the fields
> >> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> >> fn to use.
> >> >>
> >> >> A straw-man example could be something like
> >> >>
> >> >> type: Aggregating
> >> >> config:
> >> >>   key: [field1, field2]
> >> >>   aggregating:
> >> >>     total_cost:
> >> >>       fn: sum
> >> >>       value: cost
> >> >>     max_cost:
> >> >>       fn: max
> >> >>       value: cost
> >> >>
> >> >> This would basically correspond to the SQL expression
> >> >>
> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> >> from table GROUP BY field1, field2"
> >> >>
> >> >> (though I'm not requiring that we use this as an implementation
> >> >> strategy). I do not think we need a separate (non aggregating)
> >> >> Grouping operation, this can be accomplished by having a concat-style
> >> >> combiner.
> >> >>
> >> >> There are still some open questions here, notably around how to
> >> >> specify the aggregation fns themselves. We could of course provide a
> >> >> number of built-ins (like SQL does). This gets into the question of
> >> >> how and where to document this complete set, but some basics should
> >> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> >> quantiles); where do we put the parameters? We could go with something
> >> >> like
> >> >>
> >> >> fn:
> >> >>   type: ApproximateQuantiles
> >> >>   config:
> >> >>     n: 10
> >> >>
> >> >> but others are even configured by functions themselves (e.g. LargestN
> >> >> that wants a comparator Fn). Maybe we decide not to support these
> >> >> (yet?)
> >> >>
> >> >> One thing I think we should support, however, is referencing custom
> >> >> CombineFns. We have some precedent for this with our Fns from
> >> >> MapToFields, where we accept things like inline lambdas and external
> >> >> references. Again the topic of how to configure them comes up, as
> >> >> these custom Fns are more likely to be parameterized than Map Fns
> >> >> (though, to be clear, perhaps it'd be good to allow parameterization
> >> >> of MapFns as well). Maybe we allow
> >> >>
> >> >> language: python  # like MapToFields (and here it'd be harder to mix
> >> >>                   # and match per Fn)
> >> >> fn:
> >> >>   type: ???
> >> >>   # should these be nested as config?
> >> >>   name: fully.qualified.name
> >> >>   path: /path/to/defining/file
> >> >>   args: [...]
> >> >>   kwargs: {...}
> >> >>
> >> >> which would invoke the constructor.
> >> >>
> >> >> I'm also open to other ways of naming/structuring these essential
> >> >> parameters if it makes things more clear.
> >> >>
> >> >> - Robert
> >> >>
> >> >>
> >> >> Java: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> >> Python: https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> >> Typescript: https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >> >>
> >> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> >> towards offering something more native.
>
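
To make the semantics of the straw-man config quoted above concrete, here is a
minimal plain-Python sketch of how it could be read. This is not Beam code and
not a proposed implementation: the BUILTIN_FNS registry, the aggregate()
helper, and the sample rows are all assumptions for illustration, and the
config is written as the dict a YAML parser would produce.

```python
from collections import defaultdict

# Hypothetical registry mapping the `fn` names used in the config to callables.
BUILTIN_FNS = {"sum": sum, "max": max, "min": min, "count": len}


def aggregate(rows, config):
    """Group `rows` (dicts) on config["key"] and apply each named aggregation."""
    key_fields = config["key"]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[field] for field in key_fields)].append(row)

    results = []
    for key_values, members in groups.items():
        # Start each output row with the grouping fields, as in the SQL SELECT.
        out = dict(zip(key_fields, key_values))
        for out_name, spec in config["aggregating"].items():
            fn = BUILTIN_FNS[spec["fn"]]
            out[out_name] = fn([member[spec["value"]] for member in members])
        results.append(out)
    return results


# The straw-man config from the thread, as a parsed mapping.
config = {
    "key": ["field1", "field2"],
    "aggregating": {
        "total_cost": {"fn": "sum", "value": "cost"},
        "max_cost": {"fn": "max", "value": "cost"},
    },
}

# Made-up input rows.
rows = [
    {"field1": "x", "field2": 1, "cost": 5},
    {"field1": "x", "field2": 1, "cost": 7},
    {"field1": "y", "field2": 2, "cost": 3},
]
output = aggregate(rows, config)
```

Each output row carries the key fields plus one field per entry under
"aggregating", which matches the shape of the GROUP BY query in the proposal.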
