Re: [YAML] Aggregations

Robert Bradshaw via dev Thu, 19 Oct 2023 09:28:42 -0700

On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <[email protected]> wrote:
>
> Rill is definitely SQL-oriented but I think that's going to be the most 
> common. Dataframes are explicitly modeled on the relational approach so 
> that's going to look a lot like SQL,


I think pretty much any approach that fits here is going to be
relational, meaning you choose a set of columns to group on, a set of
columns to aggregate, and how to aggregate. The big open question is
what syntax to use for the "how."

Dataframe aggregation is probably a good example to look at. Here we
have pandas and R in particular as concrete instances. It should also
be easy to support different aggregations over different (or the same)
columns. Pandas can take a list of (or mapping to) functions in its
groupby().agg(). R doesn't seem to make this very easy...
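
For concreteness, here is a minimal sketch of the pandas forms I have in
mind. The column names are borrowed from the straw-man example quoted
below and the data is made up for illustration; it is not a proposal for
how Beam YAML should implement anything.

```python
import pandas as pd

# Made-up rows using the field names from the straw-man example.
df = pd.DataFrame({
    "field1": ["a", "a", "b"],
    "field2": ["x", "x", "y"],
    "cost": [1.0, 3.0, 5.0],
})

# Named aggregation: output column = (input column, how to aggregate).
result = (
    df.groupby(["field1", "field2"], as_index=False)
      .agg(total_cost=("cost", "sum"), max_cost=("cost", "max"))
)

# A list of functions applied to a single column.
by_list = df.groupby("field1")["cost"].agg(["sum", "max"])

# A mapping from column to the function(s) to apply to it.
by_mapping = df.groupby("field1").agg({"cost": ["sum", "max"]})
```

The named-aggregation form is the closest match to "output name, input
column, fn", which are the three things a YAML spelling would need to
capture.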

> which leaves us with S-style formulas (which I like but are pretty niche)

I'm curious, what are these?

>  and I guess pivot tables coming from the spreadsheet world. Does make me 
> wonder what Rails' ORM looks like these days (I last used v4), it had some 
> aggregation support and was pretty declarative...
>
> On Wed, Oct 18, 2023 at 6:06 PM Robert Bradshaw <[email protected]> wrote:
>>
>> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis <[email protected]> wrote:
>> >
>> > Is it worth taking a look at similar prior art in the space?
>>
>> +1. Pointers welcome.
>>
>> > The first one that comes to mind is Transform, but with the dbt labs 
>> > acquisition that spec is a lot harder to find. Rill is pretty similar 
>> > though.
>>
>> Rill seems to be very SQL-based.
>>
>> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev 
>> > <[email protected]> wrote:
>> >>
>> >> Beam Yaml has good support for IOs and mappings, but one key missing
>> >> feature for even writing a WordCount is the ability to do Aggregations
>> >> [1]. While the traditional Beam primitive is GroupByKey (and
>> >> CombineValues), we're eschewing KVs in favor of more schema'd
>> >> data (which has some precedent in our other languages, see the links
>> >> below). The key components the user needs to specify are (1) the key
>> >> fields on which the grouping will take place, (2) the fields
>> >> (expressions?) involved in the aggregation, and (3) what aggregating
>> >> fn to use.
>> >>
>> >> A straw-man example could be something like
>> >>
>> >> type: Aggregating
>> >> config:
>> >>   key: [field1, field2]
>> >>   aggregating:
>> >>     total_cost:
>> >>       fn: sum
>> >>       value: cost
>> >>     max_cost:
>> >>       fn: max
>> >>       value: cost
>> >>
>> >> This would basically correspond to the SQL expression
>> >>
>> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
>> >> from table GROUP BY field1, field2"
>> >>
>> >> (though I'm not requiring that we use this as an implementation
>> >> strategy). I do not think we need a separate (non aggregating)
>> >> Grouping operation, this can be accomplished by having a concat-style
>> >> combiner.
>> >>
>> >> There are still some open questions here, notably around how to
>> >> specify the aggregation fns themselves. We could of course provide a
>> >> number of built-ins (like SQL does). This gets into the question of
>> >> how and where to document this complete set, but some basics should
>> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
>> >> quantiles); where do we put the parameters? We could go with something
>> >> like
>> >>
>> >> fn:
>> >>   type: ApproximateQuantiles
>> >>   config:
>> >>     n: 10
>> >>
>> >> but others are even configured by functions themselves (e.g. LargestN
>> >> that wants a comparator Fn). Maybe we decide not to support these
>> >> (yet?)
>> >>
>> >> One thing I think we should support, however, is referencing custom
>> >> CombineFns. We have some precedent for this with our Fns from
>> >> MapToFields, where we accept things like inline lambdas and external
>> >> references. Again the topic of how to configure them comes up, as
>> >> these custom Fns are more likely to be parameterized than Map Fns
>> >> (though, to be clear, perhaps it'd be good to allow parameterization of
>> >> MapFns as well). Maybe we allow
>> >>
>> >> language: python  # like MapToFields (and here it'd be harder to mix
>> >> # and match per Fn)
>> >> fn:
>> >>   type: ???
>> >>   # should these be nested as config?
>> >>   name: fully.qualified.name
>> >>   path: /path/to/defining/file
>> >>   args: [...]
>> >>   kwargs: {...}
>> >>
>> >> which would invoke the constructor.
>> >>
>> >> I'm also open to other ways of naming/structuring these essential
>> >> parameters if it makes things more clear.
>> >>
>> >> - Robert
>> >>
>> >>
>> >> Java: 
>> >> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
>> >> Python: 
>> >> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
>> >> Typescript: 
>> >> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>> >>
>> >> [1] One can of course use SqlTransform for this, but I'm leaning
>> >> towards offering something more native.
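
To make the intended semantics of the straw-man config concrete, here is
a rough pure-Python sketch. The `aggregate` helper and the `BUILTIN_FNS`
registry are hypothetical names made up for illustration, not Beam APIs,
and rows are modeled as plain dicts rather than schema'd PCollection
elements.

```python
from collections import defaultdict

# Hypothetical registry of built-in aggregation fns (names are assumptions).
BUILTIN_FNS = {
    "sum": sum,
    "max": max,
    "min": min,
    "count": len,
    "concat": list,  # a concat-style combiner yields plain grouping
}


def aggregate(rows, config):
    """Group dict rows by config['key'] and apply each named aggregation."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in config["key"])].append(row)

    results = []
    for key_values, members in groups.items():
        out = dict(zip(config["key"], key_values))
        for out_name, spec in config["aggregating"].items():
            fn = BUILTIN_FNS[spec["fn"]]
            out[out_name] = fn([m[spec["value"]] for m in members])
        results.append(out)
    return results


# The straw-man config, written as the dict a YAML parser would produce.
config = {
    "key": ["field1", "field2"],
    "aggregating": {
        "total_cost": {"fn": "sum", "value": "cost"},
        "max_cost": {"fn": "max", "value": "cost"},
    },
}
rows = [
    {"field1": "a", "field2": "x", "cost": 1},
    {"field1": "a", "field2": "x", "cost": 3},
    {"field1": "b", "field2": "y", "cost": 5},
]
result = aggregate(rows, config)
```

Swapping "sum" for "concat" in the config gives the non-aggregating
grouping case without needing a separate operation.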
