On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> wrote:
>
> Rill is definitely SQL-oriented but I think that's going to be the most
> common. Dataframes are explicitly modeled on the relational approach so
> that's going to look a lot like SQL,

I think pretty much any approach that fits here is going to be
relational, meaning you choose a set of columns to group on, a set of
columns to aggregate, and how to aggregate. The big open question is
what syntax to use for the "how."

Dataframe aggregation is probably a good example to look at. Here we
have pandas and R in particular as concrete instances. It should also be
easy to support different aggregations over different (or the same)
columns. Pandas can take a list of (or mapping to) functions in its
groupby().agg(). R doesn't seem to make this very easy...

> which leaves us with S-style formulas (which I like but are pretty niche)

I'm curious, what are these?

> and I guess pivot tables coming from the spreadsheet world. Does make me
> wonder what Rails' ORM looks like these days (I last used v4), it had some
> aggregation support and was pretty declarative...
>
> On Wed, Oct 18, 2023 at 6:06 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis <byronel...@google.com> wrote:
>> >
>> > Is it worth taking a look at similar prior art in the space?
>>
>> +1. Pointers welcome.
>>
>> > The first one that comes to mind is Transform, but with the dbt labs
>> > acquisition that spec is a lot harder to find. Rill is pretty similar
>> > though.
>>
>> Rill seems to be very SQL-based.
>>
>> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev
>> > <dev@beam.apache.org> wrote:
>> >>
>> >> Beam Yaml has good support for IOs and mappings, but one key missing
>> >> feature for even writing a WordCount is the ability to do Aggregations
>> >> [1]. While the traditional Beam primitive is GroupByKey (and
>> >> CombineValues), we're eschewing KVs in favor of more schema'd
>> >> data (which has some precedent in our other languages, see the links
>> >> below). The key components the user needs to specify are (1) the key
>> >> fields on which the grouping will take place, (2) the fields
>> >> (expressions?)
>> >> involved in the aggregation, and (3) what aggregating fn to use.
>> >>
>> >> A straw-man example could be something like
>> >>
>> >>     type: Aggregating
>> >>     config:
>> >>       key: [field1, field2]
>> >>       aggregating:
>> >>         total_cost:
>> >>           fn: sum
>> >>           value: cost
>> >>         max_cost:
>> >>           fn: max
>> >>           value: cost
>> >>
>> >> This would basically correspond to the SQL expression
>> >>
>> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
>> >> from table GROUP BY field1, field2"
>> >>
>> >> (though I'm not requiring that we use this as an implementation
>> >> strategy). I do not think we need a separate (non aggregating)
>> >> Grouping operation, this can be accomplished by having a concat-style
>> >> combiner.
>> >>
>> >> There are still some open questions here, notably around how to
>> >> specify the aggregation fns themselves. We could of course provide a
>> >> number of built-ins (like SQL does). This gets into the question of
>> >> how and where to document this complete set, but some basics should
>> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
>> >> quantiles); where do we put the parameters? We could go with something
>> >> like
>> >>
>> >>     fn:
>> >>       type: ApproximateQuantiles
>> >>       config:
>> >>         n: 10
>> >>
>> >> but others are even configured by functions themselves (e.g. LargestN
>> >> that wants a comparator Fn). Maybe we decide not to support these
>> >> (yet?)
>> >>
>> >> One thing I think we should support, however, is referencing custom
>> >> CombineFns. We have some precedent for this with our Fns from
>> >> MapToFields, where we accept things like inline lambdas and external
>> >> references. Again the topic of how to configure them comes up, as
>> >> these custom Fns are more likely to be parameterized than Map Fns
>> >> (though, to be clear, perhaps it'd be good to allow parameterization of
>> >> MapFns as well). Maybe we allow
>> >>
>> >>     language: python
>> >>     # like MapToFields (and here it'd be harder to mix and match per Fn)
>> >>     fn:
>> >>       type: ???
>> >>       # should these be nested as config?
>> >>       name: fully.qualified.name
>> >>       path: /path/to/defining/file
>> >>       args: [...]
>> >>       kwargs: {...}
>> >>
>> >> which would invoke the constructor.
>> >>
>> >> I'm also open to other ways of naming/structuring these essential
>> >> parameters if it makes things more clear.
>> >>
>> >> - Robert
>> >>
>> >>
>> >> Java:
>> >> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
>> >> Python:
>> >> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
>> >> Typescript:
>> >> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>> >>
>> >> [1] One can of course use SqlTransform for this, but I'm leaning
>> >> towards offering something more native.
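
To make the straw-man Aggregating config quoted above concrete, here is a
minimal plain-Python sketch of one way it could be interpreted over rows
represented as dicts. It is not Beam code and not a proposed
implementation: the aggregate() helper, the BUILTIN_FNS registry, and the
sample rows are all hypothetical. Only the config shape (key,
aggregating, fn, value) comes from the thread.

```python
from collections import defaultdict

# Hypothetical registry of built-in aggregators. The names "sum" and "max"
# mirror the straw-man config; "min" and "count" are added for illustration.
# None of this is an actual Beam YAML API.
BUILTIN_FNS = {
    "sum": sum,
    "max": max,
    "min": min,
    "count": len,
}


def aggregate(rows, config):
    """Group rows (dicts) by config["key"] and apply each named aggregation."""
    # (1) Bucket rows by the tuple of key-field values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in config["key"])].append(row)

    results = []
    for key_values, members in groups.items():
        # Start each output row with the grouping fields.
        out = dict(zip(config["key"], key_values))
        # (2) + (3) For each output field, pull the input field's values
        # and apply the named aggregating fn.
        for out_name, spec in config["aggregating"].items():
            fn = BUILTIN_FNS[spec["fn"]]
            out[out_name] = fn([m[spec["value"]] for m in members])
        results.append(out)
    return results


# The straw-man config from the thread, written as a Python dict.
config = {
    "key": ["field1", "field2"],
    "aggregating": {
        "total_cost": {"fn": "sum", "value": "cost"},
        "max_cost": {"fn": "max", "value": "cost"},
    },
}

# Made-up sample data.
rows = [
    {"field1": "a", "field2": "x", "cost": 3},
    {"field1": "a", "field2": "x", "cost": 5},
    {"field1": "b", "field2": "y", "cost": 7},
]

result = aggregate(rows, config)
for row in result:
    print(row)
```

This produces one output row per distinct (field1, field2) pair, which is
the same shape the quoted SQL GROUP BY would return. Parameterized or
custom aggregators (quantiles, LargestN, user CombineFns) would need a
richer fn spec than the bare string used here, which is exactly the open
question in the thread.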