Using SQL expressions in strings is maybe OK given we are all relational all the time. Either way you have to define what the universe of `fn` is. Here's a compact possibility:

    type: Combine
    config:
      group_by: [field1, field2]
      aggregates:
        max_cost: "MAX(cost)"
        total_cost: "SUM(cost)"

Just a thought to get it closer to SQL concision. I also used the word "Combine" just to connect it to other Beam writings and whatnot.

Kenn

On Thu, Oct 19, 2023 at 1:41 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>
> On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský <je...@seznam.cz> wrote:
> >
> > On 10/19/23 18:28, Robert Bradshaw via dev wrote:
> > > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> wrote:
> > >> Rill is definitely SQL-oriented but I think that's going to be the most
> > >> common. Dataframes are explicitly modeled on the relational approach so
> > >> that's going to look a lot like SQL,
> > > I think pretty much any approach that fits here is going to be
> > > relational, meaning you choose a set of columns to group on, a set of
> > > columns to aggregate, and how to aggregate. The big open question is
> > > what syntax to use for the "how."
> > This might be already answered, if so, pardon my ignorance, but what is
> > the goal this declarative approach is trying to solve? Is it meant to be
> > more expressive or equally expressive than SQL? And if more, how much more?
>
> I'm not sure if you're asking about YAML in general, or the particular
> case of aggregation, but I can answer both.
>
> For the larger Beam YAML project, it's trying to solve the problem
> that SQL is (and I'll admit this is somewhat subjective here) good at
> expressing the T part of ETL, but not the other parts. For example,
> the simple data movement use case of (say) reading from PubSub and
> dumping into BigQuery is not well expressed in terms of SQL. SQL is
> also fairly awkward when it comes to defining UDFs and TDFs and
> non-linear pipelines (especially those with fanout). There are of
> course other tools in this space (dbt comes to mind, and there's been
> some investigation on how to make dbt play well with Beam). The other
> niche it is trying to solve is that installing and learning a full SDK
> is heavyweight and overkill for creating pipelines that are simply
> wiring together pre-defined transforms.
>
> As for the more narrow case of aggregations, I think being similarly
> expressive as SQL is fine, though it'd be good to make custom UADFs
> more natural. Originally I was thinking that just having SqlTransform
> might be sufficient, but it feels like a big hammer to reach for every
> time I just want to sum over one or two columns.
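
To make the "define the universe of `fn`" point concrete, here is a rough sketch of how aggregate strings such as "MAX(cost)" could be interpreted against a fixed, explicit set of functions. This is plain Python, not Beam code and not a proposal for the actual implementation; the names AGG_FNS, parse_aggregate, and combine are hypothetical and exist only for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical "universe" of allowed aggregation functions (illustration only).
AGG_FNS = {
    "MAX": max,
    "MIN": min,
    "SUM": sum,
    "COUNT": len,
}

# Accepts only the simple form FN(field); anything else is rejected.
_EXPR = re.compile(r"^\s*([A-Za-z_]+)\(\s*([A-Za-z_][A-Za-z0-9_]*)\s*\)\s*$")


def parse_aggregate(expr):
    """Split a string like "MAX(cost)" into (callable, field name)."""
    match = _EXPR.match(expr)
    if match is None:
        raise ValueError(f"unsupported aggregate expression: {expr!r}")
    fn_name, field = match.group(1).upper(), match.group(2)
    if fn_name not in AGG_FNS:
        raise ValueError(f"unknown aggregate function: {fn_name!r}")
    return AGG_FNS[fn_name], field


def combine(rows, group_by, aggregates):
    """Group dict rows by the group_by fields and apply each aggregate."""
    parsed = {out: parse_aggregate(expr) for out, expr in aggregates.items()}
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in group_by)].append(row)
    result = []
    for key, members in groups.items():
        out_row = dict(zip(group_by, key))
        for out, (fn, field) in parsed.items():
            out_row[out] = fn([member[field] for member in members])
        result.append(out_row)
    return result
```

Calling combine(rows, ["field1", "field2"], {"max_cost": "MAX(cost)", "total_cost": "SUM(cost)"}) mirrors the config above. Anything outside the listed functions is rejected up front, which is the trade-off of a closed universe: it stays concise, but custom aggregations would need some other hook.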