Re: [YAML] Aggregations

Kenneth Knowles Thu, 19 Oct 2023 11:13:07 -0700

Using SQL expressions in strings is maybe OK given we are all
relational all the time. Either way you have to define what the
universe of `fn` is. Here's a compact possibility:


type: Combine
config:
  group_by: [field1, field2]
  aggregates:
    max_cost: "MAX(cost)"
    total_cost: "SUM(cost)"

Just a thought to get it closer to SQL concision. I also used the word
"Combine" just to connect it to other Beam writings and whatnot.
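
To make the "universe of `fn`" point concrete, here is a minimal sketch in
plain Python (not Beam) of how strings like "MAX(cost)" could be read against
a fixed set of known functions. All the names in it (AGGREGATORS,
parse_aggregate, combine) and the four-function set are hypothetical, chosen
only for illustration; they are not an existing Beam YAML API.

```python
import re
from collections import defaultdict

# Hypothetical "universe of fn": the only names a spec string may use.
AGGREGATORS = {
    "MAX": max,
    "MIN": min,
    "SUM": sum,
    "COUNT": len,
}

# Accepts exactly FN(field), with optional whitespace around the tokens.
_SPEC = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(\s*([A-Za-z_]\w*)\s*\)\s*$")


def parse_aggregate(spec):
    """Split a string such as "MAX(cost)" into (function name, field name)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"not of the form FN(field): {spec!r}")
    fn, field = match.group(1).upper(), match.group(2)
    if fn not in AGGREGATORS:
        raise ValueError(f"unknown aggregate function: {fn}")
    return fn, field


def combine(rows, group_by, aggregates):
    """Group dict rows by the group_by fields and apply each aggregate."""
    parsed = {out: parse_aggregate(spec) for out, spec in aggregates.items()}

    # Bucket rows by the tuple of their grouping-field values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_by)].append(row)

    result = []
    for key, members in groups.items():
        out_row = dict(zip(group_by, key))
        for out, (fn, field) in parsed.items():
            out_row[out] = AGGREGATORS[fn]([m[field] for m in members])
        result.append(out_row)
    return result


rows = [
    {"field1": "a", "field2": 1, "cost": 3},
    {"field1": "a", "field2": 1, "cost": 5},
    {"field1": "b", "field2": 2, "cost": 7},
]
config = {
    "group_by": ["field1", "field2"],
    "aggregates": {"max_cost": "MAX(cost)", "total_cost": "SUM(cost)"},
}
for row in combine(rows, config["group_by"], config["aggregates"]):
    print(row)
```

Rejecting anything outside the fixed table is the simplest way to pin down
what `fn` may be; accepting arbitrary SQL expressions would need a real parser.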

Kenn

On Thu, Oct 19, 2023 at 1:41 PM Robert Bradshaw via dev
<dev@beam.apache.org> wrote:
>
> On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský <je...@seznam.cz> wrote:
> >
> > On 10/19/23 18:28, Robert Bradshaw via dev wrote:
> > > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> wrote:
> > >> Rill is definitely SQL-oriented but I think that's going to be the most 
> > >> common. Dataframes are explicitly modeled on the relational approach so 
> > >> that's going to look a lot like SQL,
> > > I think pretty much any approach that fits here is going to be
> > > relational, meaning you choose a set of columns to group on, a set of
> > > columns to aggregate, and how to aggregate. The big open question is
> > > what syntax to use for the "how."
> > This might be already answered, if so, pardon my ignorance, but what is
> > the goal this declarative approach is trying to solve? Is it meant to be
> > more expressive or equally expressive than SQL? And if more, how much more?
>
> I'm not sure if you're asking about YAML in general, or the particular
> case of aggregation, but I can answer both.
>
> For the larger Beam YAML project, it's trying to solve the problem
> that SQL is (and I'll admit this is somewhat subjective here) good at
> expressing the T part of ETL, but not the other parts. For example,
> the simple data movement use case of (say) reading from PubSub and
> dumping into BigQuery is not well expressed in terms of SQL. SQL is
> also fairly awkward when it comes to defining UDFs and TDFs and
> non-linear pipelines (especially those with fanout). There are of
> course other tools in this space (dbt comes to mind, and there's been
> some investigation on how to make dbt play well with Beam). The other
> niche it is trying to solve is that installing and learning a full SDK
> is heavyweight and overkill for creating pipelines that are simply
> wiring together pre-defined transforms.
>
> As for the more narrow case of aggregations, I think being about as
> expressive as SQL is fine, though it'd be good to make custom UDAFs
> more natural. Originally I was thinking that just having SqlTransform
> might be sufficient, but it feels like a big hammer to reach for every
> time I just want to sum over one or two columns.
