On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský <je...@seznam.cz> wrote:
>
> On 10/19/23 18:28, Robert Bradshaw via dev wrote:
> > On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> wrote:
> >> Rill is definitely SQL-oriented but I think that's going to be the most 
> >> common. Dataframes are explicitly modeled on the relational approach so 
> >> that's going to look a lot like SQL,
> > I think pretty much any approach that fits here is going to be
> > relational, meaning you choose a set of columns to group on, a set of
> > columns to aggregate, and how to aggregate. The big open question is
> > what syntax to use for the "how."
> This might be already answered, if so, pardon my ignorance, but what is
> the goal this declarative approach is trying to solve? Is it meant to be
> more expressive or equally expressive than SQL? And if more, how much more?

I'm not sure if you're asking about YAML in general, or the particular
case of aggregation, but I can answer both.

For the larger Beam YAML project, it's trying to solve the problem
that SQL is (and I'll admit this is somewhat subjective) good at
expressing the T part of ETL, but not the other parts. For example,
the simple data movement use case of (say) reading from PubSub and
dumping into BigQuery is not well expressed in terms of SQL. SQL is
also fairly awkward when it comes to defining UDFs and TDFs and
non-linear pipelines (especially those with fanout). There are of
course other tools in this space (dbt comes to mind, and there's been
some investigation on how to make dbt play well with Beam). The other
niche it is trying to fill is the case where installing and learning a
full SDK is heavyweight and overkill, because the pipeline simply
wires together pre-defined transforms.

As for the narrower case of aggregations, I think being about as
expressive as SQL is fine, though it'd be good to make custom UDAFs
more natural. Originally I was thinking that just having SqlTransform
might be sufficient, but it feels like a big hammer to reach for every
time I just want to sum over one or two columns.
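
To make the relational shape concrete (columns to group on, columns to
aggregate, and how), here is a minimal sketch in plain Python. It is
not Beam YAML syntax and not an existing Beam API; the `aggregate`
helper, the sample `rows`, and the column names are all hypothetical,
chosen only to illustrate that the "how" can be either a builtin such
as `sum` or a custom function playing the role of a UDAF.

```python
from collections import defaultdict


def aggregate(rows, group_by, aggregations):
    """Group dict rows by the `group_by` columns and combine other columns.

    `aggregations` maps an output name to (input column, combiner), where
    the combiner is any callable that takes a list of values.
    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for out_name, (col, combine) in aggregations.items():
            out[out_name] = combine([m[col] for m in members])
        result.append(out)
    return result


# Made-up sample data.
rows = [
    {"user": "a", "region": "eu", "clicks": 3, "cost": 1.5},
    {"user": "b", "region": "us", "clicks": 5, "cost": 2.0},
    {"user": "a", "region": "us", "clicks": 4, "cost": 0.5},
]

# "Sum over one or two columns": builtins serve as the combiners.
totals = aggregate(
    rows,
    group_by=["user"],
    aggregations={"total_clicks": ("clicks", sum), "max_cost": ("cost", max)},
)


# A custom aggregation (the UDAF case) is just another callable.
def mean(values):
    return sum(values) / len(values)


means = aggregate(rows, ["user"], {"mean_clicks": ("clicks", mean)})
```

Whatever syntax ends up being chosen, it would need to carry the same
three pieces of information that `aggregate` takes as arguments here.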
