Got it, that makes a lot of sense.

Common dimensions is the use case I was thinking of.

Our semantic layer exposes a SQL interface that allows for queries similar
to this:

```
select
  productId,
  sum_revenue,
  total_on_hand
from __all
where color = 'Red'
group by productId
```

It knows that sum_revenue comes from orders, total_on_hand comes from
inventory, and that productId and color are common dimensions. So it can
rewrite this query as:

```
select
  coalesce(o.productId, v.productId) as productId,
  o.sum_revenue,
  v.total_on_hand
from (
  select
    productId,
    sum_revenue
  from orders
  where color = 'Red'
  group by productId
) as o
full outer join (
  select
    productId,
    total_on_hand
  from inventory
  where color = 'Red'
  group by productId
) as v
  on o.productId = v.productId
```

So you could define measures on this __all abomination - say,
revenue_per_total = sum_revenue / total_on_hand.
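
A user could then slice that cross-star measure like any other (assuming
revenue_per_total is defined on __all, as sketched further down):

```
select
  color,
  revenue_per_total
from __all
group by color
```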

This has many of the benefits proposed in the paper. For example, it's an
even easier target for generative AI - the model doesn't need to know how
to join cubes.

re: your statement "I don’t think it would be a good idea to build a view
on this many-to-many join (too many columns, too much complexity)."

I want to dig into what such a view would look like. I'm imagining:

```
create view __all as
select
  coalesce(o.productId, v.productId) as productId,
  coalesce(o.color, v.color) as color,
  -- ... many more dimensions

  o.sum_revenue,
  v.total_on_hand,
  -- a measure that spans multiple tables:
  o.sum_revenue / v.total_on_hand as measure revenue_per_total,
  -- ... many more measures

from orders as o
full outer join inventory as v
  on o.productId = v.productId
  and o.color = v.color
  -- ... more join conditions

-- ... more joined tables
```

A lot of challenges here:
- How do you efficiently deal with so many columns?
- Is it possible for an optimizer rule to prune the unused join conditions?
  (See the sketch below.)
- Is there sleeker syntax that would make defining such a view more tenable?
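
For example, if a query touches only productId, you'd hope the planner
could prune the color join condition away. A sketch of the desired
rewrite (nothing that exists today):

```
select
  productId,
  revenue_per_total
from __all
group by productId

-- ideally becomes:
select
  coalesce(o.productId, v.productId) as productId,
  o.sum_revenue / v.total_on_hand as revenue_per_total
from (
  select productId, sum_revenue from orders group by productId
) as o
full outer join (
  select productId, total_on_hand from inventory group by productId
) as v
  on o.productId = v.productId  -- the color condition has been pruned
```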

Does such complexity belong in a fundamental concept like Measures, or in
some semantic layer with custom rewrite rules? I don't know. But I do think
there is value in such an abstraction, and it would be cool if it were
standardized.

On Tue, Oct 1, 2024 at 6:36 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:

> Good question - thanks for asking.
>
> The short answer is that measures that span tables work, but they aren’t
> quite as convenient as measures in a single table.
>
> When designing Measures in SQL I knew that measures sometimes span tables
> - star schemas are a case in point - but decided to make measures belong to
> one table for simplicity. This allowed me to put them into the same
> namespace as columns, so I can write 'select e.avg_sal from emp as e'.
> (Measures are not columns, but SQL semantic analyzers already know how to
> look up a column within the scope of a table.)
>
> My goal was to allow dimensional expressions, such as revenue-this-year
> minus revenue-last-year, so we need to know which dimensions affect the
> value of a measure. By associating a measure with a table we can
> say that the (non-measure) columns of that table are its dimensions.
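>
> For example, with a hypothetical orderYear dimension on orders,
> year-over-year revenue might look something like this (using the AT
> syntax from the paper):
>
>   select orderYear,
>       sum_revenue,
>       sum_revenue at (set orderYear = orderYear - 1) as prev_year_revenue
>   from orders
>   group by orderYear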
>
> One case of 'measures spanning tables' is joining a table with a measure
> to a regular table. For example, suppose the orders table has an
> avg_shipping_cost measure, and we write this query:
>
>   select o.zipcode, o.avg_shipping_cost, count(distinct i.color)
>   from orders as o
>     join orderItems as i on o.id = i.orderId
>   group by o.zipcode
>
> If I’d written avg(o.shipping_cost), orders with many items would be
> weighted more heavily in the total than orders with few items. But measures
> are ‘locked’ to the grain of their table.
>
> This is a crucial property of measures. It allows me to create a wide join
> view over a star schema - say of orders, order-items, customers and
> products - and the measures in that view will be well-behaved. Users get
> the benefits of the join without having to remember the right join
> conditions.
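>
> Something like this (hypothetical column names for customers and
> products):
>
>   create view orders_star as
>   select o.zipcode, o.avg_shipping_cost, i.color, c.state, p.category
>   from orders as o
>     join orderItems as i on o.id = i.orderId
>     join customers as c on o.customerId = c.id
>     join products as p on i.productId = p.id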
>
> By the way, if I really want a weighted average shipping cost, I can
> unpack the shipping-cost measure into a value, join it to the order-items,
> and then re-package it as a measure again:
>
>   select o.zipcode,
>       avg(o.avg_shipping_cost) as measure weighted_avg_shipping_cost,
>       count(distinct i.color)
>   from orders as o
>     join orderItems as i on o.id = i.orderId
>   group by o.zipcode
>
> Now, another case of ‘measures spanning tables’ is common dimensions (also
> known as conformed dimensions). Let’s suppose I have an orders table and an
> inventory table, both with measures, and both with a date column (orderDate
> and inventoryDate) and a reference to a product. I can write a query that
> joins these tables:
>
>   select o.sum_revenue, v.total_on_hand
>   from orders as o
>     join inventory as v
>     on o.orderDate = v.inventoryDate
>     and o.productId = v.productId
>   where o.color = 'Red'
>
> This is a many-to-many join — like joining two cubes in an OLAP system —
> but the measures’ grain-locking prevents double-counting.
>
> Notice that I have had to manually put the common dimensions into a join
> condition. There is nothing in the definition of the orders.sum_revenue and
> inventory.total_on_hand measures that connects their date dimensions. I
> have to remember to make that connection.
>
> Where to store those definitions? I don’t know. I don’t think it would be
> a good idea to build a view on this many-to-many join (too many columns,
> too much complexity).
>
> One idea is to store these common dimensions in the semantic layer.
> Perhaps a 'semantic layer' should have entities like ‘product’, ‘date’,
> ‘customer’ and record which columns on the ‘cube’ tables connect to each
> entity. And then the semantic layer can help people to write SQL. In my
> opinion, measures are not really the semantic layer — they are already
> there, as part of the tables, but their existence means that the semantic
> layer doesn’t have to worry about grain and double-counting, and so can
> generate simpler SQL.
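>
> To make the entity idea concrete, the mapping might be as simple as
> (hypothetical, just to illustrate):
>
>   entity   table      column
>   -------  ---------  -------------
>   product  orders     productId
>   product  inventory  productId
>   date     orders     orderDate
>   date     inventory  inventoryDate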
>
> Julian
>
>
>
> > On Sep 30, 2024, at 3:00 PM, Barak Alon <barak.s.a...@gmail.com> wrote:
> >
> > Hey there -
> >
> > I've found my way here via Measures in SQL
> > <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long-time
> > admirer of Calcite, but I'm not very familiar with internals - apologies
> > if I stumble.
> >
> > I work on Airbnb's Minerva
> > <https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70>
> > project, an internal semantic layer for star schemas. Measures in SQL is a
> > wonderfully elegant idea, and I've started exploring if we can introduce
> > them to our system.
> >
> > However, our users often write queries that span multiple tables/stars,
> > requiring a drill-across operation. Our system handles this by exposing a
> > virtual table that pretends our entire semantic layer is a single,
> > wide fact table. Incompatible measure/dimension combinations result in
> > runtime errors.
> >
> > This need feels pretty common and a logical extension to some of
> > Measures' strengths - making SQL more powerful and expressive for
> > "top down" queries.
> >
> > I'm curious - are there plans to support measures that span multiple
> > tables in Calcite?
> >
> > - Barak
>
>
