Got it, that makes a lot of sense. Common dimensions is the use case I was thinking of.
Our semantic layer exposes a SQL interface that allows for queries like this:

```
select productId, sum_revenue, total_on_hand
from __all
where color = 'Red'
group by productId
```

It knows that sum_revenue comes from orders, that total_on_hand comes from inventory, and that productId and color are common dimensions. So it can rewrite this query as:

```
select coalesce(o.productId, v.productId) as productId,
  o.sum_revenue,
  v.total_on_hand
from (
  select productId, sum_revenue
  from orders
  where color = 'Red'
  group by productId
) as o
full outer join (
  select productId, total_on_hand
  from inventory
  where color = 'Red'
  group by productId
) as v on o.productId = v.productId
```

So you could define measures on this __all abomination - say, revenue_per_total = sum_revenue / total_on_hand.

This has many of the same benefits proposed in the paper. For example, it is an even easier target for generative AI - the model doesn't need to know how to join cubes.

re: your statement "I don’t think it would be a good idea to build a view on this many-to-many join (too many columns, too much complexity)" - I want to dig into what such a view would look like. I'm imagining:

```
create view __all as
select coalesce(o.productId, v.productId) as productId,
  coalesce(o.color, v.color) as color,
  -- ... many more dimensions
  o.sum_revenue,
  v.total_on_hand,
  -- a measure that spans multiple tables:
  o.sum_revenue / v.total_on_hand as measure revenue_per_total
  -- ... many more measures
from orders as o
full outer join inventory as v
  on o.productId = v.productId
  and o.color = v.color
  -- ... many more join conditions
```

A lot of challenges here:

- Efficiently dealing with so many columns?
- Is it possible for an optimizer rule to prune the unused join conditions?
- Is there sleeker syntax that would make defining such a view more tenable?
- Does such complexity belong in a fundamental concept like Measures, or in some semantic layer with custom rewrite rules?

I don't know. But I do think there is value in such an abstraction, and it would be cool if it were standardized.

On Tue, Oct 1, 2024 at 6:36 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:

> Good question - thanks for asking.
>
> The short answer is that measures that span tables work, but they aren’t quite as convenient as measures in a single table.
>
> When designing Measures in SQL I knew that measures sometimes span tables - star schemas are a case in point - but decided to make measures belong to one table for simplicity. This allowed me to put them into the same namespace as columns, so I can write ‘select e.avg_sal from emp as e’. (Measures are not columns, but SQL semantic analyzers already know how to look up a column within the scope of a table.)
>
> My goal was to allow dimensional expressions, such as revenue-this-year minus revenue-last-year, so we need to know which dimensions affect the value of a measure. By associating a measure with a table, we can say that the (non-measure) columns of that table are its dimensions.
>
> One case of ‘measures spanning tables’ is joining a table with a measure to a regular table. For example, suppose the orders table has an avg_shipping_cost measure, and we write this query:
>
>   select o.zipcode, o.avg_shipping_cost, count(distinct i.color)
>   from orders as o
>   join orderItems as i on o.id = i.orderId
>   group by o.zipcode
>
> If I’d written avg(o.shipping_cost), orders with many items would be weighted more heavily in the total than orders with few items. But measures are ‘locked’ to the grain of their table.
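>
> To spell out what ‘locked to the grain’ means, that query is roughly equivalent to the following plain SQL. (An illustrative expansion only - not necessarily the exact rewrite an implementation would produce.)
>
>   select o.zipcode,
>     min(z.avg_shipping_cost) as avg_shipping_cost, -- constant per zipcode
>     count(distinct i.color)
>   from orders as o
>   join orderItems as i on o.id = i.orderId
>   join (
>     -- evaluate the measure at the orders grain, per zipcode
>     select zipcode, avg(shipping_cost) as avg_shipping_cost
>     from orders
>     group by zipcode
>   ) as z on z.zipcode = o.zipcode
>   group by o.zipcode
>
> Because the inner aggregate ranges over orders only, the fan-out to order items cannot skew the average.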
>
> This is a crucial property of measures. It allows me to create a wide join view over a star schema - say of orders, order-items, customers and products - and the measures in that view will be well-behaved. Users get the benefits of the join without having to remember the right join conditions.
>
> By the way, if I really want a weighted average shipping cost, I can unpack the shipping-cost measure into a value, join it to the order-items, and then re-package it as a measure again:
>
>   select o.zipcode,
>     avg(o.avg_shipping_cost) as measure weighted_avg_shipping_cost,
>     count(distinct i.color)
>   from orders as o
>   join orderItems as i on o.id = i.orderId
>   group by o.zipcode
>
> Now, another case of ‘measures spanning tables’ is common dimensions (also known as conformed dimensions). Let’s suppose I have an orders table and an inventory table, both with measures, and both with a date column (orderDate and inventoryDate) and a reference to a product. I can write a query that joins these tables:
>
>   select o.sum_revenue, v.total_on_hand
>   from orders as o
>   join inventory as v
>     on o.orderDate = v.inventoryDate
>     and o.productId = v.productId
>   where o.color = 'Red'
>
> This is a many-to-many join — like joining two cubes in an OLAP system — but the measures’ grain-locking prevents double-counting.
>
> Notice that I have had to manually put the common dimensions into a join condition. There is nothing in the definitions of the orders.sum_revenue and inventory.total_on_hand measures that connects their date dimensions. I have to remember to make that connection.
>
> Where to store those definitions? I don’t know. I don’t think it would be a good idea to build a view on this many-to-many join (too many columns, too much complexity).
>
> One idea is to store these common dimensions in the semantic layer. Perhaps a ‘semantic layer’ should have entities like ‘product’, ‘date’ and ‘customer’, and record which columns on the ‘cube’ tables connect to each entity. Then the semantic layer can help people write SQL. In my opinion, measures are not really part of the semantic layer — they are already there, as part of the tables — but their existence means that the semantic layer doesn’t have to worry about grain and double-counting, and so can generate simpler SQL.
>
> Julian
>
> On Sep 30, 2024, at 3:00 PM, Barak Alon <barak.s.a...@gmail.com> wrote:
>
> > Hey there -
> >
> > I've found my way here via Measures in SQL <https://dl.acm.org/doi/pdf/10.1145/3626246.3653374>. I'm a long-time admirer of Calcite, but I'm not very familiar with the internals - apologies if I stumble.
> >
> > I work on Airbnb's Minerva <https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70> project, an internal semantic layer for star schemas. Measures in SQL is a wonderfully elegant idea, and I've started exploring whether we can introduce them to our system.
> >
> > However, our users often write queries that span multiple tables/stars, requiring a drill-across operation. Our system handles this by exposing a virtual table that pretends our entire semantic layer is a single, wide fact table. Incompatible measure/dimension combinations result in runtime errors.
> >
> > This need feels pretty common and a logical extension of some of Measures' strengths - making SQL more powerful and expressive for "top down" queries.
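> >
> > To make the virtual-table approach concrete, a drill-across query looks roughly like this (a sketch with invented table and measure names, not Minerva's actual interface):
> >
> >   select ds, sum_bookings, active_listings
> >   from wide_fact_table
> >   group by ds
> >
> > The system routes each measure to its underlying star and drills across on the shared ds dimension; a measure grouped by a dimension its star lacks is the runtime-error case above.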
> >
> > I'm curious - are there plans to support measures that span multiple tables in Calcite?
> >
> > - Barak