I’d like to ask a provocative question: Why does Kylin have hierarchies?

There may be some good reasons, but having thought for a long time about OLAP 
architectures I have come to the conclusion that hierarchies can be more 
trouble than they are worth. I regret that I made them so central to Mondrian’s 
architecture; they are a part of the MDX language, so Mondrian had to have them 
in some form, but more of the system should have been built using attributes. 
Since Kylin is SQL-based, it doesn’t need hierarchies at all.

In OLAP, hierarchies are really useful in the presentation layer: a hierarchy 
is a drill path. If a user has just expanded attribute A (e.g. Year) then they 
are very likely to want to expand attribute B (e.g. Month) or C (e.g. Week). 
So, hierarchies improve the user’s experience.

In the engine and storage layer there are some concepts similar to hierarchies:

- functional dependencies (i.e. for a given value of X, column Y always has the 
  same value),
- highly correlated columns (e.g. for a given value of zipcode, state almost 
  always has the same value), and
- columns that are frequently aggregated together (e.g. a query rarely has “group 
  by productName” but more often has “group by manufacturer, brand, productName”).

These allow the kinds of storage optimization that hierarchies allow in Kylin, 
but they can be inferred without human intervention*, are more general, and 
less restrictive. For example, when choosing the set of cuboids you would tend 
to include highly correlated columns (if you have just built a cuboid using 
zipcode, there is a high benefit and low incremental cost to add state and 
nation to it because state is highly correlated and nation is functionally 
dependent). That gives the same outcome as having an explicit (nation, state, 
zipcode) hierarchy.
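
To make the zipcode example concrete, here is a minimal sketch in Python of how 
such a decision could be driven by profiling alone. Everything in it is an 
assumption made for illustration: the function names, the 0.95 threshold and 
the sample rows are hypothetical, and this is not how Kylin or Calcite actually 
choose cuboids.

```python
from collections import Counter, defaultdict


def dependency_strength(rows, x, y):
    """Fraction of rows whose y value is the most common y for their x value.

    1.0 means y is functionally dependent on x in this sample; a value just
    below 1.0 means the two columns are highly correlated.
    """
    by_x = defaultdict(Counter)
    for row in rows:
        by_x[row[x]][row[y]] += 1
    total = sum(sum(counts.values()) for counts in by_x.values())
    agree = sum(max(counts.values()) for counts in by_x.values())
    return agree / total if total else 1.0


def columns_to_add(rows, seed, candidates, threshold=0.95):
    """Candidate columns that are cheap to add to a cuboid built on `seed`.

    The threshold is an arbitrary illustrative cut-off, not a tuned value.
    """
    return [c for c in candidates
            if dependency_strength(rows, seed, c) >= threshold]


# Hypothetical sample of fact rows joined to a location dimension.
rows = [
    {"zipcode": "94105", "state": "CA", "nation": "US", "product": "A"},
    {"zipcode": "94105", "state": "CA", "nation": "US", "product": "B"},
    {"zipcode": "10001", "state": "NY", "nation": "US", "product": "A"},
    {"zipcode": "10001", "state": "NY", "nation": "US", "product": "C"},
    {"zipcode": "M5V", "state": "ON", "nation": "CA", "product": "B"},
    {"zipcode": "M5V", "state": "ON", "nation": "CA", "product": "D"},
]

print(columns_to_add(rows, "zipcode", ["state", "nation", "product"]))
```

With this sample, state and nation pass the threshold and product does not, so 
the cuboid on zipcode would be extended with state and nation without anyone 
having declared a hierarchy.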

So, I am not claiming that hierarchies are not useful; I am claiming that they 
are not essential. If you were to remove explicit support for hierarchies and 
replace them with fuzzier concepts like highly correlated columns you might 
find that the system becomes radically simpler at its core.

Forgive me for being provocative. I want to challenge assumptions. If the 
architecture is working fine, feel free to disregard. But if you are seeing 
signs of architectural strain, this might be an opportunity to simplify.

Julian

* Functional dependencies can be inferred from the underlying star schema. 
Calcite’s aggregate designer discovers highly correlated columns with no human 
intervention, just by profiling the data; and columns that are frequently 
aggregated together could be discovered by looking at query logs. Kylin could 
do something similar.
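
As a rough illustration of the query-log idea, the sketch below (Python) counts 
how often pairs of columns appear in the same GROUP BY. The input format, a 
list of column-name lists already extracted from each query, is an assumption; 
so are the function name and the 0.6 cut-off. It does not reflect any real 
Kylin or Calcite log format or API.

```python
from collections import Counter
from itertools import combinations


def co_grouped_pairs(group_by_lists, min_share=0.6):
    """Column pairs that are usually grouped together.

    For each pair, the share is the number of queries using both columns
    divided by the number of queries using either one (Jaccard similarity).
    Only pairs with a share of at least min_share are returned.
    """
    column_counts = Counter()
    pair_counts = Counter()
    for columns in group_by_lists:
        unique = sorted(set(columns))
        column_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    result = {}
    for (a, b), both in pair_counts.items():
        either = column_counts[a] + column_counts[b] - both
        share = both / either
        if share >= min_share:
            result[(a, b)] = share
    return result


# Hypothetical GROUP BY column lists pulled from a query log.
log = [
    ["manufacturer", "brand", "productName"],
    ["manufacturer", "brand", "productName"],
    ["manufacturer", "brand"],
    ["year"],
    ["year", "month"],
]

for pair, share in sorted(co_grouped_pairs(log).items()):
    print(pair, round(share, 2))
```

Pairs that score highly here are candidates for being materialized together, 
which is the same hint a declared hierarchy would have given.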
