I’d like to ask a provocative question: Why does Kylin have hierarchies? There may be some good reasons, but, having thought for a long time about OLAP architectures, I have come to the conclusion that hierarchies can be more trouble than they are worth. I regret that I made them so central to Mondrian’s architecture; they are a part of the MDX language, so Mondrian had to have them in some form, but more of the system should have been built using attributes. Since Kylin is SQL-based, it doesn’t need hierarchies at all.
In OLAP, hierarchies are really useful in the presentation layer: a hierarchy is a drill path. If a user has just expanded attribute A (e.g. Year) then they are very likely to want to expand attribute B (e.g. Month) or C (e.g. Week). So, hierarchies improve the user’s experience.
In the engine and storage layer there are some concepts similar to hierarchies: functional dependencies (i.e. for a given value of X, column Y always has the same value), highly correlated columns (e.g. for a given value of zipcode, state almost always has the same value), and columns that are frequently aggregated together (e.g. a query rarely has “group by productName” but more often has “group by manufacturer, brand, productName”). These allow the kinds of storage optimization that hierarchies allow in Kylin, but they can be inferred without human intervention*, are more general, and are less restrictive.
For example, when choosing the set of cuboids you would tend to include highly correlated columns: if you have just built a cuboid using zipcode, there is a high benefit and low incremental cost to adding state and nation to it, because state is highly correlated and nation is functionally dependent. That gives the same outcome as having an explicit (nation, state, zipcode) hierarchy.
So, I am not claiming that hierarchies are not useful; I am claiming that they are not essential. If you were to remove explicit support for hierarchies and replace them with fuzzier concepts like highly correlated columns, you might find that the system becomes radically simpler at its core.
Forgive me for being provocative. I want to challenge assumptions. If the architecture is working fine, feel free to disregard. But if you are seeing signs of architectural strain, this might be an opportunity to simplify.
Julian
* Functional dependencies can be inferred from the underlying star schema.
Calcite’s aggregate designer discovers highly correlated columns with no human intervention, just by profiling the data; and columns that are frequently aggregated together could be discovered by looking at query logs. Kylin could do something similar.
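
To make the profiling idea a little more concrete, here is a minimal Python sketch of how a functional dependency or a high correlation between two columns might be detected from sample rows. It is not the algorithm Calcite's aggregate designer uses, and it is not Kylin code; the function names, the threshold, and the sample rows are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def dependency_strength(rows, x, y):
    """Fraction of rows whose y value is the most common y for their x value.

    1.0 means x determines y in this sample (a functional dependency);
    values close to 1.0 suggest the columns are highly correlated.
    """
    # For each distinct x value, count how often each y value occurs with it.
    y_counts_by_x = defaultdict(Counter)
    for row in rows:
        y_counts_by_x[row[x]][row[y]] += 1

    total = sum(sum(counts.values()) for counts in y_counts_by_x.values())
    if total == 0:
        return 0.0
    # Rows that agree with the dominant y value of their x group.
    agreeing = sum(max(counts.values()) for counts in y_counts_by_x.values())
    return agreeing / total


def classify(rows, x, y, threshold=0.9):
    """Label the x -> y relationship. The threshold is an arbitrary assumption."""
    strength = dependency_strength(rows, x, y)
    if strength == 1.0:
        return "functional dependency"
    if strength >= threshold:
        return "highly correlated"
    return "weak"


# Invented sample data: one zipcode value appears with two different states.
rows = (
    [{"zipcode": "Z1", "state": "CA", "nation": "USA"}] * 4
    + [{"zipcode": "Z2", "state": "NY", "nation": "USA"}] * 4
    + [{"zipcode": "Z3", "state": "KY", "nation": "USA"}]
    + [{"zipcode": "Z3", "state": "TN", "nation": "USA"}]
    + [{"zipcode": "Z4", "state": "ON", "nation": "CAN"}] * 2
)

for x, y in [("zipcode", "state"), ("state", "nation"), ("nation", "state")]:
    print(x, "->", y, ":", classify(rows, x, y))
```

One caveat with a measure this simple: a column that is nearly unique will appear to determine every other column, so a real profiler would also need to take column cardinality into account before adding columns to a cuboid.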
