Julian,

Inferring implicit hierarchies from highly correlated columns sounds like an intriguing idea. Are you thinking that Kylin would automatically infer that a set of columns is correlated and use that for storage optimization, or is this more of a lazy specification of the hierarchies at the time of cuboid definition?
Wanted to hear Yang's thoughts on this.

Regards
Seshu

On 6/19/15, 12:03 PM, "Julian Hyde" <[email protected]> wrote:

>I'd like to ask a provocative question: Why does Kylin have hierarchies?
>
>There may be some good reasons, but having thought for a long time about
>OLAP architectures I have come to the conclusion that hierarchies can be
>more trouble than they are worth. I regret that I made them so central to
>Mondrian's architecture; they are a part of the MDX language, so Mondrian
>had to have them in some form, but more of the system should have been
>built using attributes. Since Kylin is SQL-based, it doesn't need
>hierarchies at all.
>
>In OLAP, hierarchies are really useful in the presentation layer: a
>hierarchy is a drill path. If a user has just expanded attribute A (e.g.
>Year) then they are very likely to want to expand attribute B (e.g.
>Month) or C (e.g. Week). So, hierarchies improve the user's experience.
>
>In the engine and storage layer there are some concepts similar to
>hierarchies:
>functional dependencies (i.e. for a given value of X, column Y always has
>the same value),
>highly correlated columns (e.g. for a given value of zipcode, state
>almost always has the same value), and
>columns that are frequently aggregated together (e.g. a query rarely has
>"group by productName" but more often has "group by manufacturer, brand,
>productName").
>
>These allow the kinds of storage optimization that hierarchies allow in
>Kylin, but they can be inferred without human intervention*, are more
>general, and less restrictive. For example, when choosing the set of
>cuboids you would tend to include highly correlated columns (if you have
>just built a cuboid using zipcode, there is a high benefit and low
>incremental cost to add state and nation to it because state is highly
>correlated and nation is functionally dependent). Same outcome as having
>an explicit (nation, state, zipcode) hierarchy.
>
>So, I am not claiming that hierarchies are not useful; I am claiming that
>they are not essential. If you were to remove explicit support for
>hierarchies and replace them with fuzzier concepts like highly correlated
>columns you might find that the system becomes radically simpler at its
>core.
>
>Forgive me for being provocative. I want to challenge assumptions. If the
>architecture is working fine, feel free to disregard. But if you are
>seeing signs of architectural strain, this might be an opportunity to
>simplify.
>
>Julian
>
>* Functional dependencies can be inferred from the underlying star schema.
>Calcite's aggregate designer discovers highly correlated columns with no
>human intervention, just by profiling the data; and columns that are
>frequently aggregated together could be discovered by looking at query
>logs. Kylin could do something similar.
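
To make the profiling idea in the footnote a little more concrete, here is a minimal sketch in Python of how a functional dependency or a highly correlated pair of columns might be spotted from sample rows. This is not Calcite's aggregate designer algorithm, nor anything that exists in Kylin; the function name `dependency_strength`, the scoring rule, and the sample rows are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def dependency_strength(rows, x, y):
    """Fraction of rows whose y value is the most common y for their x value.

    A result of 1.0 means y is functionally dependent on x in this sample;
    a result just below 1.0 suggests the two columns are highly correlated.
    (Hypothetical scoring rule, chosen only for illustration.)
    """
    # For each distinct x value, count how often each y value occurs with it.
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[x]][row[y]] += 1

    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0

    # Rows that agree with the dominant y value of their x group.
    agreeing = sum(max(c.values()) for c in counts.values())
    return agreeing / total


# Made-up sample rows: zipcode Z3 straddles two states, S3 and S4.
rows = [
    {"zipcode": "Z1", "state": "S1", "nation": "N1"},
    {"zipcode": "Z1", "state": "S1", "nation": "N1"},
    {"zipcode": "Z2", "state": "S2", "nation": "N1"},
    {"zipcode": "Z3", "state": "S3", "nation": "N1"},
    {"zipcode": "Z3", "state": "S4", "nation": "N1"},
]

print(dependency_strength(rows, "zipcode", "nation"))  # 1.0: functionally dependent
print(dependency_strength(rows, "zipcode", "state"))   # 0.8: correlated, not dependent
```

One caveat with a score like this: a column that is unique per row trivially scores 1.0 against every other column, so a real profiler would presumably also weigh cardinality before treating a pair as a cheap addition to a cuboid.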
