Either would be fine. Ideally Kylin would optimize storage automatically but, pragmatically, it’s reasonable to allow the user to supply hints. Hierarchies are a natural way for the user to supply hints but it seems to me that they shouldn’t exist in the deeper parts of the system.

Julian

> On Jun 21, 2015, at 8:11 AM, Adunuthula, Seshu <[email protected]> wrote:
>
> Julian,
>
> Inferring implicit hierarchies from highly correlated columns sounds
> like an intriguing idea. Are you thinking Kylin would auto-infer that a
> set of columns are correlated and allow for storage optimization, or more
> of a lazy specification of the hierarchies at the time of cuboid definition?
>
> Wanted to hear Yang's thoughts on this.
>
> Regards
> Seshu
>
> On 6/19/15, 12:03 PM, "Julian Hyde" <[email protected]> wrote:
>
>> I'd like to ask a provocative question: Why does Kylin have hierarchies?
>>
>> There may be some good reasons, but having thought for a long time about
>> OLAP architectures I have come to the conclusion that hierarchies can be
>> more trouble than they are worth. I regret that I made them so central to
>> Mondrian's architecture; they are a part of the MDX language, so Mondrian
>> had to have them in some form, but more of the system should have been
>> built using attributes. Since Kylin is SQL-based, it doesn't need
>> hierarchies at all.
>>
>> In OLAP, hierarchies are really useful in the presentation layer: a
>> hierarchy is a drill path. If a user has just expanded attribute A (e.g.
>> Year) then they are very likely to want to expand attribute B (e.g.
>> Month) or C (e.g. Week). So, hierarchies improve the user's experience.
>>
>> In the engine and storage layer there are some concepts similar to
>> hierarchies:
>> functional dependencies (i.e. for a given value of X, column Y always has
>> the same value),
>> highly correlated columns (e.g. for a given value of zipcode, state
>> almost always has the same value), and
>> columns that are frequently aggregated together (e.g. a query rarely has
>> "group by productName" but more often has "group by manufacturer, brand,
>> productName").
>>
>> These allow the kinds of storage optimization that hierarchies allow in
>> Kylin, but they can be inferred without human intervention*, are more
>> general, and less restrictive. For example, when choosing the set of
>> cuboids you would tend to include highly correlated columns (if you have
>> just built a cuboid using zipcode, there is a high benefit and low
>> incremental cost to add state and nation to it because state is highly
>> correlated and nation is functionally dependent). The same outcome as
>> having an explicit (nation, state, zipcode) hierarchy.
>>
>> So, I am not claiming that hierarchies are not useful; I am claiming that
>> they are not essential. If you were to remove explicit support for
>> hierarchies and replace them with fuzzier concepts like highly correlated
>> columns you might find that the system becomes radically simpler at its
>> core.
>>
>> Forgive me for being provocative. I want to challenge assumptions. If the
>> architecture is working fine, feel free to disregard. But if you are
>> seeing signs of architectural strain, this might be an opportunity to
>> simplify.
>>
>> Julian
>>
>> * Functional dependencies can be inferred from the underlying star schema.
>> Calcite's aggregate designer discovers highly correlated columns with no
>> human intervention, just by profiling the data; and columns that are
>> frequently aggregated together could be discovered by looking at query
>> logs. Kylin could do something similar.
>
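
To make the profiling idea in the footnote concrete, here is a minimal sketch in Python. It is not how Calcite's aggregate designer or Kylin actually works; the function names (dependency_strength, classify), the 0.75 threshold, and the sample rows are all assumptions made up for illustration. It scores how well column X predicts column Y: a score of 1.0 means Y is functionally dependent on X in the sample, and a score close to 1.0 suggests the columns are highly correlated.

```python
from collections import defaultdict


def dependency_strength(rows, x, y):
    """Fraction of rows whose y value is the most common y for their x value.

    1.0 means y is functionally dependent on x within this sample.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[x]][row[y]] += 1
    total = sum(sum(ys.values()) for ys in counts.values())
    if total == 0:
        raise ValueError("cannot profile an empty sample")
    agreeing = sum(max(ys.values()) for ys in counts.values())
    return agreeing / total


def classify(strength, threshold=0.75):
    """Label a score; the threshold is an arbitrary illustrative cut-off."""
    if strength == 1.0:
        return "functional dependency"
    if strength >= threshold:
        return "highly correlated"
    return "weak"


# Hypothetical sample: one zipcode (86515) spans two states.
rows = [
    {"zipcode": "94025", "state": "CA", "nation": "US"},
    {"zipcode": "94025", "state": "CA", "nation": "US"},
    {"zipcode": "10001", "state": "NY", "nation": "US"},
    {"zipcode": "86515", "state": "AZ", "nation": "US"},
    {"zipcode": "86515", "state": "NM", "nation": "US"},
]

zip_to_state = dependency_strength(rows, "zipcode", "state")
state_to_nation = dependency_strength(rows, "state", "nation")
print("zipcode -> state:", zip_to_state, classify(zip_to_state))
print("state -> nation:", state_to_nation, classify(state_to_nation))
```

A cuboid planner could use scores like these in place of a declared hierarchy: once zipcode is in a cuboid, columns that zipcode predicts well (state, nation) add few extra rows, so they are cheap to include.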
