Either would be fine. Ideally Kylin would optimize storage automatically but, 
pragmatically, it’s reasonable to allow the user to supply hints. Hierarchies 
are a natural way for the user to supply hints but it seems to me that they 
shouldn’t exist in the deeper parts of the system.
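
To make "optimize storage automatically" a little more concrete, here is a
minimal sketch in Python (not Kylin or Calcite code) of how a profiler might
score a pair of columns. The function names, the 0.95 threshold and the sample
rows are assumptions made up for illustration. For columns x and y it computes
the fraction of rows whose y value is the most common y for their x value.

```python
from collections import Counter, defaultdict


def dependency_strength(rows, x, y):
    """Fraction of rows whose y value is the most common y for their x value.

    1.0 means y is functionally dependent on x in this sample; values just
    below 1.0 suggest the two columns are highly correlated.
    """
    counts = defaultdict(Counter)  # x value -> Counter of y values seen with it
    for row in rows:
        counts[row[x]][row[y]] += 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    agreeing = sum(c.most_common(1)[0][1] for c in counts.values())
    return agreeing / total


def classify(rows, x, y, threshold=0.95):
    """Label the x -> y relationship; the threshold is an arbitrary assumption."""
    strength = dependency_strength(rows, x, y)
    if strength == 1.0:
        return "functional dependency"
    if strength >= threshold:
        return "highly correlated"
    return "no strong dependency"


# Hypothetical sample: one zipcode straddles two states.
sample = (
    [{"zipcode": "11111", "state": "AA", "nation": "N1"}] * 99
    + [{"zipcode": "11111", "state": "BB", "nation": "N1"}] * 1
    + [{"zipcode": "22222", "state": "CC", "nation": "N2"}] * 100
)
zip_to_state = classify(sample, "zipcode", "state")
state_to_nation = classify(sample, "state", "nation")
```

This measure is deliberately naive: it also scores high when y has very few
distinct values or x is nearly unique, so a real profiler would need to correct
for cardinality before treating the result as a storage hint.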

Julian


> On Jun 21, 2015, at 8:11 AM, Adunuthula, Seshu <[email protected]> wrote:
> 
> Julian,
> 
> Inferring implicit hierarchies from highly correlated columns sounds
> like an intriguing idea. Are you thinking that Kylin would automatically
> infer that a set of columns is correlated and allow for storage
> optimization, or more of a lazy specification of the hierarchies at the
> time of cuboid definition?
> 
> Wanted to hear Yang’s thoughts on this.
> 
> Regards
> Seshu
> 
> On 6/19/15, 12:03 PM, "Julian Hyde" <[email protected]> wrote:
> 
>> I’d like to ask a provocative question: Why does Kylin have hierarchies?
>> 
>> There may be some good reasons, but having thought for a long time about
>> OLAP architectures I have come to the conclusion that hierarchies can be
>> more trouble than they are worth. I regret that I made them so central to
>> Mondrian’s architecture; they are a part of the MDX language, so Mondrian
>> had to have them in some form, but more of the system should have been
>> built using attributes. Since Kylin is SQL-based, it doesn’t need
>> hierarchies at all.
>> 
>> In OLAP, hierarchies are really useful in the presentation layer: a
>> hierarchy is a drill path. If a user has just expanded attribute A (e.g.
>> Year) then they are very likely to want to expand attribute B (e.g.
>> Month) or C (e.g. Week). So, hierarchies improve the user’s experience.
>> 
>> In the engine and storage layer there are some concepts similar to
>> hierarchies:
>> - functional dependencies (i.e. for a given value of X, column Y always
>>   has the same value),
>> - highly correlated columns (e.g. for a given value of zipcode, state
>>   almost always has the same value), and
>> - columns that are frequently aggregated together (e.g. a query rarely
>>   has “group by productName” but more often has “group by manufacturer,
>>   brand, productName”).
>> 
>> These allow the kinds of storage optimization that hierarchies allow in
>> Kylin, but they can be inferred without human intervention*, are more
>> general, and less restrictive. For example, when choosing the set of
>> cuboids you would tend to include highly correlated columns (if you have
>> just built a cuboid using zipcode, there is a high benefit and low
>> incremental cost to add state and nation to it because state is highly
>> correlated and nation is functionally dependent). Same outcome as having
>> an explicit (nation, state, zipcode) hierarchy.
>> 
>> So, I am not claiming that hierarchies are not useful; I am claiming that
>> they are not essential. If you were to remove explicit support for
>> hierarchies and replace them with fuzzier concepts like highly correlated
>> columns you might find that the system becomes radically simpler at its
>> core.
>> 
>> Forgive me for being provocative. I want to challenge assumptions. If the
>> architecture is working fine, feel free to disregard. But if you are
>> seeing signs of architectural strain, this might be an opportunity to
>> simplify.
>> 
>> Julian
>> 
>> * Functional dependencies can be inferred from the underlying star schema.
>> Calcite’s aggregate designer discovers highly correlated columns with no
>> human intervention, just by profiling the data; and columns that are
>> frequently aggregated together could be discovered by looking at query
>> logs. Kylin could do something similar.
> 
