Cube, Hierarchy Dimension, and Measure are very common concepts in the DW/BI area; I assume the "cube modeler" has experience with them :)
But of course, we should enhance Kylin's terminology page:
http://kylin.incubator.apache.org/docs/gettingstarted/terminology.html

Meanwhile, I would like to recommend this one for reference:
http://www.kimballgroup.com/2008/10/maintaining-dimension-hierarchies/

Hope these help a little bit :)

Thanks.

Best Regards!
---------------------
Luke Han

On Wed, Sep 2, 2015 at 7:41 PM, Abhilash L L <[email protected]> wrote:

> Thanks for the explanations, Hongbin and Li,
>
> We seem to have a decent understanding of hierarchical and derived
> dimensions.
>
> For hierarchical dimensions, the columns that are part of the hierarchy
> also participate in adding an extra level to the cuboids. They become
> part of the rowkey as well, and cubing happens on those columns too.
>
> For derived dimensions, the query is rewritten to use the join key, and
> then the in-memory lookup table is used to rewrite the HBase response
> into values of the derived dimension.
>
> However, there is something called a 'Normal' dimension (only one column
> at a time), and we are trying to see how it works during query
> resolution. Is this the mandatory dimension? And since the UI allows
> only one column per 'Normal' dimension, do we have to create one for
> each column?
>
> Also, a good write-up about the types of dimensions and when to use each
> type would be really helpful for users who do not want to get into the
> code to figure things out. Otherwise, requests for clarification might
> keep coming up. Just a thought.
>
> Regards,
> Abhilash
>
> On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:
>
> > Kylin assumes lookup tables to be small (<100MB), so that they fit in
> > memory. In your model, if order or customer goes beyond millions of
> > rows, then they have to be on the fact table. As Hongbin mentioned, an
> > easy way is to use a Hive view.
> >
> > For analyzing ultra-high-cardinality columns (like millions of
> > customers), we see two common use cases.
> >
> > 1. TopN analysis.
> > Returning millions of records is not useful at all; instead,
> > returning the TopN biggest customers makes much better sense.
> > KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
> > feature under development that aims to respond to TopN queries in
> > under a second.
> >
> > 2. Focused analysis. Looking at a specific customer (e.g. where
> > customer=A). Such a query can be very fast if you create a cube with
> > customer as a Mandatory dimension.
> >
> > Cheers
> > Yang
> >
> > On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:
> >
> > > Kylin handles star schema well, but may encounter issues like OOM in
> > > your case.
> > > How many large lookup tables do you have?
> > > I'm not sure if an eviction policy will help, because any time a SQL
> > > query involves the lookup table, the lookup table snapshot will have
> > > to be loaded again (so the snapshots keep swapping in and out).
> > >
> > > One way to solve the problem is to join your tables into a flattened
> > > table using a Hive view, providing Kylin with a single big fact
> > > table. And please note: avoid using a dictionary on high-cardinality
> > > columns.
> > >
> > > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]>
> > > wrote:
> > >
> > > > Thanks for replying, Hongbin,
> > > >
> > > > For 1), we are trying to add some sort of eviction-based cache
> > > > instead of a map. However, we are still trying to figure out what
> > > > to do for 3).
> > > >
> > > > What is the general advice? The case here is: I have order details
> > > > as a fact, and order as a dimension, and also customer. Now each of
> > > > these will run into many millions. Also, the f-key is not a
> > > > long/bigint; it's a string which is a combination of our custom
> > > > columns. Making it a dictionary will not work, as we understand.
> > > > Please suggest what approach should be taken.
> > > >
> > > > Regards,
> > > > Abhilash
> > > >
> > > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]>
> > > > wrote:
> > > >
> > > > > for 1) .. it seems like only the resource path / table desc etc.
> > > > > are kept in memory, while a new lookupstringtable is created per
> > > > > query/request, which holds onto data for the lifetime of the
> > > > > request. So once the request is done, it should be garbage
> > > > > collectable?
> > > > >
> > > > > /table is just for the Hive table's schema; the lookup table
> > > > > content is cached in SnapshotManager and is not evicted so far.
> > > > > So if you have a lot of large lookup tables, this will be a
> > > > > problem.
> > > > >
> > > > > 3) Also, for the derived filter translator, is there a way to
> > > > > modify the 'IN_THRESHOLD' via a config file?
> > > > >
> > > > > Are you facing performance issues with a lot of IN clauses? If
> > > > > so, please take a look at
> > > > > https://issues.apache.org/jira/browse/KYLIN-740; the patch will
> > > > > be merged into the next release.
> > > > >
> > > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Sorry for the confusion,
> > > > > >
> > > > > > for 1) .. it seems like only the resource path / table desc
> > > > > > etc. are kept in memory, while a new lookupstringtable is
> > > > > > created per query/request, which holds onto data for the
> > > > > > lifetime of the request. So once the request is done, it
> > > > > > should be garbage collectable?
> > > > > >
> > > > > > 3) Also, for the derived filter translator, is there a way to
> > > > > > modify the 'IN_THRESHOLD' via a config file?
> > > > > >
> > > > > > Regards,
> > > > > > Abhilash
> > > > > >
> > > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > We started noticing that the Kylin Tomcat server is taking a
> > > > > > > lot of RAM. It even hit a limit of 10GB.
> > > > > > >
> > > > > > > After spending some time going over the code, it seems like
> > > > > > > the cube enumerator is not storing anything in memory. But
> > > > > > > the lookup table enumerator seems to be loading all records
> > > > > > > and storing them in memory.
> > > > > > >
> > > > > > > 1) What happens when there are a lot of projects defined and
> > > > > > > we end up with tons of lookup tables across them? Do they
> > > > > > > get swapped out automatically? I am not able to track where
> > > > > > > eviction is happening. The snapshot manager has a
> > > > > > > 'removeSnapshot', but its intent seems different to me.
> > > > > > >
> > > > > > > 2) How do we handle really high-cardinality dimensions? E.g.,
> > > > > > > if I have sales as a fact and customers as a dimension, there
> > > > > > > will be millions of customers. A store is a good candidate
> > > > > > > to keep in memory, but customers are not.
> > > > > > > What's the recommended setting while creating the cube to
> > > > > > > handle such a case?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Abhilash
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > *Bin Mahone | 马洪宾*
> > > > > Apache Kylin: http://kylin.io
> > > > > Github: https://github.com/binmahone
> > >
> > > --
> > > Regards,
> > >
> > > *Bin Mahone | 马洪宾*
> > > Apache Kylin: http://kylin.io
> > > Github: https://github.com/binmahone
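
The derived-dimension handling described in this thread (rewrite the filter onto the join key, then use the in-memory lookup table to map results back to the derived column) can be illustrated with a small sketch. This is not Kylin's code: the table contents, the function names, and the value of IN_THRESHOLD are all made up for illustration, and the fallback used when too many keys match is an assumption of this sketch, not a statement about what Kylin's derived filter translator does.

```python
from collections import defaultdict

# Hypothetical in-memory lookup table snapshot: join key -> derived columns.
CUSTOMER_LOOKUP = {
    "C1": {"city": "Bangalore"},
    "C2": {"city": "Shanghai"},
    "C3": {"city": "Bangalore"},
}

# Hypothetical pre-aggregated cube rows that only know the join key.
CUBE_ROWS = [("C1", 120.0), ("C2", 75.5), ("C3", 30.0)]

# Illustrative cap on the size of the rewritten IN clause (assumed value).
IN_THRESHOLD = 2


def translate_derived_filter(lookup, column, value, in_threshold):
    """Rewrite `column = value` on a derived column into a list of join keys.

    Returns None when more keys match than in_threshold allows; in this
    sketch that means "do not push the filter down, filter after lookup".
    """
    keys = sorted(k for k, row in lookup.items() if row[column] == value)
    return keys if len(keys) <= in_threshold else None


def query_sum_by_derived(cube_rows, lookup, column, value,
                         in_threshold=IN_THRESHOLD):
    """Sum the measure for rows whose derived column equals `value`."""
    keys = translate_derived_filter(lookup, column, value, in_threshold)
    totals = defaultdict(float)
    for key, measure in cube_rows:
        # Pushed-down filter on the join key, when one was produced.
        if keys is not None and key not in keys:
            continue
        # Map the join key back to the derived value via the lookup table.
        derived = lookup[key][column]
        if derived != value:
            continue
        totals[derived] += measure
    return dict(totals)


result = query_sum_by_derived(CUBE_ROWS, CUSTOMER_LOOKUP, "city", "Bangalore")
```

The sketch also shows why the lookup table has to stay in memory for this approach: both the filter rewrite and the final mapping read it for every query, which is where the memory pressure discussed above comes from.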
