Thanks for the explanations, Hongbin and Li. We seem to have a decent understanding of hierarchical and derived dimensions.
For hierarchical dimensions, the columns that are part of the hierarchy each add an extra level to the cuboids. They become part of the rowkey, and cubing happens on those columns too.

For derived dimensions, the query is rewritten to use the join key, and the in-memory lookup table is then used to rewrite the HBase response into values of the derived dimension.

However, there is also something called a 'Normal' dimension (only one column at a time), and we are trying to see how it works during query resolution. Is this the mandatory dimension? And since the UI allows only one column per 'Normal' dimension, do we have to create one for each column?

Also, a good write-up about the types of dimensions and when to use each would be really helpful for users who do not want to get into the code to figure things out. Otherwise, clarification requests like this one will probably keep coming up. Just a thought.

Regards,
Abhilash

On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:

> Kylin assumes lookup tables are small (<100MB), so they can fit in memory.
> In your model, if order or customer go beyond millions, then they have to
> be on the fact table. Like Hongbin mentioned, an easy way is to use a Hive
> view.
>
> About analyzing ultra-high cardinality columns (like millions of
> customers), we see two common use cases.
>
> 1. TopN analysis. Returning millions of records is not useful at all;
> instead, returning the TopN big customers makes much better sense.
> KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
> feature under development that aims to respond to TopN queries in
> subsecond time.
>
> 2. Focused analysis. Looking at a specific customer (e.g. where
> customer=A). Such a query can be very fast if you create a cube with
> customer as a Mandatory dimension.
>
> Cheers
> Yang
>
> On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:
>
> > Kylin handles star schema well, but may encounter issues like OOM in
> > your case.
> > How many large lookup tables do you have?
> > I'm not sure if an eviction policy will help, because any time a SQL
> > query involves the lookup table, the lookup table snapshot will have to
> > be loaded again (so the snapshots keep swapping in and out).
> >
> > One way to solve the problem is to join your tables into a flattened
> > table using a Hive view, providing Kylin with a single big fact table.
> > And please avoid using a dictionary on high cardinality columns.
> >
> > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]>
> > wrote:
> >
> > > Thanks for replying Hongbin,
> > >
> > > for 1) we are trying to add some sort of eviction-based cache instead
> > > of a map. However, we are still trying to figure out what to do for 3).
> > >
> > > What is the general advice? The case here is: I have order details
> > > as a fact, and order as a dimension, and also customer. Each of these
> > > will run into many millions. Also, the f-key is not a long/bigint, it's
> > > a string which is a combination of our custom columns. Making it a
> > > dictionary will not work, as we understand it. Please suggest what
> > > approach should be taken.
> > >
> > > Regards,
> > > Abhilash
> > >
> > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]> wrote:
> > >
> > > > for 1) .. seems like only the resource path / table desc etc is
> > > > kept in memory while a new lookupstringtable is created per
> > > > query/request which holds onto data for the lifetime of the request.
> > > > So once the request is done, it should be garbage collectable ?
> > > >
> > > > /table is just for the Hive table's schema. The lookup table content
> > > > is cached in SnapshotManager and is not evicted so far. So if you
> > > > have a lot of large lookup tables this will be a problem.
> > > >
> > > > 3) Also the derived filter translator, is there a way to modify the
> > > > 'IN_THRESHOLD' via config file ?
> > > >
> > > > Are you facing performance issues with a lot of IN clauses? If so,
> > > > please take a look at https://issues.apache.org/jira/browse/KYLIN-740,
> > > > the patch will be merged into the next release.
> > > >
> > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]>
> > > > wrote:
> > > >
> > > > > Sorry for the confusion,
> > > > >
> > > > > for 1) .. seems like only the resource path / table desc etc is
> > > > > kept in memory while a new lookupstringtable is created per
> > > > > query/request which holds onto data for the lifetime of the
> > > > > request. So once the request is done, it should be garbage
> > > > > collectable ?
> > > > >
> > > > > 3) Also the derived filter translator, is there a way to modify
> > > > > the 'IN_THRESHOLD' via config file ?
> > > > >
> > > > > Regards,
> > > > > Abhilash
> > > > >
> > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We started noticing that the Kylin Tomcat server is taking a lot
> > > > > > of RAM. It even hit a limit of 10GB.
> > > > > >
> > > > > > After spending some time going over the code, it seems like the
> > > > > > cube enumerator is not storing anything in memory. But the lookup
> > > > > > table enumerator seems to be loading all records and storing
> > > > > > them in memory.
> > > > > >
> > > > > > 1) What happens when there are a lot of projects defined and we
> > > > > > end up with tons of lookup tables across them? Do they get
> > > > > > swapped out automatically? I am not able to track where eviction
> > > > > > is happening. The snapshot manager has a 'removeSnapshot' but
> > > > > > its intent seems different to me.
> > > > > >
> > > > > > 2) How do we handle a really high cardinality dimension? E.g. if
> > > > > > I have sales as a fact and customers as a dimension, there will
> > > > > > be millions of customers. A store is a good candidate to keep in
> > > > > > memory, but customers are not. What's the recommended setting
> > > > > > while creating the cube to handle such a case?
> > > > > >
> > > > > > Regards,
> > > > > > Abhilash
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > *Bin Mahone | 马洪宾*
> > > > Apache Kylin: http://kylin.io
> > > > Github: https://github.com/binmahone
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
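
To make the derived-dimension flow described at the top of this message concrete, here is a minimal Java sketch of how we currently understand it: a filter on the derived column is first turned into the matching join keys, and the measures that come back keyed by join key are then mapped to the derived value through the in-memory lookup table and re-aggregated. Every class, method and sample value below is hypothetical and invented for illustration; none of it is taken from the Kylin code base, and the real implementation may work differently.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: these names are invented, not Kylin classes.
final class DerivedDimensionSketch {

    // In-memory lookup snapshot: join key -> derived column value.
    private final Map<String, String> lookup = new LinkedHashMap<>();

    void addLookupRow(String joinKey, String derivedValue) {
        lookup.put(joinKey, derivedValue);
    }

    // Step 1: turn a filter on the derived column into the matching join keys,
    // which is what the storage layer can actually be queried on.
    Set<String> joinKeysFor(String derivedValue) {
        Set<String> keys = new LinkedHashSet<>();
        for (Map.Entry<String, String> row : lookup.entrySet()) {
            if (row.getValue().equals(derivedValue)) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }

    // Step 2: storage returns measures keyed by join key; map each key back to
    // its derived value and re-aggregate rows that share that value.
    Map<String, Long> rollUp(Map<String, Long> measureByJoinKey) {
        Map<String, Long> measureByDerived = new LinkedHashMap<>();
        for (Map.Entry<String, Long> row : measureByJoinKey.entrySet()) {
            String derived = lookup.get(row.getKey());
            if (derived != null) { // keys missing from the lookup are skipped
                measureByDerived.merge(derived, row.getValue(), Long::sum);
            }
        }
        return measureByDerived;
    }
}
```

With made-up lookup rows S1 -> North, S2 -> North and S3 -> South, a filter on "North" resolves to the join keys S1 and S2, and measures of 10 and 5 for those keys roll up to 15 for North. The sketch also shows why the lookup table has to sit in memory for this approach: both steps scan or probe it on every query.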
