Thanks for the clarification. We were wondering the same thing. For a given cuboid, query performance will be very sensitive to the order of columns in the row key, similar to indexes in an RDBMS.

Regards,
Abhilash

On Thu, Sep 3, 2015 at 7:21 PM, Shi, Shaofeng <[email protected]> wrote:

> Hi Abhilash,
>
> “Mandatory” is a property on a row key column; you can see the option in
> the “Advanced” step. If a column is set to “Mandatory=true”, it will be
> moved to the head position of the row key, and that column will not be
> aggregated when calculating the cube. This avoids unnecessary
> calculation and storage. If your query has a where condition on that
> mandatory column, the query performance will be very good.
>
> Let me give a sample. Assume I have a fact table which has the following
> dimensions: date, seller, country.
>
> Among them, date and country are low cardinality columns, and seller is a
> high cardinality column. As almost all my queries have seller specified,
> I set “seller” as mandatory in the row key; this column is then moved to
> the head of the row key and will not be aggregated. The HBase row keys
> will be like:
>
> seller1,cal_dt,country -->
> seller2,cal_dt,country -->
> seller3,cal_dt,country -->
> ...
> sellerN,cal_dt,country -->
>
> seller1,cal_dt -->
> seller2,cal_dt -->
> seller3,cal_dt -->
> ...
> sellerN,cal_dt -->
>
> seller1,country -->
> seller2,country -->
> seller3,country -->
> ...
> sellerN,country -->
>
> As the seller’s cardinality is high, when given a seller value, the HBase
> scan range will be very small, so the query performance will be good.
>
> If you have SQLs which have no “seller” specified, this cube
> may not provide the same response time. In that case we would suggest creating
> another cube without the seller dimension. Multiple cubes can co-exist in one
> project, and Kylin will pick the most appropriate cube to serve the
> queries.
>
> On 9/2/15, 7:41 PM, "Abhilash L L" <[email protected]> wrote:
>
> > Thanks for explanations Hongbin and Li,
> >
> > We seem to have a decent understanding of hierarchical and derived
> > dimensions.
> >
> > For hierarchical, the columns that are part of the hierarchy also
> > participate in adding an extra level to cuboids. They become part of the
> > rowkey as well, and cubing happens on those columns as well.
> >
> > For derived, the query is rewritten to use the join key, and then the
> > in-memory lookup table is used to rewrite the HBase response to values with
> > the derived dimension.
> >
> > However, there is something called a 'Normal' dimension (only one column
> > at a time), and we are trying to see how it works during query
> > resolution. Is this the mandatory dimension? But since the UI allows only
> > one column per 'Normal' dimension, do we have to create one for each column?
> >
> > Also, a good write-up about the types of dimensions and when to use each
> > type would be really helpful for users who do not want to get into the code
> > to figure out stuff. The clarification-seeking requests might keep coming up
> > as well. Just a thought.
> >
> > Regards,
> > Abhilash
> >
> > On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:
> >
> >> Kylin assumes lookup tables to be small (<100MB), so they can fit in memory.
> >> In your model, if order or customer go beyond millions, then they have to
> >> be on the fact table. Like Hongbin mentioned, an easy way is to use a Hive
> >> view.
> >>
> >> About analyzing ultra-high cardinality columns (like millions of
> >> customers), we see two common use cases.
> >>
> >> 1. TopN analysis. Returning millions of records is not useful at all;
> >> instead, returning the TopN big customers makes much better sense.
> >> KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
> >> feature under development that aims to respond to TopN queries in
> >> under a second.
> >>
> >> 2. Focused analysis. Looking at a specific customer (e.g. where
> >> customer=A). Such a query can be very fast by creating a cube with customer
> >> as a Mandatory dimension.
> >>
> >> Cheers
> >> Yang
> >>
> >> On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:
> >>
> >> > Kylin handles star schema well, but may encounter issues like OOM in your
> >> > case.
> >> > How many large lookup tables do you have?
> >> > I'm not sure if an eviction policy will help, because anytime a SQL involves
> >> > the lookup table, the lookup table snapshot will have to be loaded again (so
> >> > the snapshots are swapping in and out).
> >> >
> >> > One way to solve the problem is to join your tables into a flattened table
> >> > using a Hive view, providing Kylin with a single big fact table. And please
> >> > note: avoid using a dictionary on high cardinality columns.
> >> >
> >> > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]> wrote:
> >> >
> >> > > Thanks for replying Hongbin,
> >> > >
> >> > > for 1) we are trying to add some sort of eviction-based cache instead
> >> > > of a map. However, we are still trying to figure out what to do for 3).
> >> > >
> >> > > What is the general advice? The case here is: I have order details
> >> > > as a fact, and order as a dimension, and also customer. Now each of these
> >> > > will run into many millions. Also, the f-key is not a long/bigint; it's a
> >> > > string which is a combination of our custom columns. Making it a dictionary
> >> > > will not work, as we understand. Please suggest what approach should be
> >> > > taken.
> >> > >
> >> > > Regards,
> >> > > Abhilash
> >> > >
> >> > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]> wrote:
> >> > >
> >> > > > for 1) .. seems like only the resource path / table desc etc is only
> >> > > > kept in memory while a new lookupstringtable is created per query/request
> >> > > > which holds onto data for the lifetime of the request.
> >> > > > So once the request is done, it should be garbage collectable ?
> >> > > >
> >> > > > /table is just for the hive table's schema; the lookup table content is
> >> > > > cached in SnapshotManager and it will not be evicted so far. So if you
> >> > > > have a lot of large lookup tables this will be a problem.
> >> > > >
> >> > > > 3) Also the derived filter translator, is there a way to modify the
> >> > > > 'IN_THRESHOLD' via config file ?
> >> > > >
> >> > > > Are you facing performance issues with a lot of IN clauses? If so, please
> >> > > > take a look at https://issues.apache.org/jira/browse/KYLIN-740; the
> >> > > > patch will be merged into the next release.
> >> > > >
> >> > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]> wrote:
> >> > > >
> >> > > > > Sorry for the confusion,
> >> > > > >
> >> > > > > for 1) .. seems like only the resource path / table desc etc is only
> >> > > > > kept in memory while a new lookupstringtable is created per query/request
> >> > > > > which holds onto data for the lifetime of the request. So once the
> >> > > > > request is done, it should be garbage collectable ?
> >> > > > >
> >> > > > > 3) Also the derived filter translator, is there a way to modify the
> >> > > > > 'IN_THRESHOLD' via config file ?
> >> > > > >
> >> > > > > Regards,
> >> > > > > Abhilash
> >> > > > >
> >> > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]> wrote:
> >> > > > >
> >> > > > > > Hello,
> >> > > > > >
> >> > > > > > We started noticing that the Kylin tomcat server is taking a lot of
> >> > > > > > RAM. It even hit a limit of 10GB.
> >> > > > > >
> >> > > > > > After spending some time going over the code, it seems like the
> >> > > > > > cube enumerator is not storing anything in memory. But the lookup table
> >> > > > > > enumerator seems to be loading all records and storing them in memory.
> >> > > > > >
> >> > > > > > 1) What happens when there are a lot of projects defined and we end up
> >> > > > > > with tons of lookup tables across them? Do they get swapped out
> >> > > > > > automatically? I am not able to track where eviction is happening. The
> >> > > > > > snapshot manager has a 'removeSnapshot', but its intent seems different
> >> > > > > > to me.
> >> > > > > >
> >> > > > > > 2) How do we handle a really high cardinality dimension? E.g., if I
> >> > > > > > have sales as a fact and customers as a dimension, there will be
> >> > > > > > millions of customers. A store is a good candidate to keep in memory,
> >> > > > > > but customers are not. What's the recommended setting while creating
> >> > > > > > the cube to handle such a case?
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Abhilash
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > *Bin Mahone | 马洪宾*
> >> > > > Apache Kylin: http://kylin.io
> >> > > > Github: https://github.com/binmahone
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > *Bin Mahone | 马洪宾*
> >> > Apache Kylin: http://kylin.io
> >> > Github: https://github.com/binmahone
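
To make the point about row key order and the "Mandatory" seller column concrete, here is a minimal sketch in Python. It is not Kylin or HBase code: it only models one cuboid's row keys as a sorted list of tuples and counts how many rows a range scan would touch for a filter on one column. The dimension values and the helper names (`build_rowkeys`, `rows_scanned`) are made up for illustration, and real Kylin row keys are encoded differently, so treat the numbers as a rough picture of why a high cardinality column at the head of the row key gives a small scan range.

```python
from bisect import bisect_left, bisect_right
from itertools import product

# Hypothetical dimension values; a real cube would have far more sellers.
SELLERS = [f"seller{i:04d}" for i in range(1000)]
DATES = ["2015-09-01", "2015-09-02", "2015-09-03"]
COUNTRIES = ["CN", "IN", "US"]
VALUES = {"seller": SELLERS, "cal_dt": DATES, "country": COUNTRIES}


def build_rowkeys(order):
    """Return the sorted row keys of one cuboid for the given column order."""
    return sorted(product(*(VALUES[col] for col in order)))


def rows_scanned(rowkeys, order, column, value):
    """Count the rows a range scan touches to answer `column = value`.

    If the column leads the row key, the matching rows are contiguous in
    sort order, so only that slice is scanned. Otherwise this sketch
    assumes the worst case: a scan over the whole cuboid.
    """
    if order[0] != column:
        return len(rowkeys)
    lo = bisect_left(rowkeys, (value,))
    # chr(0x10FFFF) sorts after any dimension value used in this sketch.
    hi = bisect_right(rowkeys, (value, chr(0x10FFFF)))
    return hi - lo


seller_first = ("seller", "cal_dt", "country")
seller_last = ("cal_dt", "country", "seller")

print(rows_scanned(build_rowkeys(seller_first), seller_first, "seller", "seller0042"))
print(rows_scanned(build_rowkeys(seller_last), seller_last, "seller", "seller0042"))
```

With seller at the head of the key, the filter touches only the 9 rows for that seller (3 dates x 3 countries); with seller at the tail, the sketch scans all 9000 rows. A real scan over a non-leading column may be able to skip some ranges, so the second number is an upper bound under this sketch's assumptions.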
