Thanks for the clarification

We were wondering the same thing. For a given cuboid, query performance
will be very sensitive to the order of columns in the row key, similar
to indexes in an RDBMS.
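
To make the index analogy concrete, here is a minimal sketch in plain Python, not Kylin or HBase code. It assumes row keys are kept sorted as (seller, cal_dt) tuples; the data and the function names are made up for illustration. A filter on the leading column can be answered with a binary search over a contiguous range, while a filter on a trailing column has to look at every row.

```python
from bisect import bisect_left, bisect_right

# Hypothetical row keys for one cuboid, kept sorted as (seller, cal_dt) tuples.
rows = sorted(
    (f"seller{s}", f"2015-09-{d:02d}")
    for s in range(1, 4)
    for d in range(1, 4)
)

def scan_leading(rows, seller):
    """Filter on the leading column: binary search bounds a contiguous range."""
    lo = bisect_left(rows, (seller,))
    hi = bisect_right(rows, (seller, "\uffff"))  # sentinel above any date string
    return rows[lo:hi]

def scan_trailing(rows, cal_dt):
    """Filter on a non-leading column: every row has to be examined."""
    return [r for r in rows if r[1] == cal_dt]

print(scan_leading(rows, "seller2"))
print(scan_trailing(rows, "2015-09-02"))
```

Both calls return three rows here, but the first touches only the slice it returns, whereas the second walks the whole list.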

Regards,
Abhilash

On Thu, Sep 3, 2015 at 7:21 PM, Shi, Shaofeng <[email protected]> wrote:

> Hi Abhilash,
>
> “Mandatory” is a property on a row key column. You can see the option in
> the “Advanced” step. If a column is set to “Mandatory=true”, it will be
> moved to the head position of the row key, and that column will not be
> aggregated when calculating the cube. This avoids unnecessary
> calculation and storage. If your query has a where condition on that
> required column, the query performance will be very good.
>
> Let me give a sample. Assume I have a fact table which has the following
> dimensions: date, seller, country.
>
> Among them, date and country are low cardinality columns, and seller is a
> high cardinality column. As almost all my queries have seller specified,
> I set “seller” as mandatory in the row key; this column is then moved to
> the head of the row key and will not be aggregated. The HBase row keys
> will look like:
>
> seller1,cal_dt,country -->
> seller2,cal_dt,country -->
> seller3,cal_dt,country -->
> ...
> sellerN,cal_dt,country -->
>
> seller1,cal_dt -->
> seller2,cal_dt -->
> seller3,cal_dt -->
> ...
> sellerN,cal_dt -->
>
> seller1,country -->
> seller2,country -->
> seller3,country -->
> ...
> sellerN,country -->
>
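
The effect of a mandatory column on the set of dimension combinations can be sketched in plain Python. This is a simplification and not Kylin's actual cuboid scheduler: the function name `cuboids` is hypothetical, and it simply enumerates every subset of the optional dimensions and puts the mandatory column first.

```python
from itertools import combinations

def cuboids(dimensions, mandatory):
    """Yield every dimension combination that contains all mandatory columns,
    with the mandatory columns placed first (the head of the row key)."""
    optional = [d for d in dimensions if d not in mandatory]
    for size in range(len(optional), -1, -1):
        for combo in combinations(optional, size):
            yield tuple(mandatory) + combo

all_dims = ["seller", "cal_dt", "country"]
with_mandatory = list(cuboids(all_dims, mandatory=["seller"]))
without_mandatory = list(cuboids(all_dims, mandatory=[]))

print(with_mandatory)
print(len(with_mandatory), len(without_mandatory))
```

Under these assumptions the mandatory column halves the number of combinations (4 instead of 8, counting the empty combination in the second case). Note that the enumeration also yields the seller-only combination, which the listing above does not show.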
>
> As the seller’s cardinality is high, when a seller value is given the HBase
> scan range will be very small, so the query performance will be good.
>
> If you have SQL queries which have no “seller” specified, this cube
> may not provide the same response time. We would suggest creating
> another cube without the seller dimension. Multiple cubes can co-exist in
> one project, and Kylin will pick the most appropriate cube to serve the
> queries.
>
>
>
> On 9/2/15, 7:41 PM, "Abhilash L L" <[email protected]> wrote:
>
> >Thanks for explanations Hongbin and Li,
> >
> >   We seem to have a decent understanding of hierarchical and derived
> >dimensions.
> >
> >   For hierarchical, the columns that are part of the hierarchy also
> >participate in adding an extra level to cuboids. They become part of the
> >row key as well, and cubing happens on those columns too.
> >
> >   For derived, the query is rewritten to use the join key, and then the
> >in-memory lookup table is used to rewrite the HBase response to values with
> >the derived dimension.
> >
> >   However, there is something called a 'Normal' dimension (only one column
> >at a time), and we are trying to see how it works during query
> >resolution. Is this the mandatory dimension? But since the UI allows only
> >one column per 'Normal' dimension, do we have to create one for each column?
> >
> >
> > Also, a good write-up about the types of dimensions and when to use each
> >type would be really helpful for users who do not want to get into the code
> >to figure things out. Requests seeking clarification might otherwise keep
> >coming up. Just a thought.
> >
> >
> >Regards,
> >Abhilash
> >
> >On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:
> >
> >> Kylin assumes lookup tables to be small (<100MB), so they can fit in
> >> memory. In your model, if order or customer go beyond millions, then they
> >> have to be on the fact table. As Hongbin mentioned, an easy way is to use
> >> a Hive view.
> >>
> >> About analyzing ultra-high cardinality columns (like millions of
> >> customers), we see two common use cases.
> >>
> >> 1. TopN analysis. Returning millions of records is not useful at all;
> >> instead, returning the TopN big customers makes much better sense.
> >> KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
> >> feature under development that aims to respond to TopN queries in
> >> under a second.
> >>
> >> 2. Focused analysis. Looking at a specific customer (e.g. where
> >> customer=A). Such a query can be very fast by creating a cube with
> >> customer as a Mandatory dimension.
> >>
> >> Cheers
> >> Yang
> >>
> >> On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:
> >>
> >> > Kylin handles star schema well, but may encounter issues like OOM in
> >> > your case.
> >> > How many large lookup tables do you have?
> >> > I'm not sure if an eviction policy will help, because any time a SQL
> >> > query involves the lookup table, the lookup table snapshot will have to
> >> > be loaded again (so the snapshots keep swapping in and out).
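
The swapping concern can be illustrated with a small least-recently-used cache in plain Python. This is only a sketch under assumptions: `SnapshotCache` and its `loader` argument are hypothetical names and do not reflect Kylin's SnapshotManager. When the cache holds fewer snapshots than the queries alternate between, every access triggers a reload.

```python
from collections import OrderedDict

class SnapshotCache:
    """Hypothetical LRU cache holding at most `capacity` lookup-table snapshots."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # function: table name -> snapshot
        self.entries = OrderedDict()
        self.loads = 0                  # how many times a snapshot was (re)loaded

    def get(self, table):
        if table in self.entries:
            self.entries.move_to_end(table)   # mark as most recently used
            return self.entries[table]
        self.loads += 1
        snapshot = self.loader(table)
        self.entries[table] = snapshot
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return snapshot

def count_loads(capacity, accesses):
    cache = SnapshotCache(capacity, loader=lambda name: f"rows of {name}")
    for table in accesses:
        cache.get(table)
    return cache.loads

accesses = ["orders", "customers", "orders", "customers"]
print(count_loads(1, accesses))  # every access reloads a snapshot
print(count_loads(2, accesses))  # each snapshot is loaded once
```

With room for only one snapshot the four accesses cause four loads; with room for both, two loads suffice. Eviction bounds memory, but it only helps response time if the working set of lookup tables actually fits.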
> >> >
> >> > One way to solve the problem is to join your tables into a flattened
> >> > table using a Hive view, providing Kylin with a single big fact table.
> >> > And please avoid using a dictionary on high cardinality columns.
> >> >
> >> > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]> wrote:
> >> >
> >> > > Thanks for replying Hongbin,
> >> > >
> >> > >      for 1) we are trying to add some sort of eviction-based cache
> >> > > instead of a map. However, we are still trying to figure out what to
> >> > > do for 3).
> >> > >
> >> > >     What is the general advice? The case here is: I have order
> >> > > details as a fact, with order and customer as dimensions. Each of
> >> > > these will run into many millions. Also, the f-key is not a
> >> > > long/bigint; it's a string which is a combination of our custom
> >> > > columns. Making it a dictionary will not work, as we understand it.
> >> > > Please suggest what approach should be taken.
> >> > >
> >> > > Regards,
> >> > > Abhilash
> >> > >
> >> > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]> wrote:
> >> > >
> >> > > >     for 1) it seems like only the resource path / table desc etc. is
> >> > > > kept in memory, while a new lookupstringtable is created per
> >> > > > query/request and holds onto data for the lifetime of the request.
> >> > > > So once the request is done, it should be garbage collectable?
> >> > > >
> >> > > > /table is just for the Hive table's schema; the lookup table content
> >> > > > is cached in SnapshotManager and is not evicted so far. So if you
> >> > > > have a lot of large lookup tables, this will be a problem.
> >> > > >
> >> > > >
> >> > > > 3) Also the derived filter translator, is there a way to modify
> >>the '
> >> > > > IN_THRESHOLD'  via config file ?
> >> > > >
> >> > > > Are you facing a performance issue with a lot of IN clauses? If so,
> >> > > > please take a look at https://issues.apache.org/jira/browse/KYLIN-740;
> >> > > > the patch will be merged into the next release.
> >> > > >
> >> > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]> wrote:
> >> > > >
> >> > > > > Sorry for the confusion,
> >> > > > >
> >> > > > >     for 1) it seems like only the resource path / table desc etc. is
> >> > > > > kept in memory, while a new lookupstringtable is created per
> >> > > > > query/request and holds onto data for the lifetime of the request.
> >> > > > > So once the request is done, it should be garbage collectable?
> >> > > > >
> >> > > > >
> >> > > > > 3) Also the derived filter translator, is there a way to modify
> >> the '
> >> > > > > IN_THRESHOLD'  via config file ?
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > Regards,
> >> > > > > Abhilash
> >> > > > >
> >> > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]> wrote:
> >> > > > >
> >> > > > > > Hello,
> >> > > > > >
> >> > > > > >     We started noticing that the Kylin Tomcat server is taking a
> >> > > > > > lot of RAM. It even hit a limit of 10GB.
> >> > > > > >
> >> > > > > >     After spending some time going over the code, it seems like
> >> > > > > > the cube enumerator is not storing anything in memory. But the
> >> > > > > > lookup table enumerator seems to be loading all records and
> >> > > > > > storing them in memory.
> >> > > > > >
> >> > > > > >     1) What happens when there are a lot of projects defined and
> >> > > > > > we end up with tons of lookup tables across them? Do they get
> >> > > > > > swapped out automatically? I am not able to track where eviction
> >> > > > > > is happening. The snapshot manager has a 'removeSnapshot', but
> >> > > > > > its intent seems different to me.
> >> > > > > >
> >> > > > > >     2) How do we handle a really high cardinality dimension? E.g.
> >> > > > > > if I have sales as a fact and customers as a dimension, there
> >> > > > > > will be millions of customers. A store is a good candidate to
> >> > > > > > keep in memory, but customers are not. What's the recommended
> >> > > > > > setting while creating the cube to handle such a case?
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Abhilash
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > *Bin Mahone | 马洪宾*
> >> > > > Apache Kylin: http://kylin.io
> >> > > > Github: https://github.com/binmahone
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > *Bin Mahone | 马洪宾*
> >> > Apache Kylin: http://kylin.io
> >> > Github: https://github.com/binmahone
> >> >
> >>
>
>
