Kylin assumes a lookup table to be small (<100MB) so that it can fit in
memory.  In your model, if order or customer goes beyond millions of rows,
those columns have to live on the fact table.  Like Hongbin mentioned, an
easy way is to use a Hive view.
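
For illustration only, here is a minimal sketch of that flattening through
Hive JDBC; every table, column and host name in it is an assumption about
your model, not something Kylin prescribes.

// Sketch only: flatten the star schema into one wide Hive view so the
// multi-million-row "lookup" tables become plain columns on the fact
// table Kylin scans. All table/column/host names are made up.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FlattenStarSchema {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; hive-jdbc must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // One wide view: point Kylin at flat_order_detail as the fact
            // table instead of joining multi-million-row lookup tables.
            stmt.execute(
                "CREATE VIEW IF NOT EXISTS flat_order_detail AS "
                + "SELECT d.*, o.order_date, o.order_status, "
                + "c.customer_name, c.customer_region "
                + "FROM order_detail d "
                + "JOIN orders o ON d.order_key = o.order_key "
                + "JOIN customer c ON d.customer_key = c.customer_key");
        }
    }
}

With the cube's fact table pointed at the view (flat_order_detail here),
the order/customer attributes become ordinary fact columns and no large
lookup snapshot needs to be loaded.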

About analyzing ultra-high-cardinality columns (like millions of
customers), we see two common use cases.

1. TopN analysis.  Returning millions of records is not useful at all;
instead, returning the TopN biggest customers makes much better sense.
KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
feature under development that aims to answer TopN queries in sub-second
time.

2. Focused analysis.  Looking at a specific customer (e.g. where
customer=A).  Such a query can be very fast if you create a cube with
customer as a Mandatory dimension.  (A sketch of both query shapes follows
below.)
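
As a rough sketch, the two query shapes look like this through Kylin's
JDBC driver; the host, project, credentials and column names are
placeholders, and flat_order_detail is the hypothetical view from above.

// Sketch of the two query patterns, issued through Kylin's JDBC driver.
// Host, project, credentials and column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HighCardinalityQueries {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.kylin.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:kylin://kylin-host:7070/my_project", "ADMIN", "KYLIN");
             Statement stmt = conn.createStatement()) {

            // 1. TopN: only the biggest customers come back, never millions of rows.
            try (ResultSet topN = stmt.executeQuery(
                    "SELECT customer_key, SUM(sales_amount) AS total "
                    + "FROM flat_order_detail GROUP BY customer_key "
                    + "ORDER BY total DESC LIMIT 100")) {
                while (topN.next()) {
                    System.out.println(topN.getString(1) + "\t" + topN.getDouble(2));
                }
            }

            // 2. Focused: customer is always filtered, which is what makes a
            //    Mandatory customer dimension in the cube pay off.
            try (ResultSet focused = stmt.executeQuery(
                    "SELECT order_key, SUM(sales_amount) AS total "
                    + "FROM flat_order_detail WHERE customer_key = 'A' "
                    + "GROUP BY order_key")) {
                while (focused.next()) {
                    System.out.println(focused.getString(1) + "\t" + focused.getDouble(2));
                }
            }
        }
    }
}

In both shapes the query server never has to materialize millions of
customer rows.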

Cheers
Yang

On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:

> Kylin handles star schemas well, but you may encounter issues like OOM in
> your case.
> How many large lookup tables do you have?
> I'm not sure an eviction policy will help, because any time a SQL query
> involves the lookup table, the lookup table snapshot has to be loaded again
> (so the snapshots keep swapping in and out).
>
> One way to solve the problem is to join your tables into a flattened table
> using a Hive view, providing Kylin with a single big fact table. And please
> avoid using a dictionary on high-cardinality columns.
>
> On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]>
> wrote:
>
> > Thanks for replying Hongbin,
> >
> >      for 1) we are trying to add some sort of eviction-based cache
> > instead of a map. However, we are still trying to figure out what to do
> > for 3).
> >
> >     What is the general advice?  The case here is: I have order details
> > as a fact table, and order as a dimension, and also customer. Now each of
> > these will run into many millions of rows.  Also, the foreign key is not
> > a long/bigint; it is a string which is a combination of our custom
> > columns. Making it a dictionary will not work, as we understand. Please
> > suggest what approach should be taken.
> >
> > Regards,
> > Abhilash
> >
> > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]> wrote:
> >
> > >     for 1) ..  it seems like only the resource path / table desc etc.
> > > is kept in memory, while a new LookupStringTable is created per
> > > query/request which holds onto the data for the lifetime of the
> > > request.  So once the request is done, it should be garbage
> > > collectable?
> > >
> > > /table is just for the Hive table's schema; the lookup table content is
> > > cached in SnapshotManager and will not be evicted so far. So if you have
> > > a lot of large lookup tables, this will be a problem.
> > >
> > >
> > > 3) Also, for the derived filter translator, is there a way to modify
> > > the 'IN_THRESHOLD' via a config file?
> > >
> > > Are you facing performance issues with a lot of IN clauses? If so,
> > > please take a look at https://issues.apache.org/jira/browse/KYLIN-740;
> > > the patch will be merged into the next release.
> > >
> > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]>
> > > wrote:
> > >
> > > > Sorry for the confusion,
> > > >
> > > >     for 1) ..  it seems like only the resource path / table desc
> > > > etc. is kept in memory, while a new LookupStringTable is created per
> > > > query/request which holds onto the data for the lifetime of the
> > > > request.  So once the request is done, it should be garbage
> > > > collectable?
> > > >
> > > >
> > > > 3) Also, for the derived filter translator, is there a way to modify
> > > > the 'IN_THRESHOLD' via a config file?
> > > >
> > > > Regards,
> > > > Abhilash
> > > >
> > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]
> >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > >     We started noticing that the Kylin Tomcat server is taking a
> > > > > lot of RAM.  It even hit a limit of 10GB.
> > > > >
> > > > >     After spending some time going over the code, it seems like the
> > > > > cube enumerator is not storing anything in memory, but the lookup
> > > > > table enumerator seems to be loading all records and keeping them in
> > > > > memory.
> > > > >
> > > > >     1) What happens when there are a lot of projects defined and we
> > > > > end up with tons of lookup tables across them?  Do they get swapped
> > > > > out automatically?  I am not able to track where eviction is
> > > > > happening.  The snapshot manager has a 'removeSnapshot', but its
> > > > > intent seems different to me.
> > > > >
> > > > >     2) How do we handle a really high-cardinality dimension?  E.g.
> > > > > if I have sales as a fact and customers as a dimension, there will
> > > > > be millions of customers.  A store is a good candidate to keep in
> > > > > memory, but customers are not.  What is the recommended setting
> > > > > while creating the cube to handle such a case?
> > > > >
> > > > > Regards,
> > > > > Abhilash
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > *Bin Mahone | 马洪宾*
> > > Apache Kylin: http://kylin.io
> > > Github: https://github.com/binmahone
> > >
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>
