Re: Lookup Table Enumerator high memory

Luke Han Wed, 02 Sep 2015 07:02:59 -0700

Cube, Hierarchy Dimension and Measure are very common terms in the DW/BI area;
we assume the "cube modeler" has some experience with them :)


But of course, we should enhance Kylin's terminology page:
http://kylin.incubator.apache.org/docs/gettingstarted/terminology.html

Meanwhile, would like to recommend this one for reference:
http://www.kimballgroup.com/2008/10/maintaining-dimension-hierarchies/

Hope these are of some help :)

Thanks.



Best Regards!
---------------------

Luke Han

On Wed, Sep 2, 2015 at 7:41 PM, Abhilash L L <[email protected]> wrote:

> Thanks for explanations Hongbin and Li,
>
>    We seem to have a decent understanding of hierarchical and derived
> dimensions.
>
>    For hierarchical, the columns that are part of the hierarchy also
> participate in adding an extra level to cuboids. They become part of the
> rowkey as well, and cubing happens on those columns too.
>
>    For derived, the query is rewritten to use the join key, and then the
> in-memory lookup table is used to rewrite the HBase response into values of
> the derived dimension.
>
>    However there is something called a 'Normal' dimension (only one column
> at a time), and we are trying to see how it works during query
> resolution. Is this the mandatory dimension ? But since the UI allows only
> one column per 'Normal' dimension, do we have to create one for each column ?
>
>
>  Also, a good write-up about the types of dimensions and when to use each
> type will be really helpful for users who do not want to get into the code to
> figure things out. Otherwise, clarification requests like this might keep
> coming up. Just a thought.
>
>
> Regards,
> Abhilash
>
> On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:
>
> > Kylin assumes lookup tables to be small (<100MB), so that they can fit in
> > memory. In your model, if order or customer go beyond millions, then they
> > have to be on the fact table.  Like Hongbin mentioned, an easy way is to
> > use a Hive view.
> >
> > About analyzing ultra-high cardinality columns (like millions of
> > customers), we see two common use cases.
> >
> > 1. TopN analysis.  Returning millions of records is not useful at all;
> > instead, returning the TopN biggest customers makes much better sense.
> > KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
> > feature under development that aims to respond to TopN queries in
> > sub-second time.
> >
> > 2. Focused analysis.  Looking at a specific customer (e.g. where
> > customer=A).  Such a query can be very fast by creating a cube with
> > customer as a Mandatory dimension.
> >
> > Cheers
> > Yang
> >
> > On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:
> >
> > > Kylin handles star schema well, but may encounter issues like OOM in
> > > your case.
> > > How many large lookup tables do you have?
> > > I'm not sure if an eviction policy will help, because any time a SQL
> > > query involves the lookup table, the lookup table snapshot will have to
> > > be loaded again (so the snapshots keep swapping in and out).
> > >
> > > One way to solve the problem is to join your tables into a flattened
> > > table using a Hive view, providing Kylin with a single big fact table.
> > > And please take care to avoid using a dictionary on high-cardinality
> > > columns.
> > >
> > > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]>
> > > wrote:
> > >
> > > > Thanks for replying Hongbin,
> > > >
> > > >      for 1) we are trying to add some sort of eviction-based cache
> > > > instead of a map. However, we are still trying to figure out what to
> > > > do for 3).
> > > >
> > > >     What is the general advice ? The case here is: I have order
> > > > details as a fact, and order and customer as dimensions. Each of
> > > > these will run into many millions.  Also, the f-key is not a
> > > > long/bigint; it's a string which is a combination of our custom
> > > > columns. Making it a dictionary will not work, as we understand.
> > > > Please suggest what approach should be taken.
> > > >
> > > > Regards,
> > > > Abhilash
> > > >
> > > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]> wrote:
> > > >
> > > > >     for 1) ..  seems like only the resource path / table desc etc is
> > > > > kept in memory, while a new lookupstringtable is created per
> > > > > query/request which holds onto data for the lifetime of the request.
> > > > > So once the request is done, it should be garbage collectable ?
> > > > >
> > > > > /table is just for the hive table's schema; the lookup table content
> > > > > is cached in SnapshotManager and is not evicted so far. So if you
> > > > > have a lot of large lookup tables this will be a problem.
> > > > >
> > > > >
> > > > > 3) Also the derived filter translator, is there a way to modify the
> > > > > 'IN_THRESHOLD'  via config file ?
> > > > >
> > > > > Are you facing performance issues with a lot of IN clauses? If so,
> > > > > please take a look at https://issues.apache.org/jira/browse/KYLIN-740;
> > > > > the patch will be merged into the next release.
> > > > >
> > > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Sorry for the confusion,
> > > > > >
> > > > > >     for 1) ..  seems like only the resource path / table desc etc is
> > > > > > kept in memory, while a new lookupstringtable is created per
> > > > > > query/request which holds onto data for the lifetime of the request.
> > > > > > So once the request is done, it should be garbage collectable ?
> > > > > >
> > > > > >
> > > > > > 3) Also the derived filter translator, is there a way to modify the
> > > > > > 'IN_THRESHOLD'  via config file ?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Abhilash
> > > > > >
> > > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > >     We started noticing that the Kylin tomcat server is taking a
> > > > > > > lot of RAM. It even hit a limit of 10GB.
> > > > > > >
> > > > > > >     After spending some time going over the code, it seems like
> > > > > > > the cube enumerator is not storing anything in memory. But the
> > > > > > > lookup table enumerator seems to be loading all records and
> > > > > > > storing them in memory.
> > > > > > >
> > > > > > >     1) What happens when there are a lot of projects defined and
> > > > > > > we end up with tons of lookup tables across them ? Do they get
> > > > > > > swapped out automatically ?  I am not able to track where
> > > > > > > eviction is happening. The snapshot manager has a
> > > > > > > 'removeSnapshot' but its intent seems different to me.
> > > > > > >
> > > > > > >     2) How do we handle a really high cardinality dimension ?
> > > > > > > E.g. if I have sales as a fact and customers as a dimension,
> > > > > > > there will be millions of customers. A store is a good candidate
> > > > > > > to keep in memory, but customers are not. What's the recommended
> > > > > > > setting while creating the cube to handle such a case ?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Abhilash
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > *Bin Mahone | 马洪宾*
> > > > > Apache Kylin: http://kylin.io
> > > > > Github: https://github.com/binmahone
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > *Bin Mahone | 马洪宾*
> > > Apache Kylin: http://kylin.io
> > > Github: https://github.com/binmahone
> > >
> >
>
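
Note: the eviction-based cache discussed in this thread (replacing the plain
map that holds lookup table snapshots) could look roughly like the sketch
below. This is only an illustration of the idea, not Kylin code: the
SnapshotCache class, the fake_loader function, the row-count budget and the
table sizes are all made-up assumptions. It also shows the concern raised
about eviction: when the tables do not fit the budget, each query reloads a
snapshot.

```python
from collections import OrderedDict
from typing import Callable, List

# Hypothetical type for illustration; this is not a Kylin class.
Snapshot = List[List[str]]  # one lookup table held as rows of strings


class SnapshotCache:
    """LRU cache of lookup-table snapshots, bounded by total row count."""

    def __init__(self, max_rows: int, loader: Callable[[str], Snapshot]):
        self.max_rows = max_rows
        self.loader = loader          # (re)loads a snapshot by resource path
        self.entries: "OrderedDict[str, Snapshot]" = OrderedDict()
        self.total_rows = 0
        self.loads = 0                # number of loads, to expose thrashing

    def get(self, path: str) -> Snapshot:
        if path in self.entries:
            self.entries.move_to_end(path)   # mark as most recently used
            return self.entries[path]
        snapshot = self.loader(path)
        self.loads += 1
        self.entries[path] = snapshot
        self.total_rows += len(snapshot)
        # Evict least recently used snapshots, never the one just loaded.
        while self.total_rows > self.max_rows and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.total_rows -= len(evicted)
        return snapshot


def fake_loader(path: str) -> Snapshot:
    # Made-up table sizes (rows) standing in for real snapshot files.
    sizes = {"store": 3, "order": 6, "customer": 8}
    return [[path, str(i)] for i in range(sizes[path])]


cache = SnapshotCache(max_rows=10, loader=fake_loader)
for table in ["store", "customer", "order", "customer", "order"]:
    cache.get(table)
print(cache.loads, list(cache.entries))
```

With these made-up sizes, "order" and "customer" cannot both fit in the
budget, so every one of the five requests triggers a load. That is the
swapping-in-swapping-out behaviour described above, and it is why moving
very large tables onto the fact table (for example through a Hive view) was
suggested instead.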
