Re: Lookup Table Enumerator high memory

Abhilash L L Wed, 02 Sep 2015 21:01:51 -0700

Regarding the suggestion of 'flattening' the customer into the order table
(fact table), we need a few clarifications/suggestions.


Let's say we have flattened the customers into the fact table.

To get the number of customers, we can create a normal dimension on the
'customer id' in the fact table. Customer id becomes part of the rowkey.

How do we get attributes like 'name' or 'age' for a given customer id?

   Adding a dummy measure on every customer column doesn't make sense when
the query has no customer in the group by. It also leads to a lot of
duplicate data in HBase. If we try to query without a group by, we get the
error '<colname> does not exist in row key desc'.

   We can't do anything similar to a 'derived dimension' on a fact table, as
that is only possible on lookup tables. A derived dimension would also create
a snapshot etc.
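
For comparison, here is a rough sketch of how we understand derived-dimension
resolution on a lookup table, based on the description quoted further down in
this thread: the cube is queried by the join key, and the in-memory snapshot is
then used to map the result to the derived column. This is not Kylin code. The
function name, column names and data are all made up for illustration.

```python
# Hypothetical sketch, not Kylin code: all names and data below are invented.

# Rows as the cube might return them: aggregated by the join key only.
cube_rows = [
    {"customer_id": "C1", "order_total": 120.0},
    {"customer_id": "C2", "order_total": 75.5},
    {"customer_id": "C3", "order_total": 30.0},
]

# In-memory snapshot of the lookup table, keyed by the join key.
customer_snapshot = {
    "C1": {"name": "Asha", "age": 34},
    "C2": {"name": "Ravi", "age": 34},
    "C3": {"name": "Meena", "age": 52},
}


def resolve_derived(rows, snapshot, key, derived_cols, measure):
    """Replace the join key with derived columns and re-aggregate the measure."""
    totals = {}
    for row in rows:
        attrs = snapshot[row[key]]  # lookup by join key in the snapshot
        group = tuple(attrs[col] for col in derived_cols)
        totals[group] = totals.get(group, 0.0) + row[measure]
    return totals


# Rough equivalent of "select age, sum(order_total) ... group by age".
by_age = resolve_derived(
    cube_rows, customer_snapshot, "customer_id", ["age"], "order_total"
)
print(by_age)
```

The snapshot has to hold every customer row in memory, which is what stops
working with millions of customers. On a flattened fact table there is no
snapshot to map through, which is the gap we are asking about.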


Regards,
Abhilash

On Wed, Sep 2, 2015 at 7:48 PM, Abhilash L L <[email protected]> wrote:

> Hi Luke,
>
>  I was mainly referring to the behaviour of the hierarchical, derived and
> normal dimension types within Kylin. For derived and normal especially, the
> effect of using them is not very apparent, since there are no changes in
> the row key design etc.
>
> Regards,
> Abhilash
>
> On Wed, Sep 2, 2015 at 7:32 PM, Luke Han <[email protected]> wrote:
>
>> Cube, Hierarchy Dimension and Measure are very common in the DW/BI area;
>> we suppose the "cube modeler" has experience with them :)
>>
>> But of course, we should enhance Kylin's terminology page:
>> http://kylin.incubator.apache.org/docs/gettingstarted/terminology.html
>>
>> Meanwhile, would like to recommend this one for reference:
>> http://www.kimballgroup.com/2008/10/maintaining-dimension-hierarchies/
>>
>> Hope these help a little :)
>>
>> Thanks.
>>
>>
>>
>> Best Regards!
>> ---------------------
>>
>> Luke Han
>>
>> On Wed, Sep 2, 2015 at 7:41 PM, Abhilash L L <[email protected]> wrote:
>>
>> > Thanks for explanations Hongbin and Li,
>> >
>> >    We seem to have a decent understanding of hierarchical and derived
>> > dimensions.
>> >
>> >    For hierarchical, the columns that are part of the hierarchy also
>> > participate in adding an extra level to the cuboids. They become part of
>> > the rowkey, and cubing happens on those columns as well.
>> >
>> >    For derived, the query is rewritten to use the join key, and then the
>> > in-memory lookup table is used to rewrite the HBase response into values
>> > of the derived dimension.
>> >
>> >    However, there is something called a 'Normal' dimension (only one
>> > column at a time), and we are trying to see how it works during query
>> > resolution. Is this the mandatory dimension? Since the UI allows only
>> > one column per 'Normal' dimension, do we have to create one for each
>> > column?
>> >
>> >
>> >  Also, a good write-up about the types of dimensions and when to use
>> > each type would be really helpful for users who do not want to get into
>> > the code to figure things out. Otherwise the requests for clarification
>> > might keep coming up. Just a thought.
>> >
>> >
>> > Regards,
>> > Abhilash
>> >
>> > On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:
>> >
>> > > Kylin assumes lookup tables to be small (<100MB), so that they fit in
>> > > memory. In your model, if order or customer go beyond millions, then
>> > > they have to be on the fact table.  As Hongbin mentioned, an easy way
>> > > is to use a Hive view.
>> > >
>> > > About analyzing ultra-high cardinality columns (like millions of
>> > > customers), we see two common use cases.
>> > >
>> > > 1. TopN analysis.  Returning millions of records is not useful at all;
>> > > instead, returning the TopN big customers makes much better sense.
>> > > KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
>> > > feature under development that aims to respond to TopN queries in
>> > > under a second.
>> > >
>> > > 2. Focused analysis.  Looking at a specific customer (e.g. where
>> > > customer=A).  Such a query can be very fast if you create a cube with
>> > > customer as a Mandatory dimension.
>> > >
>> > > Cheers
>> > > Yang
>> > >
>> > > On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]> wrote:
>> > >
>> > > > Kylin handles star schema well, but may encounter issues like OOM in
>> > > > your case.
>> > > > How many large lookup tables do you have?
>> > > > I'm not sure an eviction policy will help, because any time a SQL
>> > > > query involves the lookup table, the lookup table snapshot will have
>> > > > to be loaded again (so the snapshots keep swapping in and out).
>> > > >
>> > > > One way to solve the problem is to join your tables into a flattened
>> > > > table using a Hive view, providing Kylin with a single big fact table.
>> > > > And please take care to avoid using a dictionary on high-cardinality
>> > > > columns.
>> > > >
>> > > > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]> wrote:
>> > > >
>> > > > > Thanks for replying Hongbin,
>> > > > >
>> > > > >      For 1), we are trying to add some sort of eviction-based cache
>> > > > > instead of a map. However, we are still trying to figure out what
>> > > > > to do for 3).
>> > > > >
>> > > > >     What is the general advice? The case here is: I have order
>> > > > > details as a fact, and order and customer as dimensions. Each of
>> > > > > these will run into many millions. Also, the f-key is not a
>> > > > > long/bigint; it's a string which is a combination of our custom
>> > > > > columns. Making it a dictionary will not work, as we understand it.
>> > > > > Please suggest what approach should be taken.
>> > > > >
>> > > > > Regards,
>> > > > > Abhilash
>> > > > >
>> > > > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]> wrote:
>> > > > >
>> > > > > >     for 1) ..  seems like only the resource path / table desc etc
>> > > > > > is kept in memory, while a new lookupstringtable is created per
>> > > > > > query/request, which holds onto data for the lifetime of the
>> > > > > > request.  So once the request is done, it should be garbage
>> > > > > > collectable ?
>> > > > > >
>> > > > > > /table is just for the Hive table's schema; the lookup table
>> > > > > > content is cached in SnapshotManager and so far it is never
>> > > > > > evicted. So if you have a lot of large lookup tables, this will be
>> > > > > > a problem.
>> > > > > >
>> > > > > >
>> > > > > > 3) Also the derived filter translator, is there a way to modify
>> > > > > > the 'IN_THRESHOLD'  via config file ?
>> > > > > >
>> > > > > > Are you facing performance issues with a lot of IN clauses? If so,
>> > > > > > please take a look at https://issues.apache.org/jira/browse/KYLIN-740;
>> > > > > > the patch will be merged into the next release.
>> > > > > >
>> > > > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]> wrote:
>> > > > > >
>> > > > > > > Sorry for the confusion,
>> > > > > > >
>> > > > > > >     For 1), it seems like only the resource path / table desc
>> > > > > > > etc. is kept in memory, while a new lookupstringtable is created
>> > > > > > > per query/request, which holds onto data for the lifetime of the
>> > > > > > > request.  So once the request is done, it should be garbage
>> > > > > > > collectable?
>> > > > > > >
>> > > > > > >
>> > > > > > > 3) Also, for the derived filter translator, is there a way to
>> > > > > > > modify the 'IN_THRESHOLD' via a config file?
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > Regards,
>> > > > > > > Abhilash
>> > > > > > >
>> > > > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > >     We started noticing that the Kylin tomcat server is taking a
>> > > > > > > > lot of RAM. It even hit a limit of 10GB.
>> > > > > > > >
>> > > > > > > >     After spending some time going over the code, it seems like
>> > > > > > > > the cube enumerator is not storing anything in memory. But the
>> > > > > > > > lookup table enumerator seems to be loading all records and
>> > > > > > > > storing them in memory.
>> > > > > > > >
>> > > > > > > >     1) What happens when there are a lot of projects defined and
>> > > > > > > > we end up with tons of lookup tables across them? Do they get
>> > > > > > > > swapped out automatically? I am not able to track where eviction
>> > > > > > > > is happening. The snapshot manager has a 'removeSnapshot', but
>> > > > > > > > its intent seems different to me.
>> > > > > > > >
>> > > > > > > >     2) How do we handle a really high-cardinality dimension? E.g.
>> > > > > > > > if I have sales as a fact and customers as a dimension, there
>> > > > > > > > will be millions of customers. A store is a good candidate to
>> > > > > > > > keep in memory, but customers are not. What's the recommended
>> > > > > > > > setting while creating the cube to handle such a case?
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > > Abhilash
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Regards,
>> > > > > >
>> > > > > > *Bin Mahone | 马洪宾*
>> > > > > > Apache Kylin: http://kylin.io
>> > > > > > Github: https://github.com/binmahone
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Regards,
>> > > >
>> > > > *Bin Mahone | 马洪宾*
>> > > > Apache Kylin: http://kylin.io
>> > > > Github: https://github.com/binmahone
>> > > >
>> > >
>> >
>>
>
>
