Re: Lookup Table Enumerator high memory

Shi, Shaofeng Thu, 03 Sep 2015 06:52:18 -0700

Hi Abhilash,

“Mandatory” is a property of a row key column; you can find the option in
the “Advanced” step. If a column is set to “Mandatory=true”, it is moved
to the head position of the row key and is not aggregated when the cube is
calculated, so every combination stored in the cube keeps that column. This
avoids unnecessary calculation and storage. If your query has a where
condition on that mandatory column, the query performance will be very good.


Let me give an example. Assume I have a fact table with the following
dimensions: date (cal_dt in the row keys below), seller, country.

Among them, date and country are low-cardinality columns, and seller is a
high-cardinality column. As almost all my queries specify a seller, I set
“seller” as mandatory in the row key. This column is then moved to the head
of the row key and is not aggregated. The HBase row keys will look like:

seller1,cal_dt,country ->
seller2,cal_dt,country ->
seller3,cal_dt,country ->
...
sellerN,cal_dt,country ->

seller1,cal_dt ->
seller2,cal_dt ->
seller3,cal_dt ->
...
sellerN,cal_dt ->

seller1,country ->
seller2,country ->
seller3,country ->
...
sellerN,country ->
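
To make the effect of a mandatory column on the set of combinations concrete,
here is a small Python sketch. It is only an illustration, not Kylin code: the
function name cuboids() and the ordering are my own assumptions. It enumerates
dimension combinations and forces the mandatory column to the front of each
one. Note that it also yields the seller-only combination, which the listing
above does not spell out.

```python
from itertools import combinations


def cuboids(dimensions, mandatory=()):
    """Enumerate dimension combinations with mandatory columns always first.

    Hypothetical helper for illustration; this is not Kylin's API.
    """
    mandatory = list(mandatory)
    optional = [d for d in dimensions if d not in mandatory]
    result = []
    # Largest combinations first, mirroring the order of the listing above.
    for size in range(len(optional), -1, -1):
        for combo in combinations(optional, size):
            cuboid = tuple(mandatory) + combo
            if cuboid:  # skip the empty combination when nothing is mandatory
                result.append(cuboid)
    return result


dims = ["cal_dt", "seller", "country"]
with_mandatory = cuboids(dims, mandatory=["seller"])
without_mandatory = cuboids(dims)

for c in with_mandatory:
    print(",".join(c))
print(len(with_mandatory), "combinations with seller mandatory,",
      len(without_mandatory), "without")
```

With seller mandatory, only the combinations that contain seller remain, which
is where the saving in calculation and storage comes from.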


As the seller’s cardinality is high, the HBase scan range for a given
seller value is very small, so the query performance will be good.
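
A rough way to picture the scan range is below. This is a Python sketch that
models row keys as sorted tuples; it does not use the HBase client, and the
seller names, dates, and countries are made-up sample values. Because seller is
the first element of the key, all rows for one seller are contiguous and can be
located with two binary searches.

```python
from bisect import bisect_left
from itertools import product

# Made-up sample data: 1000 sellers x 3 dates x 2 countries.
sellers = [f"seller{i:04d}" for i in range(1, 1001)]
dates = ["2015-09-01", "2015-09-02", "2015-09-03"]
countries = ["CN", "US"]

# Row keys sorted the way a store ordered by row key would keep them.
row_keys = sorted(product(sellers, dates, countries))


def scan_seller(keys, seller):
    """Return the contiguous slice of keys whose first column equals seller."""
    start = bisect_left(keys, (seller,))
    # seller + "\0" is the smallest string greater than seller, so it marks
    # the exclusive end of the range.
    stop = bisect_left(keys, (seller + "\0",))
    return keys[start:stop]


hits = scan_seller(row_keys, "seller0042")
print(len(hits), "of", len(row_keys), "rows scanned")
```

The higher the seller cardinality, the smaller the fraction of the table that
one seller's range covers.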

If you have SQL queries that do not specify “seller”, this cube may not
provide the same response time. In that case we suggest creating another
cube without the seller dimension. Multiple cubes can coexist in one
project, and Kylin will pick the most appropriate cube to serve each
query.
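
Just to illustrate the idea of two cubes serving different queries, here is a
toy Python sketch. The selection rule (take the smallest cube that covers all
columns of the query) is my own simplifying assumption and is not Kylin's
actual routing logic; the cube names are hypothetical as well.

```python
def pick_cube(cubes, query_columns):
    """Pick the smallest cube whose dimensions cover every query column.

    Toy rule for illustration only; not how Kylin actually chooses a cube.
    """
    needed = set(query_columns)
    candidates = [(name, dims) for name, dims in cubes.items()
                  if needed <= set(dims)]
    if not candidates:
        return None
    return min(candidates, key=lambda item: len(item[1]))[0]


# Hypothetical cube definitions for the example in this mail.
cubes = {
    "cube_with_seller": ["seller", "cal_dt", "country"],
    "cube_without_seller": ["cal_dt", "country"],
}

print(pick_cube(cubes, ["seller", "cal_dt"]))   # needs the seller column
print(pick_cube(cubes, ["cal_dt", "country"]))  # no seller in the query
```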



On 9/2/15, 7:41 PM, "Abhilash L L" <[email protected]> wrote:

>Thanks for explanations Hongbin and Li,
>
>   We seem to have a decent understanding of hierarchical and derived
>dimensions.
>
>   For hierarchical, the columns that are part of the hierarchy also
>participate in adding an extra level to cuboids. They become part of the
>rowkey, and cubing happens on those columns as well.
>
>   For derived, the query is rewritten to use the join key, and then the
>in-memory lookup table is used to rewrite the HBase response to values with
>the derived dimension.
>
>   However there is something called a 'Normal' dimension (only one column
>at a time), and we are trying to see how it works during query
>resolution. Is this the mandatory dimension ? But since the UI allows only
>one column per 'Normal' dimension, do we have to create one for each column ?
>
>
> Also, a good write-up about the types of dimensions and when to use each
>type will be really helpful for users who do not want to get into the code
>to figure out stuff. The clarification-seeking requests might keep coming up
>as well. Just a thought.
>
>
>Regards,
>Abhilash
>
>On Wed, Sep 2, 2015 at 2:57 PM, Li Yang <[email protected]> wrote:
>
>> Kylin assumes lookup tables to be small (<100MB), so they can fit in
>> memory. In your model, if order or customer go beyond millions, then they
>> have to be on the fact table.  Like Hongbin mentioned, an easy way is to
>> use a Hive view.
>>
>> About analyzing ultra-high cardinality columns (like millions of
>> customers), we see two common use cases.
>>
>> 1. TopN analysis.  Returning millions of records is not useful at all;
>> instead, returning the TopN big customers makes much better sense.
>> KYLIN-943 <https://issues.apache.org/jira/browse/KYLIN-943> is a new
>> feature under development that aims to respond to TopN queries in
>> subsecond.
>>
>> 2. Focused analysis.  Looking at a specific customer (e.g. where
>> customer=A).  Such a query can be very fast by creating a cube with
>> customer as a Mandatory dimension.
>>
>> Cheers
>> Yang
>>
>> On Tue, Sep 1, 2015 at 11:23 PM, hongbin ma <[email protected]>
>> wrote:
>>
>> > Kylin handles star schema well, but may encounter issues like OOM in
>> > your case.
>> > How many large lookup tables do you have?
>> > I'm not sure if an eviction policy will help, because any time a SQL
>> > query involves the lookup table, the lookup table snapshot will have to
>> > be loaded again (so the snapshots keep swapping in and out).
>> >
>> > One way to solve the problem is to join your tables into a flattened
>> > table using a Hive view, providing Kylin with a single big fact table.
>> > And please avoid using a dictionary on high-cardinality columns.
>> >
>> > On Tue, Sep 1, 2015 at 11:16 PM, Abhilash L L <[email protected]>
>> > wrote:
>> >
>> > > Thanks for replying Hongbin,
>> > >
>> > >      for 1) we are trying to add some sort of eviction-based cache
>> > > instead of a map. However, we are still trying to figure out what to
>> > > do for 3).
>> > >
>> > >     What is the general advice ? The case here is ..  I have order
>> > > details as a fact and order as a dimension and also customer. Now each
>> > > of these will run into many millions.  Also, the f-key is not a
>> > > long/bigint, it's a string which is a combination of our custom
>> > > columns. Making it a dictionary will not work as we understand. Please
>> > > suggest what approach should be taken.
>> > >
>> > > Regards,
>> > > Abhilash
>> > >
>> > > On Tue, Sep 1, 2015 at 4:37 PM, hongbin ma <[email protected]>
>> > > wrote:
>> > >
>> > > >     for 1) ..  seems like only the resource path / table desc etc is
>> > > > kept in memory while a new lookupstringtable is created per
>> > > > query/request which holds onto data for the lifetime of the request.
>> > > > So once the request is done, it should be garbage collectable ?
>> > > >
>> > > > /table is just for the Hive table's schema; the lookup table content
>> > > > is cached in SnapshotManager and it is not evicted so far. So if you
>> > > > have a lot of large lookup tables, this will be a problem.
>> > > >
>> > > >
>> > > > 3) Also the derived filter translator, is there a way to modify the
>> > > > 'IN_THRESHOLD'  via config file ?
>> > > >
>> > > > Are you facing performance issues with a lot of IN clauses? If so,
>> > > > please take a look at https://issues.apache.org/jira/browse/KYLIN-740;
>> > > > the patch will be merged into the next release.
>> > > >
>> > > > On Mon, Aug 31, 2015 at 9:54 PM, Abhilash L L <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Sorry for the confusion,
>> > > > >
>> > > > >     for 1) ..  seems like only the resource path / table desc etc is
>> > > > > kept in memory while a new lookupstringtable is created per
>> > > > > query/request which holds onto data for the lifetime of the request.
>> > > > > So once the request is done, it should be garbage collectable ?
>> > > > >
>> > > > >
>> > > > > 3) Also the derived filter translator, is there a way to modify the
>> > > > > 'IN_THRESHOLD'  via config file ?
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > Regards,
>> > > > > Abhilash
>> > > > >
>> > > > > On Mon, Aug 31, 2015 at 7:05 PM, Abhilash L L <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > Hello,
>> > > > > >
>> > > > > >     We started noticing that the Kylin tomcat server is taking a
>> > > > > > lot of RAM. It even hit a limit of 10GB.
>> > > > > >
>> > > > > >     After spending some time going over the code, it seems like
>> > > > > > the cube enumerator is not storing anything in memory. But the
>> > > > > > Lookup table enumerator seems to be loading all records and
>> > > > > > storing them in memory.
>> > > > > >
>> > > > > >     1) What happens when there are a lot of projects defined and
>> > > > > > we end up with tons of lookup tables across them? Do they get
>> > > > > > swapped out automatically ?  I am not able to track where eviction
>> > > > > > is happening. The snapshot manager has a 'removeSnapshot' but its
>> > > > > > intent seems different to me.
>> > > > > >
>> > > > > >     2) How do we handle really high cardinality dimensions? E.g.:
>> > > > > > if I have sales as a fact and customers as a dimension, there will
>> > > > > > be millions of customers. A store is a good candidate to keep in
>> > > > > > memory, but customers are not. What's the recommended setting while
>> > > > > > creating the cube to handle such a case?
>> > > > > >
>> > > > > > Regards,
>> > > > > > Abhilash
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Regards,
>> > > >
>> > > > *Bin Mahone | 马洪宾*
>> > > > Apache Kylin: http://kylin.io
>> > > > Github: https://github.com/binmahone
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Regards,
>> >
>> > *Bin Mahone | 马洪宾*
>> > Apache Kylin: http://kylin.io
>> > Github: https://github.com/binmahone
>> >
>>
