Excellent explanation! Thanks you Hongbin!

> Afterwards, Kylin will replace DimA values with DimB values(since both of
> them are dimensions on lookup table, Kylin can load the whole lookup table
> into memory and build a mapping for them)

A further question, does it means Kylin will load all needed lookup tables for 
such queries each time? If that's the case, I think I should avoid using 
derived dimension on big dimension like users, right?

Regards,
Dayue


> 在 2015年7月28日,下午5:52,hongbin ma <[email protected]> 写道:
> 
> derived column is used when one dimension can be deduced from another. *(They
> must be dimension on lookup table)*
> for example, suppose we have a lookup table:
> 
> dimA    dimB  dimC
> 1           a        ?
> 2           b        ?
> 3           c        ?
> 4           a        ?
> 
> in this case, given a value in DimA, the value of DimB is determined, so we
> say dimB can be derived from DimA. When we build a cube that contains both
> DimA and DimB, we simple include DimA, and marking DimB as derived. Derived
> column(DimB) does not participant in cuboid spanning, thus reducing the
> possible combinations:
> 
> original combinations:
> ABC,AB,AC,BC,A,B,C
> 
> combinations when driving B from A:
> AC,A,C
> 
> at Runtime, in case queries like "select count(*) from fact_table inner
> join looup1 group by looup1 .dimB", it is expecting cuboid containing DimB
> to answer the query. However, DimB will appear in none of the cuboids due
> to derived optimization. In this case, we modify the execution plan to make
> it group by  DimA(its deriving dimension) first, we'll get intermediate
> answer like:
> 
> DimA  count(*)
> 1          1
> 2          1
> 3          1
> 4          1
> 
> Afterwards, Kylin will replace DimA values with DimB values(since both of
> them are dimensions on lookup table, Kylin can load the whole lookup table
> into memory and build a mapping for them), and the intermediate result
> becomes:
> 
> DimB  count(*)
> a          1
> b          1
> c          1
> a          1
> 
> After this, the runtime SQL engine(calcite) will further aggregate the
> intermediate result to:
> 
> DimB  count(*)
> a          2
> b          1
> c          1
> 
> this step happens at query runtime, this is what it means "at the cost of
> extra runtime aggregation"
> 
> On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>> Hi Kylin Developers,
>> 
>> I'm confused about when to use derived dimension and how it will affects
>> query performance. So far I've known that derived dimension will not be
>> included in row key and value, then how kylin achieves to implement query
>> grouping by these dimensions? The "Design Cube in Kylin" document says
>> "Derived dimension reduces combination from 2N to 2 at the cost of extra
>> runtime aggregation." I find it hard to understand.
>> 
>> Also, it's not clear to me what is the requirement for using derived
>> dimension. In "Add Dimension" step, there is a tip saying "Using Derived
>> for One-One relationship between columns, like ID and Name". Does it mean
>> derived dimension should be unique key of the lookup table? If that's the
>> case, the applicable scenario seams limited.
>> 
>> I'm aware that there's a JIRA
>> https://issues.apache.org/jira/browse/KYLIN-887 <
>> https://issues.apache.org/jira/browse/KYLIN-887 
>> <https://issues.apache.org/jira/browse/KYLIN-887>> for it, but I can't wait
>> to know the answer :-)
>> 
>> Regards,
>> Dayue
> 
> 
> 
> 
> -- 
> Regards,
> 
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io <http://kylin.io/>
> Github: https://github.com/binmahone <https://github.com/binmahone>

Reply via email to