Excellent explanation! Thanks you Hongbin! > Afterwards, Kylin will replace DimA values with DimB values(since both of > them are dimensions on lookup table, Kylin can load the whole lookup table > into memory and build a mapping for them)
A further question, does it means Kylin will load all needed lookup tables for such queries each time? If that's the case, I think I should avoid using derived dimension on big dimension like users, right? Regards, Dayue > 在 2015年7月28日,下午5:52,hongbin ma <[email protected]> 写道: > > derived column is used when one dimension can be deduced from another. *(They > must be dimension on lookup table)* > for example, suppose we have a lookup table: > > dimA dimB dimC > 1 a ? > 2 b ? > 3 c ? > 4 a ? > > in this case, given a value in DimA, the value of DimB is determined, so we > say dimB can be derived from DimA. When we build a cube that contains both > DimA and DimB, we simple include DimA, and marking DimB as derived. Derived > column(DimB) does not participant in cuboid spanning, thus reducing the > possible combinations: > > original combinations: > ABC,AB,AC,BC,A,B,C > > combinations when driving B from A: > AC,A,C > > at Runtime, in case queries like "select count(*) from fact_table inner > join looup1 group by looup1 .dimB", it is expecting cuboid containing DimB > to answer the query. However, DimB will appear in none of the cuboids due > to derived optimization. In this case, we modify the execution plan to make > it group by DimA(its deriving dimension) first, we'll get intermediate > answer like: > > DimA count(*) > 1 1 > 2 1 > 3 1 > 4 1 > > Afterwards, Kylin will replace DimA values with DimB values(since both of > them are dimensions on lookup table, Kylin can load the whole lookup table > into memory and build a mapping for them), and the intermediate result > becomes: > > DimB count(*) > a 1 > b 1 > c 1 > a 1 > > After this, the runtime SQL engine(calcite) will further aggregate the > intermediate result to: > > DimB count(*) > a 2 > b 1 > c 1 > > this step happens at query runtime, this is what it means "at the cost of > extra runtime aggregation" > > On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <[email protected] > <mailto:[email protected]>> wrote: > >> Hi Kylin Developers, >> >> I'm confused about when to use derived dimension and how it will affects >> query performance. So far I've known that derived dimension will not be >> included in row key and value, then how kylin achieves to implement query >> grouping by these dimensions? The "Design Cube in Kylin" document says >> "Derived dimension reduces combination from 2N to 2 at the cost of extra >> runtime aggregation." I find it hard to understand. >> >> Also, it's not clear to me what is the requirement for using derived >> dimension. In "Add Dimension" step, there is a tip saying "Using Derived >> for One-One relationship between columns, like ID and Name". Does it mean >> derived dimension should be unique key of the lookup table? If that's the >> case, the applicable scenario seams limited. >> >> I'm aware that there's a JIRA >> https://issues.apache.org/jira/browse/KYLIN-887 < >> https://issues.apache.org/jira/browse/KYLIN-887 >> <https://issues.apache.org/jira/browse/KYLIN-887>> for it, but I can't wait >> to know the answer :-) >> >> Regards, >> Dayue > > > > > -- > Regards, > > *Bin Mahone | 马洪宾* > Apache Kylin: http://kylin.io <http://kylin.io/> > Github: https://github.com/binmahone <https://github.com/binmahone>
