derived column is used when one dimension can be deduced from another. *(They must be dimension on lookup table)* for example, suppose we have a lookup table:
dimA dimB dimC 1 a ? 2 b ? 3 c ? 4 a ? in this case, given a value in DimA, the value of DimB is determined, so we say dimB can be derived from DimA. When we build a cube that contains both DimA and DimB, we simple include DimA, and marking DimB as derived. Derived column(DimB) does not participant in cuboid spanning, thus reducing the possible combinations: original combinations: ABC,AB,AC,BC,A,B,C combinations when driving B from A: AC,A,C at Runtime, in case queries like "select count(*) from fact_table inner join looup1 group by looup1 .dimB", it is expecting cuboid containing DimB to answer the query. However, DimB will appear in none of the cuboids due to derived optimization. In this case, we modify the execution plan to make it group by DimA(its deriving dimension) first, we'll get intermediate answer like: DimA count(*) 1 1 2 1 3 1 4 1 Afterwards, Kylin will replace DimA values with DimB values(since both of them are dimensions on lookup table, Kylin can load the whole lookup table into memory and build a mapping for them), and the intermediate result becomes: DimB count(*) a 1 b 1 c 1 a 1 After this, the runtime SQL engine(calcite) will further aggregate the intermediate result to: DimB count(*) a 2 b 1 c 1 this step happens at query runtime, this is what it means "at the cost of extra runtime aggregation" On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <[email protected]> wrote: > Hi Kylin Developers, > > I'm confused about when to use derived dimension and how it will affects > query performance. So far I've known that derived dimension will not be > included in row key and value, then how kylin achieves to implement query > grouping by these dimensions? The "Design Cube in Kylin" document says > "Derived dimension reduces combination from 2N to 2 at the cost of extra > runtime aggregation." I find it hard to understand. > > Also, it's not clear to me what is the requirement for using derived > dimension. In "Add Dimension" step, there is a tip saying "Using Derived > for One-One relationship between columns, like ID and Name". Does it mean > derived dimension should be unique key of the lookup table? If that's the > case, the applicable scenario seams limited. > > I'm aware that there's a JIRA > https://issues.apache.org/jira/browse/KYLIN-887 < > https://issues.apache.org/jira/browse/KYLIN-887> for it, but I can't wait > to know the answer :-) > > Regards, > Dayue -- Regards, *Bin Mahone | 马洪宾* Apache Kylin: http://kylin.io Github: https://github.com/binmahone
