Re: Question about derived dimension

hongbin ma Tue, 28 Jul 2015 02:53:43 -0700

derived column is used when one dimension can be deduced from another. *(They
must be dimension on lookup table)*
for example, suppose we have a lookup table:

dimA    dimB  dimC
1           a        ?
2           b        ?
3           c        ?
4           a        ?

in this case, given a value in DimA, the value of DimB is determined, so we
say dimB can be derived from DimA. When we build a cube that contains both
DimA and DimB, we simple include DimA, and marking DimB as derived. Derived
column(DimB) does not participant in cuboid spanning, thus reducing the
possible combinations:

original combinations:
ABC,AB,AC,BC,A,B,C

combinations when driving B from A:
AC,A,C

at Runtime, in case queries like "select count(*) from fact_table inner
join looup1 group by looup1 .dimB", it is expecting cuboid containing DimB
to answer the query. However, DimB will appear in none of the cuboids due
to derived optimization. In this case, we modify the execution plan to make
it group by  DimA(its deriving dimension) first, we'll get intermediate
answer like:

DimA  count(*)
1          1
2          1
3          1
4          1

Afterwards, Kylin will replace DimA values with DimB values(since both of
them are dimensions on lookup table, Kylin can load the whole lookup table
into memory and build a mapping for them), and the intermediate result
becomes:

DimB  count(*)
a          1
b          1
c          1
a          1

After this, the runtime SQL engine(calcite) will further aggregate the
intermediate result to:

DimB  count(*)
a          2
b          1
c          1

this step happens at query runtime, this is what it means "at the cost of
extra runtime aggregation"

On Tue, Jul 28, 2015 at 11:21 AM, Dayue Gao <[email protected]> wrote:

> Hi Kylin Developers,
>
> I'm confused about when to use derived dimension and how it will affects
> query performance. So far I've known that derived dimension will not be
> included in row key and value, then how kylin achieves to implement query
> grouping by these dimensions? The "Design Cube in Kylin" document says
> "Derived dimension reduces combination from 2N to 2 at the cost of extra
> runtime aggregation." I find it hard to understand.
>
> Also, it's not clear to me what is the requirement for using derived
> dimension. In "Add Dimension" step, there is a tip saying "Using Derived
> for One-One relationship between columns, like ID and Name". Does it mean
> derived dimension should be unique key of the lookup table? If that's the
> case, the applicable scenario seams limited.
>
> I'm aware that there's a JIRA
> https://issues.apache.org/jira/browse/KYLIN-887 <
> https://issues.apache.org/jira/browse/KYLIN-887> for it, but I can't wait
> to know the answer :-)
>
> Regards,
> Dayue

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Question about derived dimension

Reply via email to