Re: is it expected when building a cube with join?

dong wang Mon, 15 Jun 2015 22:56:07 -0700

Thanks shaofeng~
1, you mentioned that all the PK columns should be included in the
SQL/Query, does it mean that all the PK columns should be only in the
"JOIN" parts or  be only in the "SELECT" parts, or either is OK as long as
it appears in the SQL statement?


2, suppose that all the PK columns are all in the SQL/Query based on the
answer of the 1st question above, which of the above approaches to create
derived dimension is the correct approach?  !!!

3, as we know that Kylin wants user to have a STAR schema model, then take
the following as an example:
====================Hive tables========================
fact table: ft
region_id
m1
m2


lookup table: lk1
region_id
region_name
province_id
province_name
city_id
city_name

as we can see that the fact table ft only has on region_id, thus when
defining a cube with join, we can only join "region_id" on fact table 'ft'
and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are
compound keys(region_id + province_id + city_id), as tested just now, we
cannot build such a cube successfully due to a "duplicate key" error since
only region_id cannot determine one unique record for the lookup table
'lk1', is it related to what you mentioned above?

2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>:

> I just realized that both approaches may not work as you expected;
>
> The "derived" columns are derived from foreign keys, in your case the fk/pk
> are composited keys: region_id + province_id + city_id; That's to say, only
> when all these columns appeared in your query, Kylin can derive the names,
> even if one region_id can determin the region name;
>
> Only if you're sure all these FKs always appeared in SQL, you can use the
> derived columns in this way; Otherwise, don't expect Kylin can smartly
> conclude the region name with just region_id, as region_id is not PK, there
> is no guarrantee on the uniqueness;
>
>
>
> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>:
>
> > Approach 1:
> >
> > Dimension
> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these 3
> > columns come from fact table: ft)
> > Derived 1: region_id, region_name (NOTE: it comes from lookup table: lk1)
> > Derived 2: province_id, province_name(NOTE: it comes from lookup table:
> > lk1)
> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
> >
> > Approach 2
> > Dimension
> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these 3
> > columns come from fact table: ft)
> > Derive 1: (region_name, province_name, city_name)  //all is in one
> derived
> > dimension, and which comes from lookup table: lk1
> >
> > I am not sure which of the above approaches is the correct way? please
> help
> > to confirm?
> >
> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>:
> >
> > > Yes it is expected; The ³derived² columns are those can be derived by
> > fks;
> > > So in the cube we only keep the *_ids in dimensions; Kylin will take
> > > snapshot for the lookup table and load them into memory, in run-time it
> > > can map the ids to the names;
> > >
> > >
> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote:
> > >
> > > >====================Hive tables========================
> > > >fact table: ft
> > > >region_id
> > > >province_id
> > > >city_id
> > > >
> > > >lookup table: lk1
> > > >region_id
> > > >region_name
> > > >province_id
> > > >province_name
> > > >city_id
> > > >city_name
> > > >
> > > >
> > > >====================Cube Design========================
> > > >
> > > >JOIN:   inner
> > > >
> > > >ft.region_id=lk1.region_id
> > > >and ft.province_id=lk1.province_id
> > > >and ft.city_id=lk1.city_id
> > > >
> > > >
> > > >Dimension
> > > >
> > > >Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these
> 3
> > > >columns come from fact table: ft)
> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup table:
> > lk1)
> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup
> table:
> > > >lk1)
> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
> > > >
> > > >
> > > >is it correct to create the derived dimensions?
> > > >
> > > >
> > > >====================Cube Building========================
> > > >Build the cube:
> > > >
> > > >when the first step, the log says as below:
> > > >
> > > >CREATE EXTERNAL TABLE IF NOT EXISTS
> > >
> >
> >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212
> > > >_4dd5_9ffa_f100e4c323e1
> > > >(
> > > >test_cube_1_DT string
> > > >,test_cube_1_region_id int
> > > >,test_cube_1_province_id int
> > > >,test_cube_1_city_id int
> > > >,test_cube_1_m1 bigint
> > > >,test_cube_1_m2 bigint
> > > >)
> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177'
> > > >STORED AS SEQUENCEFILE
> > > >
> > > >
> > > >we can see that "region_name", "province_name", "city_name" don't
> appear
> > > >in
> > > >the intermediate hive table created in the 1st step, instead, it only
> > > >contains "*_id" columns as listed above,  is it expected?
> > >
> > >
> >
>

Re: is it expected when building a cube with join?

Reply via email to