Re: is it expected when building a cube with join?

dong wang Mon, 15 Jun 2015 23:27:31 -0700

based on the 3rd question above, why we should keep a "STAR" schema instead
of a "SNOW" schema? if we keep a "SNOW" schema, the 3rd problem should be
gone since there is only "(region_id, region_name)" in the lookup talble
lk_region?


2015-06-16 13:55 GMT+08:00 dong wang <[email protected]>:

> Thanks shaofeng~
> 1, you mentioned that all the PK columns should be included in the
> SQL/Query, does it mean that all the PK columns should be only in the
> "JOIN" parts or  be only in the "SELECT" parts, or either is OK as long as
> it appears in the SQL statement?
>
> 2, suppose that all the PK columns are all in the SQL/Query based on the
> answer of the 1st question above, which of the above approaches to create
> derived dimension is the correct approach?  !!!
>
> 3, as we know that Kylin wants user to have a STAR schema model, then take
> the following as an example:
> ====================Hive tables========================
> fact table: ft
> region_id
> m1
> m2
>
>
> lookup table: lk1
> region_id
> region_name
> province_id
> province_name
> city_id
> city_name
>
> as we can see that the fact table ft only has on region_id, thus when
> defining a cube with join, we can only join "region_id" on fact table 'ft'
> and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are
> compound keys(region_id + province_id + city_id), as tested just now, we
> cannot build such a cube successfully due to a "duplicate key" error since
> only region_id cannot determine one unique record for the lookup table
> 'lk1', is it related to what you mentioned above?
>
> 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>:
>
>> I just realized that both approaches may not work as you expected;
>>
>> The "derived" columns are derived from foreign keys, in your case the
>> fk/pk
>> are composited keys: region_id + province_id + city_id; That's to say,
>> only
>> when all these columns appeared in your query, Kylin can derive the names,
>> even if one region_id can determin the region name;
>>
>> Only if you're sure all these FKs always appeared in SQL, you can use the
>> derived columns in this way; Otherwise, don't expect Kylin can smartly
>> conclude the region name with just region_id, as region_id is not PK,
>> there
>> is no guarrantee on the uniqueness;
>>
>>
>>
>> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>:
>>
>> > Approach 1:
>> >
>> > Dimension
>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these 3
>> > columns come from fact table: ft)
>> > Derived 1: region_id, region_name (NOTE: it comes from lookup table:
>> lk1)
>> > Derived 2: province_id, province_name(NOTE: it comes from lookup table:
>> > lk1)
>> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
>> >
>> > Approach 2
>> > Dimension
>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these 3
>> > columns come from fact table: ft)
>> > Derive 1: (region_name, province_name, city_name)  //all is in one
>> derived
>> > dimension, and which comes from lookup table: lk1
>> >
>> > I am not sure which of the above approaches is the correct way? please
>> help
>> > to confirm?
>> >
>> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>:
>> >
>> > > Yes it is expected; The ³derived² columns are those can be derived by
>> > fks;
>> > > So in the cube we only keep the *_ids in dimensions; Kylin will take
>> > > snapshot for the lookup table and load them into memory, in run-time
>> it
>> > > can map the ids to the names;
>> > >
>> > >
>> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote:
>> > >
>> > > >====================Hive tables========================
>> > > >fact table: ft
>> > > >region_id
>> > > >province_id
>> > > >city_id
>> > > >
>> > > >lookup table: lk1
>> > > >region_id
>> > > >region_name
>> > > >province_id
>> > > >province_name
>> > > >city_id
>> > > >city_name
>> > > >
>> > > >
>> > > >====================Cube Design========================
>> > > >
>> > > >JOIN:   inner
>> > > >
>> > > >ft.region_id=lk1.region_id
>> > > >and ft.province_id=lk1.province_id
>> > > >and ft.city_id=lk1.city_id
>> > > >
>> > > >
>> > > >Dimension
>> > > >
>> > > >Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of
>> these 3
>> > > >columns come from fact table: ft)
>> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup table:
>> > lk1)
>> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup
>> table:
>> > > >lk1)
>> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
>> > > >
>> > > >
>> > > >is it correct to create the derived dimensions?
>> > > >
>> > > >
>> > > >====================Cube Building========================
>> > > >Build the cube:
>> > > >
>> > > >when the first step, the log says as below:
>> > > >
>> > > >CREATE EXTERNAL TABLE IF NOT EXISTS
>> > >
>> >
>> >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212
>> > > >_4dd5_9ffa_f100e4c323e1
>> > > >(
>> > > >test_cube_1_DT string
>> > > >,test_cube_1_region_id int
>> > > >,test_cube_1_province_id int
>> > > >,test_cube_1_city_id int
>> > > >,test_cube_1_m1 bigint
>> > > >,test_cube_1_m2 bigint
>> > > >)
>> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177'
>> > > >STORED AS SEQUENCEFILE
>> > > >
>> > > >
>> > > >we can see that "region_name", "province_name", "city_name" don't
>> appear
>> > > >in
>> > > >the intermediate hive table created in the 1st step, instead, it only
>> > > >contains "*_id" columns as listed above,  is it expected?
>> > >
>> > >
>> >
>>
>
>

Re: is it expected when building a cube with join?

Reply via email to