Re: is it expected when building a cube with join?

dong wang Mon, 15 Jun 2015 23:39:40 -0700

pls ignore the last reply above

2015-06-16 14:26 GMT+08:00 dong wang <[email protected]>:


> based on the 3rd question above, why we should keep a "STAR" schema
> instead of a "SNOW" schema? if we keep a "SNOW" schema, the 3rd problem
> should be gone since there is only "(region_id, region_name)" in the lookup
> talble lk_region?
>
> 2015-06-16 13:55 GMT+08:00 dong wang <[email protected]>:
>
>> Thanks shaofeng~
>> 1, you mentioned that all the PK columns should be included in the
>> SQL/Query, does it mean that all the PK columns should be only in the
>> "JOIN" parts or  be only in the "SELECT" parts, or either is OK as long as
>> it appears in the SQL statement?
>>
>> 2, suppose that all the PK columns are all in the SQL/Query based on the
>> answer of the 1st question above, which of the above approaches to create
>> derived dimension is the correct approach?  !!!
>>
>> 3, as we know that Kylin wants user to have a STAR schema model, then
>> take the following as an example:
>> ====================Hive tables========================
>> fact table: ft
>> region_id
>> m1
>> m2
>>
>>
>> lookup table: lk1
>> region_id
>> region_name
>> province_id
>> province_name
>> city_id
>> city_name
>>
>> as we can see that the fact table ft only has on region_id, thus when
>> defining a cube with join, we can only join "region_id" on fact table 'ft'
>> and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are
>> compound keys(region_id + province_id + city_id), as tested just now, we
>> cannot build such a cube successfully due to a "duplicate key" error since
>> only region_id cannot determine one unique record for the lookup table
>> 'lk1', is it related to what you mentioned above?
>>
>> 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>:
>>
>>> I just realized that both approaches may not work as you expected;
>>>
>>> The "derived" columns are derived from foreign keys, in your case the
>>> fk/pk
>>> are composited keys: region_id + province_id + city_id; That's to say,
>>> only
>>> when all these columns appeared in your query, Kylin can derive the
>>> names,
>>> even if one region_id can determin the region name;
>>>
>>> Only if you're sure all these FKs always appeared in SQL, you can use the
>>> derived columns in this way; Otherwise, don't expect Kylin can smartly
>>> conclude the region name with just region_id, as region_id is not PK,
>>> there
>>> is no guarrantee on the uniqueness;
>>>
>>>
>>>
>>> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>:
>>>
>>> > Approach 1:
>>> >
>>> > Dimension
>>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these 3
>>> > columns come from fact table: ft)
>>> > Derived 1: region_id, region_name (NOTE: it comes from lookup table:
>>> lk1)
>>> > Derived 2: province_id, province_name(NOTE: it comes from lookup table:
>>> > lk1)
>>> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
>>> >
>>> > Approach 2
>>> > Dimension
>>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these 3
>>> > columns come from fact table: ft)
>>> > Derive 1: (region_name, province_name, city_name)  //all is in one
>>> derived
>>> > dimension, and which comes from lookup table: lk1
>>> >
>>> > I am not sure which of the above approaches is the correct way? please
>>> help
>>> > to confirm?
>>> >
>>> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>:
>>> >
>>> > > Yes it is expected; The ³derived² columns are those can be derived by
>>> > fks;
>>> > > So in the cube we only keep the *_ids in dimensions; Kylin will take
>>> > > snapshot for the lookup table and load them into memory, in run-time
>>> it
>>> > > can map the ids to the names;
>>> > >
>>> > >
>>> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote:
>>> > >
>>> > > >====================Hive tables========================
>>> > > >fact table: ft
>>> > > >region_id
>>> > > >province_id
>>> > > >city_id
>>> > > >
>>> > > >lookup table: lk1
>>> > > >region_id
>>> > > >region_name
>>> > > >province_id
>>> > > >province_name
>>> > > >city_id
>>> > > >city_name
>>> > > >
>>> > > >
>>> > > >====================Cube Design========================
>>> > > >
>>> > > >JOIN:   inner
>>> > > >
>>> > > >ft.region_id=lk1.region_id
>>> > > >and ft.province_id=lk1.province_id
>>> > > >and ft.city_id=lk1.city_id
>>> > > >
>>> > > >
>>> > > >Dimension
>>> > > >
>>> > > >Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of
>>> these 3
>>> > > >columns come from fact table: ft)
>>> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup table:
>>> > lk1)
>>> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup
>>> table:
>>> > > >lk1)
>>> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
>>> > > >
>>> > > >
>>> > > >is it correct to create the derived dimensions?
>>> > > >
>>> > > >
>>> > > >====================Cube Building========================
>>> > > >Build the cube:
>>> > > >
>>> > > >when the first step, the log says as below:
>>> > > >
>>> > > >CREATE EXTERNAL TABLE IF NOT EXISTS
>>> > >
>>> >
>>> >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212
>>> > > >_4dd5_9ffa_f100e4c323e1
>>> > > >(
>>> > > >test_cube_1_DT string
>>> > > >,test_cube_1_region_id int
>>> > > >,test_cube_1_province_id int
>>> > > >,test_cube_1_city_id int
>>> > > >,test_cube_1_m1 bigint
>>> > > >,test_cube_1_m2 bigint
>>> > > >)
>>> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177'
>>> > > >STORED AS SEQUENCEFILE
>>> > > >
>>> > > >
>>> > > >we can see that "region_name", "province_name", "city_name" don't
>>> appear
>>> > > >in
>>> > > >the intermediate hive table created in the 1st step, instead, it
>>> only
>>> > > >contains "*_id" columns as listed above,  is it expected?
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: is it expected when building a cube with join?

Reply via email to