Shaofeng, still another question, if as what you said "[region_id,
region_name, province_id, province_name, city_id, city_name]",then all of
the 6 columns should come from the lookup table lk1 instead of fact table
'ft', therefore, will this hierarchy work for the fact table 'ft'?  and if
we make join between fact table and lookup table, it is more reasonable to
create the hierarchy based on the lookup table instead of fact talbe, isn't
it?(the id columns also appear on the fact table, we can also create a
hierarchy on the fact table as well)

2015-06-16 14:39 GMT+08:00 dong wang <[email protected]>:

> pls ignore the last reply above
>
> 2015-06-16 14:26 GMT+08:00 dong wang <[email protected]>:
>
>> based on the 3rd question above, why we should keep a "STAR" schema
>> instead of a "SNOW" schema? if we keep a "SNOW" schema, the 3rd problem
>> should be gone since there is only "(region_id, region_name)" in the lookup
>> talble lk_region?
>>
>> 2015-06-16 13:55 GMT+08:00 dong wang <[email protected]>:
>>
>>> Thanks shaofeng~
>>> 1, you mentioned that all the PK columns should be included in the
>>> SQL/Query, does it mean that all the PK columns should be only in the
>>> "JOIN" parts or  be only in the "SELECT" parts, or either is OK as long as
>>> it appears in the SQL statement?
>>>
>>> 2, suppose that all the PK columns are all in the SQL/Query based on the
>>> answer of the 1st question above, which of the above approaches to create
>>> derived dimension is the correct approach?  !!!
>>>
>>> 3, as we know that Kylin wants user to have a STAR schema model, then
>>> take the following as an example:
>>> ====================Hive tables========================
>>> fact table: ft
>>> region_id
>>> m1
>>> m2
>>>
>>>
>>> lookup table: lk1
>>> region_id
>>> region_name
>>> province_id
>>> province_name
>>> city_id
>>> city_name
>>>
>>> as we can see that the fact table ft only has on region_id, thus when
>>> defining a cube with join, we can only join "region_id" on fact table 'ft'
>>> and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are
>>> compound keys(region_id + province_id + city_id), as tested just now, we
>>> cannot build such a cube successfully due to a "duplicate key" error since
>>> only region_id cannot determine one unique record for the lookup table
>>> 'lk1', is it related to what you mentioned above?
>>>
>>> 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>:
>>>
>>>> I just realized that both approaches may not work as you expected;
>>>>
>>>> The "derived" columns are derived from foreign keys, in your case the
>>>> fk/pk
>>>> are composited keys: region_id + province_id + city_id; That's to say,
>>>> only
>>>> when all these columns appeared in your query, Kylin can derive the
>>>> names,
>>>> even if one region_id can determin the region name;
>>>>
>>>> Only if you're sure all these FKs always appeared in SQL, you can use
>>>> the
>>>> derived columns in this way; Otherwise, don't expect Kylin can smartly
>>>> conclude the region name with just region_id, as region_id is not PK,
>>>> there
>>>> is no guarrantee on the uniqueness;
>>>>
>>>>
>>>>
>>>> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>:
>>>>
>>>> > Approach 1:
>>>> >
>>>> > Dimension
>>>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these
>>>> 3
>>>> > columns come from fact table: ft)
>>>> > Derived 1: region_id, region_name (NOTE: it comes from lookup table:
>>>> lk1)
>>>> > Derived 2: province_id, province_name(NOTE: it comes from lookup
>>>> table:
>>>> > lk1)
>>>> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
>>>> >
>>>> > Approach 2
>>>> > Dimension
>>>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of these
>>>> 3
>>>> > columns come from fact table: ft)
>>>> > Derive 1: (region_name, province_name, city_name)  //all is in one
>>>> derived
>>>> > dimension, and which comes from lookup table: lk1
>>>> >
>>>> > I am not sure which of the above approaches is the correct way?
>>>> please help
>>>> > to confirm?
>>>> >
>>>> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>:
>>>> >
>>>> > > Yes it is expected; The ³derived² columns are those can be derived
>>>> by
>>>> > fks;
>>>> > > So in the cube we only keep the *_ids in dimensions; Kylin will take
>>>> > > snapshot for the lookup table and load them into memory, in
>>>> run-time it
>>>> > > can map the ids to the names;
>>>> > >
>>>> > >
>>>> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote:
>>>> > >
>>>> > > >====================Hive tables========================
>>>> > > >fact table: ft
>>>> > > >region_id
>>>> > > >province_id
>>>> > > >city_id
>>>> > > >
>>>> > > >lookup table: lk1
>>>> > > >region_id
>>>> > > >region_name
>>>> > > >province_id
>>>> > > >province_name
>>>> > > >city_id
>>>> > > >city_name
>>>> > > >
>>>> > > >
>>>> > > >====================Cube Design========================
>>>> > > >
>>>> > > >JOIN:   inner
>>>> > > >
>>>> > > >ft.region_id=lk1.region_id
>>>> > > >and ft.province_id=lk1.province_id
>>>> > > >and ft.city_id=lk1.city_id
>>>> > > >
>>>> > > >
>>>> > > >Dimension
>>>> > > >
>>>> > > >Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of
>>>> these 3
>>>> > > >columns come from fact table: ft)
>>>> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup
>>>> table:
>>>> > lk1)
>>>> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup
>>>> table:
>>>> > > >lk1)
>>>> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table:
>>>> lk1)
>>>> > > >
>>>> > > >
>>>> > > >is it correct to create the derived dimensions?
>>>> > > >
>>>> > > >
>>>> > > >====================Cube Building========================
>>>> > > >Build the cube:
>>>> > > >
>>>> > > >when the first step, the log says as below:
>>>> > > >
>>>> > > >CREATE EXTERNAL TABLE IF NOT EXISTS
>>>> > >
>>>> >
>>>> >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212
>>>> > > >_4dd5_9ffa_f100e4c323e1
>>>> > > >(
>>>> > > >test_cube_1_DT string
>>>> > > >,test_cube_1_region_id int
>>>> > > >,test_cube_1_province_id int
>>>> > > >,test_cube_1_city_id int
>>>> > > >,test_cube_1_m1 bigint
>>>> > > >,test_cube_1_m2 bigint
>>>> > > >)
>>>> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177'
>>>> > > >STORED AS SEQUENCEFILE
>>>> > > >
>>>> > > >
>>>> > > >we can see that "region_name", "province_name", "city_name" don't
>>>> appear
>>>> > > >in
>>>> > > >the intermediate hive table created in the 1st step, instead, it
>>>> only
>>>> > > >contains "*_id" columns as listed above,  is it expected?
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>

Reply via email to