pls ignore the last reply above 2015-06-16 14:26 GMT+08:00 dong wang <[email protected]>:
> based on the 3rd question above, why we should keep a "STAR" schema > instead of a "SNOW" schema? if we keep a "SNOW" schema, the 3rd problem > should be gone since there is only "(region_id, region_name)" in the lookup > talble lk_region? > > 2015-06-16 13:55 GMT+08:00 dong wang <[email protected]>: > >> Thanks shaofeng~ >> 1, you mentioned that all the PK columns should be included in the >> SQL/Query, does it mean that all the PK columns should be only in the >> "JOIN" parts or be only in the "SELECT" parts, or either is OK as long as >> it appears in the SQL statement? >> >> 2, suppose that all the PK columns are all in the SQL/Query based on the >> answer of the 1st question above, which of the above approaches to create >> derived dimension is the correct approach? !!! >> >> 3, as we know that Kylin wants user to have a STAR schema model, then >> take the following as an example: >> ====================Hive tables======================== >> fact table: ft >> region_id >> m1 >> m2 >> >> >> lookup table: lk1 >> region_id >> region_name >> province_id >> province_name >> city_id >> city_name >> >> as we can see that the fact table ft only has on region_id, thus when >> defining a cube with join, we can only join "region_id" on fact table 'ft' >> and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are >> compound keys(region_id + province_id + city_id), as tested just now, we >> cannot build such a cube successfully due to a "duplicate key" error since >> only region_id cannot determine one unique record for the lookup table >> 'lk1', is it related to what you mentioned above? >> >> 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>: >> >>> I just realized that both approaches may not work as you expected; >>> >>> The "derived" columns are derived from foreign keys, in your case the >>> fk/pk >>> are composited keys: region_id + province_id + city_id; That's to say, >>> only >>> when all these columns appeared in your query, Kylin can derive the >>> names, >>> even if one region_id can determin the region name; >>> >>> Only if you're sure all these FKs always appeared in SQL, you can use the >>> derived columns in this way; Otherwise, don't expect Kylin can smartly >>> conclude the region name with just region_id, as region_id is not PK, >>> there >>> is no guarrantee on the uniqueness; >>> >>> >>> >>> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>: >>> >>> > Approach 1: >>> > >>> > Dimension >>> > Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of these 3 >>> > columns come from fact table: ft) >>> > Derived 1: region_id, region_name (NOTE: it comes from lookup table: >>> lk1) >>> > Derived 2: province_id, province_name(NOTE: it comes from lookup table: >>> > lk1) >>> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1) >>> > >>> > Approach 2 >>> > Dimension >>> > Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of these 3 >>> > columns come from fact table: ft) >>> > Derive 1: (region_name, province_name, city_name) //all is in one >>> derived >>> > dimension, and which comes from lookup table: lk1 >>> > >>> > I am not sure which of the above approaches is the correct way? please >>> help >>> > to confirm? >>> > >>> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>: >>> > >>> > > Yes it is expected; The ³derived² columns are those can be derived by >>> > fks; >>> > > So in the cube we only keep the *_ids in dimensions; Kylin will take >>> > > snapshot for the lookup table and load them into memory, in run-time >>> it >>> > > can map the ids to the names; >>> > > >>> > > >>> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote: >>> > > >>> > > >====================Hive tables======================== >>> > > >fact table: ft >>> > > >region_id >>> > > >province_id >>> > > >city_id >>> > > > >>> > > >lookup table: lk1 >>> > > >region_id >>> > > >region_name >>> > > >province_id >>> > > >province_name >>> > > >city_id >>> > > >city_name >>> > > > >>> > > > >>> > > >====================Cube Design======================== >>> > > > >>> > > >JOIN: inner >>> > > > >>> > > >ft.region_id=lk1.region_id >>> > > >and ft.province_id=lk1.province_id >>> > > >and ft.city_id=lk1.city_id >>> > > > >>> > > > >>> > > >Dimension >>> > > > >>> > > >Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of >>> these 3 >>> > > >columns come from fact table: ft) >>> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup table: >>> > lk1) >>> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup >>> table: >>> > > >lk1) >>> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1) >>> > > > >>> > > > >>> > > >is it correct to create the derived dimensions? >>> > > > >>> > > > >>> > > >====================Cube Building======================== >>> > > >Build the cube: >>> > > > >>> > > >when the first step, the log says as below: >>> > > > >>> > > >CREATE EXTERNAL TABLE IF NOT EXISTS >>> > > >>> > >>> >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212 >>> > > >_4dd5_9ffa_f100e4c323e1 >>> > > >( >>> > > >test_cube_1_DT string >>> > > >,test_cube_1_region_id int >>> > > >,test_cube_1_province_id int >>> > > >,test_cube_1_city_id int >>> > > >,test_cube_1_m1 bigint >>> > > >,test_cube_1_m2 bigint >>> > > >) >>> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177' >>> > > >STORED AS SEQUENCEFILE >>> > > > >>> > > > >>> > > >we can see that "region_name", "province_name", "city_name" don't >>> appear >>> > > >in >>> > > >the intermediate hive table created in the 1st step, instead, it >>> only >>> > > >contains "*_id" columns as listed above, is it expected? >>> > > >>> > > >>> > >>> >> >> >
