Thanks shaofeng~ 1, you mentioned that all the PK columns should be included in the SQL/Query, does it mean that all the PK columns should be only in the "JOIN" parts or be only in the "SELECT" parts, or either is OK as long as it appears in the SQL statement?
2, suppose that all the PK columns are all in the SQL/Query based on the answer of the 1st question above, which of the above approaches to create derived dimension is the correct approach? !!! 3, as we know that Kylin wants user to have a STAR schema model, then take the following as an example: ====================Hive tables======================== fact table: ft region_id m1 m2 lookup table: lk1 region_id region_name province_id province_name city_id city_name as we can see that the fact table ft only has on region_id, thus when defining a cube with join, we can only join "region_id" on fact table 'ft' and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are compound keys(region_id + province_id + city_id), as tested just now, we cannot build such a cube successfully due to a "duplicate key" error since only region_id cannot determine one unique record for the lookup table 'lk1', is it related to what you mentioned above? 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>: > I just realized that both approaches may not work as you expected; > > The "derived" columns are derived from foreign keys, in your case the fk/pk > are composited keys: region_id + province_id + city_id; That's to say, only > when all these columns appeared in your query, Kylin can derive the names, > even if one region_id can determin the region name; > > Only if you're sure all these FKs always appeared in SQL, you can use the > derived columns in this way; Otherwise, don't expect Kylin can smartly > conclude the region name with just region_id, as region_id is not PK, there > is no guarrantee on the uniqueness; > > > > 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>: > > > Approach 1: > > > > Dimension > > Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of these 3 > > columns come from fact table: ft) > > Derived 1: region_id, region_name (NOTE: it comes from lookup table: lk1) > > Derived 2: province_id, province_name(NOTE: it comes from lookup table: > > lk1) > > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1) > > > > Approach 2 > > Dimension > > Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of these 3 > > columns come from fact table: ft) > > Derive 1: (region_name, province_name, city_name) //all is in one > derived > > dimension, and which comes from lookup table: lk1 > > > > I am not sure which of the above approaches is the correct way? please > help > > to confirm? > > > > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>: > > > > > Yes it is expected; The ³derived² columns are those can be derived by > > fks; > > > So in the cube we only keep the *_ids in dimensions; Kylin will take > > > snapshot for the lookup table and load them into memory, in run-time it > > > can map the ids to the names; > > > > > > > > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote: > > > > > > >====================Hive tables======================== > > > >fact table: ft > > > >region_id > > > >province_id > > > >city_id > > > > > > > >lookup table: lk1 > > > >region_id > > > >region_name > > > >province_id > > > >province_name > > > >city_id > > > >city_name > > > > > > > > > > > >====================Cube Design======================== > > > > > > > >JOIN: inner > > > > > > > >ft.region_id=lk1.region_id > > > >and ft.province_id=lk1.province_id > > > >and ft.city_id=lk1.city_id > > > > > > > > > > > >Dimension > > > > > > > >Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of these > 3 > > > >columns come from fact table: ft) > > > >Derived 1: region_id, region_name (NOTE: it comes from lookup table: > > lk1) > > > >Derived 2: province_id, province_name(NOTE: it comes from lookup > table: > > > >lk1) > > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1) > > > > > > > > > > > >is it correct to create the derived dimensions? > > > > > > > > > > > >====================Cube Building======================== > > > >Build the cube: > > > > > > > >when the first step, the log says as below: > > > > > > > >CREATE EXTERNAL TABLE IF NOT EXISTS > > > > > > >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212 > > > >_4dd5_9ffa_f100e4c323e1 > > > >( > > > >test_cube_1_DT string > > > >,test_cube_1_region_id int > > > >,test_cube_1_province_id int > > > >,test_cube_1_city_id int > > > >,test_cube_1_m1 bigint > > > >,test_cube_1_m2 bigint > > > >) > > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177' > > > >STORED AS SEQUENCEFILE > > > > > > > > > > > >we can see that "region_name", "province_name", "city_name" don't > appear > > > >in > > > >the intermediate hive table created in the 1st step, instead, it only > > > >contains "*_id" columns as listed above, is it expected? > > > > > > > > >
