For your question: yes all the 6 columns should come from lookup table; It will also work for query on fact table, kylin will smartly convert that;
2015-06-16 14:54 GMT+08:00 dong wang <[email protected]>: > Shaofeng, still another question, if as what you said "[region_id, > region_name, province_id, province_name, city_id, city_name]",then all of > the 6 columns should come from the lookup table lk1 instead of fact table > 'ft', therefore, will this hierarchy work for the fact table 'ft'? and if > we make join between fact table and lookup table, it is more reasonable to > create the hierarchy based on the lookup table instead of fact talbe, isn't > it?(the id columns also appear on the fact table, we can also create a > hierarchy on the fact table as well) > > 2015-06-16 14:39 GMT+08:00 dong wang <[email protected]>: > > > pls ignore the last reply above > > > > 2015-06-16 14:26 GMT+08:00 dong wang <[email protected]>: > > > >> based on the 3rd question above, why we should keep a "STAR" schema > >> instead of a "SNOW" schema? if we keep a "SNOW" schema, the 3rd problem > >> should be gone since there is only "(region_id, region_name)" in the > lookup > >> talble lk_region? > >> > >> 2015-06-16 13:55 GMT+08:00 dong wang <[email protected]>: > >> > >>> Thanks shaofeng~ > >>> 1, you mentioned that all the PK columns should be included in the > >>> SQL/Query, does it mean that all the PK columns should be only in the > >>> "JOIN" parts or be only in the "SELECT" parts, or either is OK as > long as > >>> it appears in the SQL statement? > >>> > >>> 2, suppose that all the PK columns are all in the SQL/Query based on > the > >>> answer of the 1st question above, which of the above approaches to > create > >>> derived dimension is the correct approach? !!! > >>> > >>> 3, as we know that Kylin wants user to have a STAR schema model, then > >>> take the following as an example: > >>> ====================Hive tables======================== > >>> fact table: ft > >>> region_id > >>> m1 > >>> m2 > >>> > >>> > >>> lookup table: lk1 > >>> region_id > >>> region_name > >>> province_id > >>> province_name > >>> city_id > >>> city_name > >>> > >>> as we can see that the fact table ft only has on region_id, thus when > >>> defining a cube with join, we can only join "region_id" on fact table > 'ft' > >>> and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are > >>> compound keys(region_id + province_id + city_id), as tested just now, > we > >>> cannot build such a cube successfully due to a "duplicate key" error > since > >>> only region_id cannot determine one unique record for the lookup table > >>> 'lk1', is it related to what you mentioned above? > >>> > >>> 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>: > >>> > >>>> I just realized that both approaches may not work as you expected; > >>>> > >>>> The "derived" columns are derived from foreign keys, in your case the > >>>> fk/pk > >>>> are composited keys: region_id + province_id + city_id; That's to say, > >>>> only > >>>> when all these columns appeared in your query, Kylin can derive the > >>>> names, > >>>> even if one region_id can determin the region name; > >>>> > >>>> Only if you're sure all these FKs always appeared in SQL, you can use > >>>> the > >>>> derived columns in this way; Otherwise, don't expect Kylin can smartly > >>>> conclude the region name with just region_id, as region_id is not PK, > >>>> there > >>>> is no guarrantee on the uniqueness; > >>>> > >>>> > >>>> > >>>> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>: > >>>> > >>>> > Approach 1: > >>>> > > >>>> > Dimension > >>>> > Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of > these > >>>> 3 > >>>> > columns come from fact table: ft) > >>>> > Derived 1: region_id, region_name (NOTE: it comes from lookup table: > >>>> lk1) > >>>> > Derived 2: province_id, province_name(NOTE: it comes from lookup > >>>> table: > >>>> > lk1) > >>>> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1) > >>>> > > >>>> > Approach 2 > >>>> > Dimension > >>>> > Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of > these > >>>> 3 > >>>> > columns come from fact table: ft) > >>>> > Derive 1: (region_name, province_name, city_name) //all is in one > >>>> derived > >>>> > dimension, and which comes from lookup table: lk1 > >>>> > > >>>> > I am not sure which of the above approaches is the correct way? > >>>> please help > >>>> > to confirm? > >>>> > > >>>> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>: > >>>> > > >>>> > > Yes it is expected; The ³derived² columns are those can be derived > >>>> by > >>>> > fks; > >>>> > > So in the cube we only keep the *_ids in dimensions; Kylin will > take > >>>> > > snapshot for the lookup table and load them into memory, in > >>>> run-time it > >>>> > > can map the ids to the names; > >>>> > > > >>>> > > > >>>> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote: > >>>> > > > >>>> > > >====================Hive tables======================== > >>>> > > >fact table: ft > >>>> > > >region_id > >>>> > > >province_id > >>>> > > >city_id > >>>> > > > > >>>> > > >lookup table: lk1 > >>>> > > >region_id > >>>> > > >region_name > >>>> > > >province_id > >>>> > > >province_name > >>>> > > >city_id > >>>> > > >city_name > >>>> > > > > >>>> > > > > >>>> > > >====================Cube Design======================== > >>>> > > > > >>>> > > >JOIN: inner > >>>> > > > > >>>> > > >ft.region_id=lk1.region_id > >>>> > > >and ft.province_id=lk1.province_id > >>>> > > >and ft.city_id=lk1.city_id > >>>> > > > > >>>> > > > > >>>> > > >Dimension > >>>> > > > > >>>> > > >Hierarhy 1: region_id -> province_id -> city_id (NOTE: all of > >>>> these 3 > >>>> > > >columns come from fact table: ft) > >>>> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup > >>>> table: > >>>> > lk1) > >>>> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup > >>>> table: > >>>> > > >lk1) > >>>> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table: > >>>> lk1) > >>>> > > > > >>>> > > > > >>>> > > >is it correct to create the derived dimensions? > >>>> > > > > >>>> > > > > >>>> > > >====================Cube Building======================== > >>>> > > >Build the cube: > >>>> > > > > >>>> > > >when the first step, the log says as below: > >>>> > > > > >>>> > > >CREATE EXTERNAL TABLE IF NOT EXISTS > >>>> > > > >>>> > > >>>> > >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212 > >>>> > > >_4dd5_9ffa_f100e4c323e1 > >>>> > > >( > >>>> > > >test_cube_1_DT string > >>>> > > >,test_cube_1_region_id int > >>>> > > >,test_cube_1_province_id int > >>>> > > >,test_cube_1_city_id int > >>>> > > >,test_cube_1_m1 bigint > >>>> > > >,test_cube_1_m2 bigint > >>>> > > >) > >>>> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177' > >>>> > > >STORED AS SEQUENCEFILE > >>>> > > > > >>>> > > > > >>>> > > >we can see that "region_name", "province_name", "city_name" don't > >>>> appear > >>>> > > >in > >>>> > > >the intermediate hive table created in the 1st step, instead, it > >>>> only > >>>> > > >contains "*_id" columns as listed above, is it expected? > >>>> > > > >>>> > > > >>>> > > >>>> > >>> > >>> > >> > > >
