The design of lookup_table1 is problematic; you cannot join it with tables like fact_table2 or fact_table3 on region_id alone. Here is an example:
lookup_table1 (assume it has two records with region_id R1):

region_id, region_name, province_id, province_name, city_id, city_name
R1, EAST, P1, ANHUI, C1, HEFEI
R1, EAST, P2, JIANGSU, C2, NANJING

fact_table2 (assume it has only one record, with region_id R1):

region_id, year_id, M1, M2, ..., Mn
R1, 2015, m1, m2, ...

Join fact_table2 with lookup_table1 on region_id and you will get two records:

R1, EAST, 2015, m1, m2, ...
R1, EAST, 2015, m1, m2, ...

This is wrong, as the measures are repeated. A better design would be to have a lookup table for each layer:

Region: region_id, region_name
Province: province_id, province_name
City: city_id, city_name

Then all the names can be derived from the IDs on the fact table. Just my two cents; let me know if I'm wrong...

On 6/17/15, 9:39 PM, "dong wang" <[email protected]> wrote:

>Hi Shaofeng, Hongbin,
>
>For case 1: there seems to be a weakness. Suppose we have a star-schema
>model:
>
>lookup_table1:
>region_id, region_name, province_id, province_name, city_id, city_name
>
>fact_table1:
>region_id, province_id, city_id, M1, M2, ..., Mn
>
>fact_table2:
>region_id, year_id, M1, M2, ..., Mn
>
>fact_table3:
>region_id, province_id, quarter_id, M1, M2, ..., Mn
>
>As mentioned, only fact_table1 can successfully join with lookup_table1;
>fact_table2 and fact_table3 cannot share lookup_table1. Is that right?
>
>For case 2: sorry for the misunderstanding. What I actually mean is that
>if the VALUE of city_name changes for some rows, we have to build the
>whole cube again, right? Could we instead scan all the IDs and values in
>the fact table and build a hashmap, and before Kylin returns data to the
>user, join only the LITTLE amount of qualified query data with that
>hashmap? The reason I think we should support such a feature is that on a
>small Hadoop cluster with a HUGE amount of data, the Hive join in the
>first cube-building step is VERY time-consuming!
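The fan-out described above can be reproduced with a toy join. This is a minimal sketch using SQLite from Python; table and column names follow the example in the thread, and the single measure m1 stands in for M1..Mn:

```python
import sqlite3

# In-memory database with the example tables from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup_table1 (
    region_id TEXT, region_name TEXT,
    province_id TEXT, province_name TEXT,
    city_id TEXT, city_name TEXT);
INSERT INTO lookup_table1 VALUES
    ('R1', 'EAST', 'P1', 'ANHUI',   'C1', 'HEFEI'),
    ('R1', 'EAST', 'P2', 'JIANGSU', 'C2', 'NANJING');

CREATE TABLE fact_table2 (region_id TEXT, year_id INTEGER, m1 REAL);
INSERT INTO fact_table2 VALUES ('R1', 2015, 100.0);
""")

# Joining on region_id alone fans out: the single fact row matches
# both lookup rows, so the measure m1 is duplicated.
rows = conn.execute("""
    SELECT f.region_id, l.region_name, f.year_id, f.m1
    FROM fact_table2 f
    JOIN lookup_table1 l ON f.region_id = l.region_id
""").fetchall()
print(rows)  # two identical rows; SUM(m1) would report 200, not 100

# With a per-layer lookup table (Region: region_id, region_name) the
# join key is unique, so the join preserves the fact-row count.
conn.executescript("""
CREATE TABLE region (region_id TEXT PRIMARY KEY, region_name TEXT);
INSERT INTO region VALUES ('R1', 'EAST');
""")
rows2 = conn.execute("""
    SELECT f.region_id, r.region_name, f.year_id, f.m1
    FROM fact_table2 f
    JOIN region r ON f.region_id = r.region_id
""").fetchall()
print(rows2)  # exactly one row; measures are not inflated
```

The key point is that the measure inflation comes from the lookup table's non-unique join key, which is exactly why the per-layer design avoids it.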
>
>2015-06-17 17:27 GMT+08:00 Shi, Shaofeng <[email protected]>:
>
>> Hi dong, "Derived" is only for lookup tables (and the lookup table
>> needs to ensure the uniqueness of its PK, so with one PK we can find
>> exactly one value for the derived column).
>>
>> For case 1, Kylin will not support deriving from the fact table. The
>> problem is: how can we know/ensure that one region_id maps to only one
>> region_name if you repeat region_id and region_name in each row of the
>> fact table? If there is more than one mapping, which one should be
>> taken?
>>
>> If a table or column name is changed, the cubes need to be updated to
>> use the new name, and a rebuild is needed; so you should avoid changing
>> table/column names. (I guess that's not what you mean, is it?)
>>
>> On 6/17/15, 2:50 PM, "dong wang" <[email protected]> wrote:
>>
>> >Usually we have two common cases:
>> >
>> >1. We only have a fact table that has already been joined with the
>> >lookup tables; take it as one wide table.
>> >
>> >2. We have a star-schema data model, and when defining the cube we
>> >define the join between the fact table and the lookup tables.
>> >
>> >Currently:
>> >
>> >For the 1st case, we cannot create derived dimensions based on the
>> >fact table; they can be created only from lookup tables. For example,
>> >if the fact table is: region_id, region_name, province_id,
>> >province_name, city_id, city_name, M1, M2, ..., Mn, we want to create
>> >a hierarchy dimension region_id -> province_id -> city_id, and a
>> >derived dimension (region_name, province_name, city_name) as well. If
>> >we could achieve this, there would be no complicated joins at all,
>> >and the join takes much time in the first step of building the cube.
>> >Will Kylin support such a dimension design?
>> >
>> >For the 2nd case, even though we make the join, if the column names
>> >are changed for some reason, do we have to build the whole cube again
>> >instead of only updating the dimension data?
>>
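Dong's case-2 suggestion, keeping only IDs in the cube and joining the small, already-aggregated query result against an id-to-name hashmap just before returning it, can be sketched like this. This is a hypothetical illustration, not Kylin's actual API; the function name and row shapes are made up for the example:

```python
# Hypothetical sketch of the "late join" idea from case 2: the cube
# stores only city_id, and names are attached at query-return time.
city_names = {"C1": "HEFEI", "C2": "NANJING"}  # small id -> name hashmap

def decorate_results(query_rows, name_map):
    """Attach city_name to each qualified result row of (city_id, measure)."""
    return [
        {"city_id": cid, "city_name": name_map.get(cid), "measure": m}
        for cid, m in query_rows
    ]

# The aggregated query result is tiny compared to the fact table, so
# this in-memory join is cheap; if a city_name value changes, only the
# hashmap needs rebuilding, not the whole cube.
rows = decorate_results([("C1", 42.0), ("C2", 7.5)], city_names)
print(rows)
```

This mirrors the trade-off discussed in the thread: the expensive Hive join over the full fact table is avoided, at the cost of maintaining the id-to-name map outside the cube.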
