Re: is it expected when building a cube with join?

ShaoFeng Shi Tue, 16 Jun 2015 00:06:04 -0700

For your question: yes all the 6 columns should come from lookup table; It
will also work for query on fact table, kylin will smartly convert that;


2015-06-16 14:54 GMT+08:00 dong wang <[email protected]>:

> Shaofeng, still another question, if as what you said "[region_id,
> region_name, province_id, province_name, city_id, city_name]",then all of
> the 6 columns should come from the lookup table lk1 instead of fact table
> 'ft', therefore, will this hierarchy work for the fact table 'ft'?  and if
> we make join between fact table and lookup table, it is more reasonable to
> create the hierarchy based on the lookup table instead of fact talbe, isn't
> it?(the id columns also appear on the fact table, we can also create a
> hierarchy on the fact table as well)
>
> 2015-06-16 14:39 GMT+08:00 dong wang <[email protected]>:
>
> > pls ignore the last reply above
> >
> > 2015-06-16 14:26 GMT+08:00 dong wang <[email protected]>:
> >
> >> based on the 3rd question above, why we should keep a "STAR" schema
> >> instead of a "SNOW" schema? if we keep a "SNOW" schema, the 3rd problem
> >> should be gone since there is only "(region_id, region_name)" in the
> lookup
> >> talble lk_region?
> >>
> >> 2015-06-16 13:55 GMT+08:00 dong wang <[email protected]>:
> >>
> >>> Thanks shaofeng~
> >>> 1, you mentioned that all the PK columns should be included in the
> >>> SQL/Query, does it mean that all the PK columns should be only in the
> >>> "JOIN" parts or  be only in the "SELECT" parts, or either is OK as
> long as
> >>> it appears in the SQL statement?
> >>>
> >>> 2, suppose that all the PK columns are all in the SQL/Query based on
> the
> >>> answer of the 1st question above, which of the above approaches to
> create
> >>> derived dimension is the correct approach?  !!!
> >>>
> >>> 3, as we know that Kylin wants user to have a STAR schema model, then
> >>> take the following as an example:
> >>> ====================Hive tables========================
> >>> fact table: ft
> >>> region_id
> >>> m1
> >>> m2
> >>>
> >>>
> >>> lookup table: lk1
> >>> region_id
> >>> region_name
> >>> province_id
> >>> province_name
> >>> city_id
> >>> city_name
> >>>
> >>> as we can see that the fact table ft only has on region_id, thus when
> >>> defining a cube with join, we can only join "region_id" on fact table
> 'ft'
> >>> and region_id on lookup talbe 'lk1', suppose the primary key of lk1 are
> >>> compound keys(region_id + province_id + city_id), as tested just now,
> we
> >>> cannot build such a cube successfully due to a "duplicate key" error
> since
> >>> only region_id cannot determine one unique record for the lookup table
> >>> 'lk1', is it related to what you mentioned above?
> >>>
> >>> 2015-06-16 13:31 GMT+08:00 ShaoFeng Shi <[email protected]>:
> >>>
> >>>> I just realized that both approaches may not work as you expected;
> >>>>
> >>>> The "derived" columns are derived from foreign keys, in your case the
> >>>> fk/pk
> >>>> are composited keys: region_id + province_id + city_id; That's to say,
> >>>> only
> >>>> when all these columns appeared in your query, Kylin can derive the
> >>>> names,
> >>>> even if one region_id can determin the region name;
> >>>>
> >>>> Only if you're sure all these FKs always appeared in SQL, you can use
> >>>> the
> >>>> derived columns in this way; Otherwise, don't expect Kylin can smartly
> >>>> conclude the region name with just region_id, as region_id is not PK,
> >>>> there
> >>>> is no guarrantee on the uniqueness;
> >>>>
> >>>>
> >>>>
> >>>> 2015-06-16 12:00 GMT+08:00 dong wang <[email protected]>:
> >>>>
> >>>> > Approach 1:
> >>>> >
> >>>> > Dimension
> >>>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of
> these
> >>>> 3
> >>>> > columns come from fact table: ft)
> >>>> > Derived 1: region_id, region_name (NOTE: it comes from lookup table:
> >>>> lk1)
> >>>> > Derived 2: province_id, province_name(NOTE: it comes from lookup
> >>>> table:
> >>>> > lk1)
> >>>> > Derived 3: city_id, city_name(NOTE: it comes from lookup table: lk1)
> >>>> >
> >>>> > Approach 2
> >>>> > Dimension
> >>>> > Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of
> these
> >>>> 3
> >>>> > columns come from fact table: ft)
> >>>> > Derive 1: (region_name, province_name, city_name)  //all is in one
> >>>> derived
> >>>> > dimension, and which comes from lookup table: lk1
> >>>> >
> >>>> > I am not sure which of the above approaches is the correct way?
> >>>> please help
> >>>> > to confirm?
> >>>> >
> >>>> > 2015-06-16 11:12 GMT+08:00 Shi, Shaofeng <[email protected]>:
> >>>> >
> >>>> > > Yes it is expected; The ³derived² columns are those can be derived
> >>>> by
> >>>> > fks;
> >>>> > > So in the cube we only keep the *_ids in dimensions; Kylin will
> take
> >>>> > > snapshot for the lookup table and load them into memory, in
> >>>> run-time it
> >>>> > > can map the ids to the names;
> >>>> > >
> >>>> > >
> >>>> > > On 6/15/15, 4:34 PM, "dong wang" <[email protected]> wrote:
> >>>> > >
> >>>> > > >====================Hive tables========================
> >>>> > > >fact table: ft
> >>>> > > >region_id
> >>>> > > >province_id
> >>>> > > >city_id
> >>>> > > >
> >>>> > > >lookup table: lk1
> >>>> > > >region_id
> >>>> > > >region_name
> >>>> > > >province_id
> >>>> > > >province_name
> >>>> > > >city_id
> >>>> > > >city_name
> >>>> > > >
> >>>> > > >
> >>>> > > >====================Cube Design========================
> >>>> > > >
> >>>> > > >JOIN:   inner
> >>>> > > >
> >>>> > > >ft.region_id=lk1.region_id
> >>>> > > >and ft.province_id=lk1.province_id
> >>>> > > >and ft.city_id=lk1.city_id
> >>>> > > >
> >>>> > > >
> >>>> > > >Dimension
> >>>> > > >
> >>>> > > >Hierarhy 1:  region_id -> province_id -> city_id  (NOTE: all of
> >>>> these 3
> >>>> > > >columns come from fact table: ft)
> >>>> > > >Derived 1: region_id, region_name (NOTE: it comes from lookup
> >>>> table:
> >>>> > lk1)
> >>>> > > >Derived 2: province_id, province_name(NOTE: it comes from lookup
> >>>> table:
> >>>> > > >lk1)
> >>>> > > >Derived 3: city_id, city_name(NOTE: it comes from lookup table:
> >>>> lk1)
> >>>> > > >
> >>>> > > >
> >>>> > > >is it correct to create the derived dimensions?
> >>>> > > >
> >>>> > > >
> >>>> > > >====================Cube Building========================
> >>>> > > >Build the cube:
> >>>> > > >
> >>>> > > >when the first step, the log says as below:
> >>>> > > >
> >>>> > > >CREATE EXTERNAL TABLE IF NOT EXISTS
> >>>> > >
> >>>> >
> >>>>
> >kylin_intermediate_test_cube_1_20150101000000_20150601000000_635c985b_3212
> >>>> > > >_4dd5_9ffa_f100e4c323e1
> >>>> > > >(
> >>>> > > >test_cube_1_DT string
> >>>> > > >,test_cube_1_region_id int
> >>>> > > >,test_cube_1_province_id int
> >>>> > > >,test_cube_1_city_id int
> >>>> > > >,test_cube_1_m1 bigint
> >>>> > > >,test_cube_1_m2 bigint
> >>>> > > >)
> >>>> > > >ROW FORMAT DELIMITED FIELDS TERMINATED BY '\177'
> >>>> > > >STORED AS SEQUENCEFILE
> >>>> > > >
> >>>> > > >
> >>>> > > >we can see that "region_name", "province_name", "city_name" don't
> >>>> appear
> >>>> > > >in
> >>>> > > >the intermediate hive table created in the 1st step, instead, it
> >>>> only
> >>>> > > >contains "*_id" columns as listed above,  is it expected?
> >>>> > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>

Re: is it expected when building a cube with join?

Reply via email to