-- Henry Poras <[EMAIL PROTECTED]>
>> From what I remember, one of the main advantages of a normalized
>> database is
> a certain level of data integrity and flexibility. The integrity comes
> from foreign keys, lack of duplicate data, and generally a data design
> based on characteristics of the data. The flexibility comes from the
> ability to query in many different ways (you are not restricted as you
> are in a hierarchical db). Though I have never designed a data warehouse,
> I always thought that the integrity piece would be less of an issue.
> Since there are very few writes to a warehouse, there is less chance to
> foul up the data integrity as long as there are good controls on the
> load. Flexibility on querying a warehouse seems to also be possible using
> other design methods (fact tables, snowflake, ...)
Basic warehouse operations use an offline update with
read-only access to users. The offline update allows
for complete validation before the data is used (in
theory at least), which makes foreign keys less important
for maintaining consistency. The normal ETL cycle is
usually designed to reject data with missing keys in
the transform cycle anyway -- usually by placing the
offending items in a reject queue for later analysis.
Avoiding snowflakes makes any foreign key issues moot
in the dimensions: the ETL cycle will either merge
the data successfully or reject it. The fact table is
mostly key in the first place, with a small amount of
data hung off the side. Inserting the record requires
pre-merging all of the key information also, so at
insert time the records should be valid (or the ETL
code reworked).
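For example, a minimal sketch of that transform step, using
hypothetical names throughout (stg_sales for the staged load,
product_dim/store_dim for dimensions, sales_reject for the
reject queue):

  -- Route staged rows whose natural key has no dimension
  -- match into the reject queue for later analysis (each
  -- dimension would get a similar check):
  insert into sales_reject
  select s.*
  from   stg_sales s
  where  not exists ( select null
                      from   product_dim d
                      where  d.product_code = s.product_code );

  -- Surviving rows get their synthetic keys pre-merged, so
  -- the fact records are valid at insert time:
  insert into sales_fact ( sale_date, product_id, store_id,
                           qty, amount )
  select s.sale_date, p.product_id, t.store_id, s.qty, s.amount
  from   stg_sales s, product_dim p, store_dim t
  where  p.product_code = s.product_code
  and    t.store_code   = s.store_code;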
An RDBMS allows the same queries in both a star schema
and 3rd (or Boyce-Codd) normal form. The difference is that a
star schema is much simpler to query since the joins
are only 1-2 levels deep (dims:fact or dim:fact:dim
is as far as it goes). The joins tend to be faster
also since indexes are 1:1, the dims are shorter
tables and the fact is "narrow" enough that they
don't require excessive I/O to process down their
length.
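A typical star query (against the same hypothetical tables as
above) never joins deeper than dim:fact:dim:

  select p.category, t.region, sum ( f.amount )
  from   sales_fact f, product_dim p, store_dim t
  where  f.product_id = p.product_id
  and    f.store_id   = t.store_id
  and    p.category   = 'WIDGETS'
  group by p.category, t.region;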
In most cases a synthetic integer key is used for the
dims also. This is partly done to keep the fact table
narrow, since most of it is the composite key for each
fact. This tends to help indexes also.
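Sketched as DDL (hypothetical, matching the tables above): the
dim carries the synthetic key plus the descriptive columns,
while the fact is little more than its composite key:

  create table product_dim
  (
      product_id    number(10)   primary key,  -- synthetic key
      product_code  varchar2(20) not null,     -- natural key
      category      varchar2(30),
      description   varchar2(200)
  );

  create table sales_fact
  (
      sale_date   date       not null,
      product_id  number(10) not null references product_dim,
      store_id    number(10) not null,  -- references store_dim
      qty         number(8),            -- the small bit of data
      amount      number(12,2),         -- "hung off the side"
      primary key ( sale_date, product_id, store_id )
  );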
Depending on the database, joining dimensions across the
fact table is also more efficient. Red Brick was the
first with their "star index", which basically pre-joins
the dimension and fact records at load time. Informix
picked up the technology by purchasing Red Brick, and
DB2 got it when IBM purchased Informix; Oracle is
currently working on a similar concept (I think). This
basically trades off a fairly expensive operation done
once at load time for read efficiency. That works in a
warehousing environment where the ETL cycle can pre-sort
records for better load speed and the loads happen
offline on what is basically a dedicated system. After
that, read efficiency is the only thing going: no one
cares about update speed until tomorrow's load -- at
which point they don't really care about user efficiency
for a while.
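For what it's worth, Oracle's bitmap join index (new in 9i)
is one such pre-join structure; a sketch, assuming the
hypothetical tables above:

  -- The join to the dimension is paid once, at index
  -- maintenance (load) time, instead of at query time:
  create bitmap index sales_fact_cat_bjx
  on sales_fact ( product_dim.category )
  from sales_fact, product_dim
  where sales_fact.product_id = product_dim.product_id;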
The cycle works pretty well in most cases, the biggest
problem being the management of rolloff. If the data can
be segmented in "rolloff units" (e.g., time buckets) then
the process is simple. In Oracle, for example, with
locally managed partitions you can take them offline,
truncate, and drop them without any real pain. If the
unit of partitioning isn't the unit of rolloff -- not
hard, since the primary key has to start with the
partitioning field in nearly all cases -- then you end
up having to perform deletes. THAT can be a Real, True
Pain (tm) on a 3-4 TByte warehouse. It's even worse in
cases where the rolloff units are not uniformly
distributed, in which case Oracle will have to perform
a table scan to find the records. In most cases the
simplest fix is to force the rolloff value into the
primary key and be done with it; hopefully that doesn't
screw up the database.
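The painless path in SQL terms (hypothetical partition names,
assuming sales_fact is range-partitioned on sale_date by
month):

  -- Empty the oldest month and reclaim its space...
  alter table sales_fact truncate partition sales_199901;

  -- ...or remove the partition outright:
  alter table sales_fact drop partition sales_199901;

  -- Its dedicated tablespace can also be taken offline first:
  alter tablespace sales_ts_199901 offline;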
Data marting helps this in some ways, since the ETL
process can pre-generate the aggregates required for
marting. In that case a smaller database gets updated
with a smaller load each cycle and can usually be placed
online quicker. The marts can also hold data longer without
degrading performance, so their rolloff cycles tend to be
longer. Many databases will have pre-aggregated data
prepared for drilldowns already loaded into the fact
table (this is a requirement of some querying tools). In
that case the "rolloff" procedure consists of deleting
the more detailed period records (say daily totals) on
a monthly basis, leaving monthly aggregate values online
for perhaps 24 months. The delete cycle is less painful
in a -- much smaller -- data mart than the whole warehouse
and leaves users able to make the business-cycle queries
they need.
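The mart-side rolloff then reduces to a small, bounded delete.
A sketch, assuming a hypothetical mart table that tags each
row with its grain:

  -- Monthly job: drop the daily detail that has aged out,
  -- leaving the monthly aggregate rows online (the 3-month
  -- detail window here is just an assumption):
  delete from mart_sales_fact
  where  grain = 'DAY'
  and    sale_date < add_months ( trunc ( sysdate, 'MM' ), -3 );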
--
Steven Lembark 2930 W. Palmer
Workhorse Computing Chicago, IL 60647
+1 800 762 1582