Re: [SQL] Warehouse Schema

Richard Huxton Thu, 25 May 2006 02:56:08 -0700

Worky Workerson wrote:

I'm developing a schema for a large data warehouse (10 billion records) and
had a couple of questions about how to optimize it.  My biggest question is
on the design of the fact table and how much normalization I shouldactually
do of the data.


As much as you can and still afford the hardware to support it.

The data is going to be keyed by IP address, stored as a non-unique IP4
type.  Each record is also tagged with various attributes, such as a
category and a type.  Assuming that a category and type are VARCHAR, would

it make sense to normalize these out of the fact table into theirrespective

tables and key them by an INTEGER? I.e.

CREATE TABLE big_fact_table_A (
   identifier IP4,
   data1   BYTEA,
   data2   BYTEA,
   ...
   dataN  BYTEA,
   category    VARCHAR(16),
   type        VARCHAR(16)
);

... vs ...

CREATE TABLE big_fact_table_B (
   identifier IP4,
   data1   BYTEA,
   data2   BYTEA,
   ...
   dataN  BYTEA,
   category    INTEGER REFERENCES categories (category_id),
   type        INTEGER REFERENCES types (type_id)
);

Assuming categories and types are fixed sets this isn't reallynormalisation, you're just substituting one representation with another.That's assuming you enforce valid representations on table_A.

Of course, if the categories or types can change then you'll want to useforeign-keys. Try int2 rather than int4 even (although check themanuals for details to see if it'll pack them closer).

I figure that the normalized fact table should be quicker, as theinteger ismuch smaller than the varchar. On query, however, the table will needto be
joined against the two other tables (categories, types), but I still figure
that this is a win because the other tables are fairly small and shouldstay
resident in memory.  Is this reasoning valid?


That's the trade-off.

The downside to this (from my perspective) is that the data comes in the
form of big_fact_table_A and could be easily COPYed straight into the
table.  with big_fact_table_B it looks like I will have to do the "unJOIN"
in a script.  Also, I have separate installations of the warehouse (with
different data sources) and it will be difficult to share data between them

unless their categories/types tables are keyed with exactly the sameinteger

IDs which, as I don't directly control the other installations, is not
guaranteed.

Any suggestions to the above "problems"?

If you're comfortable with a little "C" coding, write your own type.That's really what you're trying to do here. Accept text input anddisplay text output but store in a compact form.


HTH
--
  Richard Huxton
  Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly

Re: [SQL] Warehouse Schema

Reply via email to