Re: FW: [HACKERS] Surrogate keys (Was: enums)

Dann Corbit Wed, 18 Jan 2006 17:34:51 -0800

Maybe it goes better into Advocacy or something, but I have found a quote by 
database big-wigs that I strongly disagree with:


From:
http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf

We have this.
"PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by 
the DBMS only if a user-defined primary key is not available.
Second generation systems support the notion of a primary key, which is a 
user-assigned unique identifier.  If a primary key exists for a collection that 
is known never to change, for example social security number, student 
registration number, or employee number, then no additional system-assigned UID 
is required. An immutable primary key has an extra advantage over a 
system-assigned unique identifier because it has a natural, human readable 
meaning. Consequently, in data interchange or debugging this may be an 
advantage.  If no primary key is available for a collection, then it is 
imperative that a system-assigned UID be provided. Because SQL supports update 
through a cursor, second generation systems must be able to update the last 
record retrieved, and this is only possible if it can be uniquely identified. 
If no primary key serves this purpose, the system must include an extra UID. 
Therefore, several second generation systems already obey this proposition. 
Moreover, as will be noted in Proposition 2.3, some collections, e.g. views, do 
not necessarily have system assigned UIDs, so building a system that requires 
them is likely to be proven undesirable. We close our discussion on Tenet 1 
with a final proposition that deals with the notion of rules."

This is a bad idea.

Let's take the example of a Social Security Number.

Not everyone has one:
http://www.ssa.gov/pubs/10002.html#how2

If people do have one, they can definitely change it.  If someone has stolen a 
SSN, then the wronged party is able to get their SSN changed:
http://101-identitytheft.com/ssn.htm
The odds of this happening are low, but if you cannot handle it, then the 
damage caused is considerable.
Now what happens if you want to have customers outside of the USA? {Don't 
worry, we'll never go global...}  I hope that my objections are very plain and 
obvious.

The primary key should be immutable, meaning that its value should not be 
changed during the course of normal operations of the database.  What natural 
key is immutable?  The answer is that such an attribute does not exist.  To use 
them for such a purpose is begging for trouble.

I saw the argument that there is a great volume of space wasted by adding a 
column that does not naturally occur in the data.  That argument is simply 
absurd.  Consider a database with 10 billion rows of data in it.  Each of those 
tables gets an 8 byte primary key added for every row, resulting in 80 GB 
consumed.  The cost of 80 GB is perhaps $200.  With a database that large 
(where the extra space consumed by an artificial key column has a cost that can 
easily be measured) the odds of a problem arising due to a natural column 
changing its value are huge.  The cost of such a tragedy is certainly more than 
the $200 pittance!

If there is an argument that we also have the parent key values propagated into 
the child tables as foreign keys, that argument has no merit.  The other 
attribute that would have been chosen would also be propagated.  And so (for 
instance) there is no savings to propagating a SSN field into child tables 
verses propagating an 8 byte integer.

I also saw an argument that the propagated ID values are confusing to 
end-users.  That is the fault of the database designer who game them a stupid 
name.  If they were things like InvoiceID and LineItemID then there will not be 
the same sort of confusion.  The meaning and purpose of the column is 
immediately apparent.  As an alternative, the ubiquitous OID name for a column 
on a table is also very transparent.  Of course, when it is used in a foreign 
key, it must be given a role name to avoid confusion in that case.

At any rate, the use of natural keys is a mistake made by people who have never 
had to deal with very large database systems.

IMO-YMMV.


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Leandro Guimarães Faria Corcete Dutra
> Sent: Wednesday, January 18, 2006 4:31 PM
> To: Jim C. Nasby
> Cc: [email protected]
> Subject: Re: FW: [HACKERS] Surrogate keys (Was: enums)
> 
> Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:
> > >
> > >   Forgive me my ignorance, but are ints inherently faster to compare
> than
> > >strings, or is it just an implementation detail?  Ideally, if this is
> so
> > >a fully data-independent system would create a hash behind the back of
> > >user in order to get performance.
> >
> > The CPU can do an integer comparison with one instruction; it can't do
> > that with a text string.
> 
>       OK.  Again, data independence should be the goal here.
> 
> 
> > >   OK, hardly a typical example.  As I think I left clear, my problem
> is
> > >not using surrogate keys, but using them by default, or even
> > >exclusively.
> >
> > No? It's certainly not uncommon to have tables with 100M+ rows.
> 
>       No, but neither are they *that* common.
> 
>       Certainly, lots of database have a few of them.  But then, they have
> dozens, hundreds, thousands of much smaller tables.
> 
> 
> > And keep
> > in mind that this applies to every row of every table that has foreign
> > keys. I'd bet it's actually common to save 1G or more with surrogate
> > keys in moderately sized databases.
> 
>       Only if you have quite some children, because otherwise, in the main
> tables, the surrogate keys add a field, an index and a sequence to an
> otherwise smaller table and index.
> 
> 
> > Of course, you do have to be intelligent here, too. The only key defined
> > on the table in my example is participant_id, project_id, date; there is
> > no surrogate key because there's no real reason to have one.
> 
>       Quite.
> 
> 
> > >> (In actuality, there isn't participant_name... participants are
> > >> identified by email address (not a great idea, but I wasn't around
> > >> when that was chosen). As you can imagine, email addresses are
> > >> substantially longer than 4 bytes. When we normalized email out of
> > >> that main table things got substantially faster. That was a number of
> > >> years ago, so the table was probably 15-25% of it's current size, but
> > >> it still made a huge difference.)
> > >
> > >   This isn't normalisation at all, as far as I understand it.  It is
> just
> >
> > I don't have the rules of normalization memorized enough to know what
> > form this breaks, but I'm 99% certain it breaks at least one of them.
> 
>       No, never.
> 
>       Normalisation is about eliminating redundancy and, therefore, update
> anomalies.  Making all the table dependent on only the keys and the
> whole keys, by projecting relations to eliminate entity mixups.
> 
>       What you mention is actually exposing an implementation detail,
> namely
> an integer that serves as a hash of the key.
> 
> 
> > Look at it this way: if someone wants to change their email address,
> > best case scenario is that you have cascading RI setup and it updates
> > thousands of rows in that table. Worst case scenario, you just de-linked
> > a whole bunch of data. But with a surrogate key, all you have to do is
> > update one row in one table and you're done.
> 
>       OK, if you have lots of linked data.  But most tables are really
> dead
> ends.
> 
> 
> > >that we don't have data independence... so you had to expose an
> > >implementation detail?
> >
> > Expose to what? The application? First, this is a pretty minor thing to
> > expose; second, if it's that big a concern you can completely hide it by
> > using a view.
> 
>       As someone said, you end up with ids everywhere, and no
> user-understandable data at all...
> 
> 
> > But the reality is, dealing with a numeric ID can be a heck of a lot
> > easier than an email address. Look at URLs that embbed one versus the
> > other for a good example.
> 
>       Again, implementation details... levels mixup.
> 
> 
> > >> Why should it? It's trivial to create views that abstract surrogate
> > >> keys out, and if you really want to you can even make the views
> > >> updatable. But here's two other things to consider:
> > >
> > >   These views, in heavy querying environments, can be prohibitive.
> >
> > "Normalize 'til it hurts; denormalize 'til it works."
> 
>       Lack of data implementation biting us again.
> 
> 
> > Yes, the added overhead of rules for updates/inserts/deletes could start
> > to add up in performance-critical code. But if performance is that
> > critical you're far more likely to run into other bottlenecks first. And
> > worst-case, you abstract behind a stored procedure that just has the
> > right queries hard-coded.
> >
> > As for select-only views you'll have a hard time showing any meaningful
> > performance penalty.
> 
>       Yet real user-defined data types could make it all much simpler.
> 
> --
> +55 (11) 5685 2219               xmpp:[EMAIL PROTECTED]
> +55 (11) 9406 7191                       Yahoo!: lgcdutra
> +55 (11) 5686 9607         MSN: [EMAIL PROTECTED]
> +55 (11) 4390 5383                      ICQ/AIM: 61287803
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to [EMAIL PROTECTED] so that your
>        message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Re: FW: [HACKERS] Surrogate keys (Was: enums)

Reply via email to