FW: [HACKERS] Surrogate keys (Was: enums)

Jim Nasby Wed, 18 Jan 2006 14:47:44 -0800

Ooops, fat-finger'd -hackers...

-----Original Message-----
Adding -hackers back to the list.

> From: Leandro Guimarães Faria Corcete Dutra
> Em Seg, 2006-01-16 às 12:52 -0600, Jim C. Nasby escreveu:
> > On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote:
> > > 
> > > For UPDATEs and INSERTs, the "proper" primary key also 
> needs to be  
> > > checked, but keys are used for more than just checking 
> uniqueness:  
> > > they're also often used in JOINs. Joining against a 
> single integer  
> > > I'd think it quite a different proposition (I'd think 
> faster in terms  
> > > of performance) than joining against, say, a text column or a  
> > > composite key.
> 
>       How different is that?

Comparing two ints is much, much faster than comparing two text fields. For a 
small number of comparisons, it doesn't matter. When you're joining tables 
together, it's a different story.

> > a) the optimizer does a really poor job on multi-column 
> index statistics
> 
>       Then it should eventually be fixed?

It's on the to-do, but it's not an easy nut to crack.

> > b) If each parent record will have many children, the space 
> savings from
> > using a surrogate key can be quite large
> 
>       Only where the surrogate is significantly smaller than 
> the natural?

#define significant

Here's a real-life example: the primary table for stats.distributed.net has 
about 120M rows. One field in that table (participant_id) links back to the 
participant table; it's an int. If instead we used participant_name and that 
averaged 8 characters in length, that would grow the main table by 1GB (8 chars 
takes 8 bytes instead of 4, plus there's the varlena header of 4 bytes). The 
machine that stats runs on has 4G of memory, so cutting 1G of wasted space out 
of that table helps quite a bit.

(In actuality, there isn't participant_name... participants are identified by 
email address (not a great idea, but I wasn't around when that was chosen). As 
you can imagine, email addresses are substantially longer than 4 bytes. When we 
normalized email out of that main table things got substantially faster. That 
was a number of years ago, so the table was probably 15-25% of it's current 
size, but it still made a huge difference.)

> > c) depending on how you view things, putting actual keys 
> all over the
> > place is denormalized
> 
>       How come?

See my other reply... :)

> > Generally, I just use surrogate keys for everything unless 
> performance
> > dictates something else.
> 
>       Shouldn't it be the other way round, for the user's sake?

Why should it? It's trivial to create views that abstract surrogate keys out, 
and if you really want to you can even make the views updatable. But here's two 
other things to consider:

In many cases you can't define a single field as a unique key. So you end up 
with having to add many extra keys to all your join clauses. Not very friendly, 
and prone to error.

Not every language has equal support for text comparisons (and in my 
experience, almost all real keys are mostly text).
--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

FW: [HACKERS] Surrogate keys (Was: enums)

Reply via email to