So when you create a clustered index on a database table, the point is
to make it possible to read a lot of records with some particular
value, or range of values, of some attribute quickly --- without
seeking --- or alternatively to make it possible to process all the
records in a certain order without seeking much.

The trouble is, you can only create a clustered index on one
attribute, at most.  (You can create a clustered index on a set of
attributes, but then, as with an ordinary compound index, records with
identical values for the second and later attributes are likely to be
scattered all over the place.)  And if you pick the wrong one, you get
remarkably bad performance for everything else, once your table gets
big.

There's a disk performance characteristic of interest here: the
product of the sustained data transfer rate and the latency.  Typical
values for current low-end disks --- 20MB/s * (8 ms seek + 5 ms
rotational latency) --- come to around a quarter of a megabyte.  I'll
call this the "bytes per seek" characteristic.
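A quick back-of-the-envelope check of that figure, using the same
numbers as above:

```python
# "Bytes per seek": sustained transfer rate times total access
# latency, using the low-end-disk figures above.
transfer_rate = 20e6           # 20 MB/s sustained transfer, in bytes/s
latency = 0.008 + 0.005        # 8 ms seek + 5 ms rotational latency

bytes_per_seek = transfer_rate * latency
print(bytes_per_seek)          # about 260000 bytes: a quarter megabyte
```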

When the amount of data you read or write sequentially in a single
location is small compared to the bytes per seek, the disk spends most
of its time seeking; when it's equal to the bytes per seek, the disk
divides its time half-and-half between seeking and reading or writing;
and when it's large compared to the bytes per seek, the disk spends
relatively little time seeking and therefore gets close to its peak
throughput.
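That three-way breakdown is just arithmetic; a sketch of it, assuming
each seek is followed by exactly one sequential read:

```python
def seek_fraction(read_size, bytes_per_seek=260e3):
    """Fraction of disk time spent seeking, assuming each seek is
    followed by one sequential read of read_size bytes (the read of
    bytes_per_seek bytes takes as long as the seek itself)."""
    return bytes_per_seek / (bytes_per_seek + read_size)

print(seek_fraction(26e3))     # ~0.91: small reads, mostly seeking
print(seek_fraction(260e3))    # 0.5: read equals bytes per seek
print(seek_fraction(2.6e6))    # ~0.09: large reads, near peak throughput
```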

Barring brain damage, clustered indices don't matter at all (for disk
time) when the entire table's data is small compared to the bytes per
seek --- there's no reason not to read the whole table into memory
once you've done the work of seeking to it, so it doesn't matter very
much what the order on disk is.  (It might matter for other levels of
the memory hierarchy, but I doubt it.)

But once your table starts to get large compared to the bytes per
seek, clustered indices can matter a lot, down to chunks of the size
of the bytes per seek.  The order of data inside bytes-per-seek-sized
chunks generally doesn't matter much, but the number of such chunks a
query has to access essentially determines the speed of the query.

There's an alternative to clustered indices for small numbers of
attributes.  You can think of each attribute as a spatial dimension,
and divide the multidimensional space up in a binary k-d tree of
bytes-per-seek-sized chunks: each node in the k-d tree has a partition
value on some attribute which divides the remaining items roughly half
and half.  If you have, say, two attributes you want to cluster on,
and your data is 64 times the bytes-per-seek number, you need 6 levels
of division; and you can read all the records with a particular value
for either attribute with about 8 seeks.  This compares poorly to the
single seek you'd need with a pure clustered index, but it compares
well to the 64 you'd (probably) need with a clustered index on the
other attribute.

This k-d tree technique gets less useful as the number of dimensions
grows.  If you cluster by 3 attributes, your 64-chunk tree is now
4x4x4 instead of 8x8, so you need about 16 seeks to find all the data
with some particular attribute --- only a factor of 2 faster than
reading all the data.  And these numbers I've been touting as average
cases are probably more like best cases; they could get knackered by
attributes with some individual values that account for a large
proportion of the data.
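The scaling with the number of clustered attributes can be sketched
like this, assuming (as in the examples above) that the levels of
division are shared evenly among the attributes:

```python
def seeks_per_query(levels, dims):
    """Chunks a query touches when it fixes one attribute's value,
    given 2**levels bytes-per-seek-sized chunks split round-robin
    across dims attributes: the queried attribute only benefits from
    about levels/dims of the splits."""
    return 2 ** levels // 2 ** (levels // dims)

print(seeks_per_query(6, 1))   # 1: ordinary clustered index
print(seeks_per_query(6, 2))   # 8: the two-attribute 8x8 example
print(seeks_per_query(6, 3))   # 16: three attributes, 4x4x4
```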

To take a larger number, if you have a gigabyte of data --- 4096 or so
times the bytes-per-seek number --- you can kd-cluster it 64x64,
16x16x16, 8x8x8x8, or 4x4x4 x 4x4x4, yielding 64, 256, 512, and 1024
expected seeks, respectively, or concretely, about 1.6, 6.4, 12.8, and
25.6 seconds (considering that each seek has an accompanying
equally-long read, so you get about 10MB/s on average.)  Reading the
entire gigabyte only takes 50 seconds.  (On my laptop, it's actually
about 80, but that's close enough.)
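Those gigabyte-scale figures follow from a quarter-megabyte chunk
taking about 25 ms --- the roughly 13 ms seek plus an equally long
read, i.e. about 10MB/s net:

```python
# Each chunk costs one seek plus one equally-long quarter-megabyte
# read, for a net throughput of about 10 MB/s.
chunk_time = 0.25e6 / 10e6     # seconds per chunk: 0.025

splits = {"64x64": 64, "16x16x16": 16 ** 2,
          "8x8x8x8": 8 ** 3, "4x4x4 x 4x4x4": 4 ** 5}
for shape, seeks in splits.items():
    # seeks touched by a query fixing one attribute's value
    print(shape, seeks, seeks * chunk_time)   # 1.6, 6.4, 12.8, 25.6 s
```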

You can roughly double the speed for one of the attributes you choose
by storing bytes-per-seek-sized blocks covering roughly the same range
of that attribute consecutively on disk, eliminating seeks between
them; or you can distribute the reduction in the number of seeks
across all the attributes by just storing the blocks in the order
produced by a tree traversal (so adjacent leaves are likely to sit
next to each other).  You can increase the speed bonus for that
attribute linearly, at the cost of slowing everyone else down
linearly, by using smaller blocks.  Reducing the blocks to one record
in size gives you ordinary clustered indexing.
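A hedged sketch of that last tradeoff, where block_factor is
bytes-per-seek divided by the block size (the function and its name
are mine, for illustration only):

```python
def relative_seeks(block_factor):
    """Seek counts relative to bytes-per-seek-sized blocks, when
    blocks a factor of block_factor smaller are stored consecutively
    in order of one favored attribute: that attribute's reads stay
    contiguous, so its seek count drops by block_factor, while every
    other attribute needs block_factor times as many seeks."""
    return 1 / block_factor, block_factor

print(relative_seeks(1))       # (1.0, 1): the baseline
print(relative_seeks(2))       # (0.5, 2): favored attribute ~2x faster,
                               # everyone else ~2x slower
```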
