Re: [PERFORM] Partitioning / Clustering

Mischa Sandberg Tue, 10 May 2005 15:12:25 -0700

Quoting Alex Stapleton <[EMAIL PROTECTED]>:

> This is why I mention partitioning. It solves this issue by storing 
> different data sets on different machines under the same schema.  
> These seperate chunks of the table can then be replicated as well for
> data redundancy and so on. MySQL are working on these things, but PG 
> just has a bunch of third party extensions, I wonder why these are  
> not being integrated into the main trunk :/ Thanks for pointing me to
> PGCluster though. It looks like it should be better than Slony at
> least.


Across a decade or two of projects, including creating a federated
database engine for Simba, I've become rather dubious of horizontal
partitions (across disks or servers), either to improve performance, or
just to scale up and not lose performance. [[The one exception is for
<emphasis> non-time-critical read-only</emphasis> systems, with
Slony-style replication.]]

The most successful high-volume systems I've seen have broken up
databases functionally, like a pipeline, where different applications
use different sections of the pipe. 

The highest-volume system I've worked on is Acxiom's gigantic
data-cleansing system. This is the central clearinghouse for every scrap
of demographic that can be associated with some North American,
somewhere. Think of D&B for 300M people (some dead). The volumes are
just beyond belief, for both updates and queries. At Acxiom, the
datasets are so large, even after partitioning, that they just
constantly cycle them through memory, and commands are executes in
convoys --- sort of like riding a paternoster.
..........
Anybody been tracking on what Mr Stonebraker's been up to, lately?
Datastream management. Check it out. Like replication, everybody
hand-rolled their own datastream systems until finally somebody else
generalized it well enough that it didn't have to be built from scratch
every time.

Datastream systems require practically no locking, let alone distributed
transactions. They give you some really strong guarantees on transaction
elapsed-time and throughput. 
.......
Where is this all leading? Well, for scaling data like this, the one
feature that you need is the ability of procedures/rules on one server
to perform queries/procedures on another. MSSQL has linked servers and
(blech) OpenQuery. This lets you do reasonably-efficient work when you
only deal with one table at a time. Do NOT try anything fancy with
multi-table joins; timeouts are unavoidable, and painful.

Postgres has a natural advantage in such a distributed server system:
all table/index stats are openly available through the SQL interface,
for one server to make rational query plans involving another server's
resources. God! I would have killed for that when I was writing a
federated SQL engine; the kluges you need to do this at arms-length from
that information are true pain.

So where should I go look, to see what's been done so far, on a Postgres
that can treat another PG server as a new table type?



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Re: [PERFORM] Partitioning / Clustering

Reply via email to