Hi list,

Here's a proposal for an idea that stole a good part of my night. I'll present the idea first, then two use cases with some rationale and a few details. Please note I won't be able to participate in any development effort associated with this idea, should such a thing happen!
The bare idea is to provide a way to 'attach' multiple storage facilities (say, volumes) to a given tablespace. Each volume could be attached in READ ONLY, READ WRITE or WRITE ONLY mode. You could mix RW and WO volumes in the same tablespace, but not RO with either W form, or so I think. It would be pretty handy to be able to add and remove volumes on a live cluster, and this could be a way to implement moving and extending tablespaces.

Use Case A: better read performance while keeping data write reliability

The first application of this multiple-volumes-per-tablespace idea is to keep a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW). PG would then be able to read from both volumes when serving read queries, and would have to fwrite()/fsync() both volumes for each write. Of course, write speed would be constrained by the slowest volume, but the quicker one could meanwhile absorb some amount of the read queries.

It would be neat if PG could account for each volume's relative speed in order to assign weights to the tablespace's volumes, and have the planner or executor spread read queries among volumes accordingly. For example, if a single query has a plan containing several full scans (of indexes and/or tables) in the same tablespace, those could be done on different volumes.

Use Case B: synchronous master/slave(s) replication

By using a distributed file system capable of being mounted from several nodes at the same time, we could have a configuration where a master node 'exports' a WO tablespace volume, and one or more slaves (depending on FS capabilities) configure it as a RO tablespace volume. PG would then have to cope with a RO volume: the data are not written by PG itself (from the local node's point of view), so some limitations would certainly apply. Would it be possible, for example, to add indexes to the data on slaves? I'd use the solution even without this, though...
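To make the proposed semantics concrete, here is a rough sketch in Python. Everything in it is illustrative and of my own invention, not any existing PostgreSQL API: the mode-mixing rule (no RO alongside writable volumes), weighted spreading of reads over the readable volumes, and the write path that targets every writable volume.

```python
import random

# Illustrative volume modes, mirroring the proposal's RO / RW / WO.
RO, RW, WO = "read-only", "read-write", "write-only"

class Volume:
    """A stand-in for one storage facility attached to a tablespace."""
    def __init__(self, name, mode, read_weight=1.0):
        self.name = name
        self.mode = mode
        self.read_weight = read_weight  # e.g. derived from observed speed
        self.store = []                 # stand-in for the on-volume data

class Tablespace:
    def __init__(self, volumes):
        modes = {v.mode for v in volumes}
        # Proposed rule: an RO volume cannot coexist with any W form.
        if RO in modes and modes & {RW, WO}:
            raise ValueError("cannot mix a RO volume with writable ones")
        self.volumes = volumes

    def pick_read_volume(self):
        # Spread reads over readable volumes, biased by weight, so the
        # faster volume absorbs more of the read traffic.
        readable = [v for v in self.volumes if v.mode in (RO, RW)]
        weights = [v.read_weight for v in readable]
        return random.choices(readable, weights=weights)[0]

    def write(self, block):
        # Write goes to every writable volume; in the real proposal an
        # error on any of them would fail the whole write (as if it were
        # a local disk failure), keeping all W volumes in sync.
        for v in self.volumes:
            if v.mode in (RW, WO):
                v.store.append(block)  # stand-in for fwrite() + fsync()

ts = Tablespace([Volume("tmpfs", RW, read_weight=10.0),
                 Volume("disk", RW, read_weight=1.0)])
ts.write(b"page-0")
fast_hits = sum(ts.pick_read_volume().name == "tmpfs" for _ in range(1000))
```

With a 10:1 weight ratio the tmpfs volume should serve roughly 90% of the sampled reads, while both volumes hold identical data after the write.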
When the master/slave link is broken, the master can no longer write to the tablespace, as if it were a local disk failure of some sort, so this should prevent nasty desync problems: data is written to all W volumes, or not written at all.

I realize this proposal is a first draft of work to be done, and that I won't be able to do much more than draft the idea. This mail is sent to the hackers list in the hope that someone there will find it worth considering and polishing...

Regards, and thanks for the good work ;)
-- 
Dimitri Fontaine