Hi list,

Here's a proposal for an idea that stole a good part of my night. I'll present the idea first, then two use cases with some rationale and a few details. Please note I won't be able to participate in any development effort associated with this idea, should such a thing happen!
The bare idea is to provide a way to 'attach' multiple storage facilities (say, volumes) to a given tablespace. Each volume could be attached in READ ONLY, READ WRITE or WRITE ONLY mode. You could mix RW and WO volumes in the same tablespace, but not RO with either W form, or so I think. It would be pretty handy to be able to add and remove volumes on a live cluster, and this could be a way to implement moving and extending tablespaces.

Use Case A: better read performance while keeping data write reliability

The first application of this multiple-volumes-per-tablespace idea is to keep a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW). PG would then be able to read from both volumes when serving read queries, and would have to fwrite()/fsync() both volumes for each write. Of course, write speed would be constrained by the slowest volume, but the quicker one could meanwhile absorb some amount of the read queries.

It would be neat if PG could account for each volume's relative speed in order to assign weights to the tablespace's volumes, and have the planner or executor spread read queries among volumes accordingly. For example, if a single query has a plan containing several full scans (of indexes and/or tables) in the same tablespace, those could be done on different volumes.

Use Case B: synchronous master/slave(s) replication

By using a distributed file system capable of being mounted from several nodes at the same time, we could have a configuration where a master node 'exports' a WO tablespace volume, and one or more slaves (depending on FS capabilities) configure it as a RO tablespace volume. PG would then have to cope with a RO volume: the data are not written by PG itself (from the local node's point of view), so some limitations would certainly apply. Would it be possible, for example, to add indexes to the data on slaves? I'd use the solution even without this, though...
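To make the proposed semantics concrete, here is a rough sketch in Python. Everything in it is illustrative and of my own invention, not any existing PostgreSQL API: the mode-mixing rule (no RO alongside writable volumes), weighted spreading of reads over the readable volumes, and the write path that targets every writable volume.

```python
import random

# Illustrative volume modes, mirroring the proposal's RO / RW / WO.
RO, RW, WO = "read-only", "read-write", "write-only"

class Volume:
    """A stand-in for one storage facility attached to a tablespace."""
    def __init__(self, name, mode, read_weight=1.0):
        self.name = name
        self.mode = mode
        self.read_weight = read_weight  # e.g. derived from observed speed
        self.store = []                 # stand-in for the on-volume data

class Tablespace:
    def __init__(self, volumes):
        modes = {v.mode for v in volumes}
        # Proposed rule: an RO volume cannot coexist with any W form.
        if RO in modes and modes & {RW, WO}:
            raise ValueError("cannot mix a RO volume with writable ones")
        self.volumes = volumes

    def pick_read_volume(self):
        # Spread reads over readable volumes, biased by weight, so the
        # faster volume absorbs more of the read traffic.
        readable = [v for v in self.volumes if v.mode in (RO, RW)]
        weights = [v.read_weight for v in readable]
        return random.choices(readable, weights=weights)[0]

    def write(self, block):
        # Write goes to every writable volume; in the real proposal an
        # error on any of them would fail the whole write (as if it were
        # a local disk failure), keeping all W volumes in sync.
        for v in self.volumes:
            if v.mode in (RW, WO):
                v.store.append(block)  # stand-in for fwrite() + fsync()

ts = Tablespace([Volume("tmpfs", RW, read_weight=10.0),
                 Volume("disk", RW, read_weight=1.0)])
ts.write(b"page-0")
fast_hits = sum(ts.pick_read_volume().name == "tmpfs" for _ in range(1000))
```

With a 10:1 weight ratio the tmpfs volume should serve roughly 90% of the sampled reads, while both volumes hold identical data after the write.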
When the master/slave link is broken, the master can no longer write to the tablespace, as if it were a local disk failure of some sort, so this should prevent nasty desync problems: data is written to all W volumes, or not written at all.

I realize this proposal is a first draft of work to be done, and that I won't be able to do much more than draft the idea. This mail is sent to the hackers list in the hope that someone there will find it worth considering and polishing...

Regards, and thanks for the good work ;)
-- 
Dimitri Fontaine