On Mon, Mar 27, 2017 at 12:12 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>>> 1. Segments are what buffile.c already calls the individual
>>> capped-at-1GB files that it manages.  They are an implementation
>>> detail that is not part of buffile.c's user interface.  There seems
>>> to be no reason to change that.
>>
>> After reading your next email I realised this is not quite true:
>> BufFileTell and BufFileSeek expose the existence of segments.
>
> Yeah, that's something that tuplestore.c itself relies on.
>
> I always thought that the main practical reason why we have BufFile
> multiplex 1GB segments concerns use of temp_tablespaces, rather than
> considerations that matter only when using obsolete file systems:
>
> /*
>  * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
>  * The reason is that we'd like large temporary BufFiles to be spread across
>  * multiple tablespaces when available.
>  */
>
> Now, I tend to think that most installations that care about
> performance would be better off using RAID to stripe their one temp
> tablespace file system. But I suppose this still makes sense when you
> have a number of file systems that happen to be available and disk
> capacity is the main concern. PHJ uses one temp tablespace per worker,
> which I further suppose might not be as effective in balancing disk
> space usage.
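To make the trade-off Peter describes concrete, here is a toy lockstep
model contrasting the two assignment strategies (rotating through
tablespaces segment-by-segment vs. pinning each participant to one
tablespace). This is a standalone sketch, not PostgreSQL code; the
function name and parameters are invented for illustration, and it
assumes the worst case where all workers write their Nth segment at
the same time from the same starting offset:

```python
# Toy model: how many distinct tablespaces receive writes at each
# "time step" when n_workers write segments in lockstep?
# Hypothetical illustration only -- not PostgreSQL code.

def busy_tablespaces_per_step(n_workers, n_tablespaces, n_segments, strategy):
    """Return, for each segment index (one lockstep time step), the
    number of distinct tablespaces that receive a write."""
    busy = []
    for seg in range(n_segments):
        targets = set()
        for worker in range(n_workers):
            if strategy == "by_segment":
                # Every worker rotates segment-by-segment from the same
                # offset, so in lockstep they all pile onto one tablespace.
                targets.add(seg % n_tablespaces)
            elif strategy == "by_participant":
                # Each worker sticks to a tablespace chosen by its id,
                # so min(n_workers, n_tablespaces) stay busy every step.
                targets.add(worker % n_tablespaces)
        busy.append(len(targets))
    return busy

# 4 workers, 4 tablespaces, 3 segments each, written in lockstep:
print(busy_tablespaces_per_step(4, 4, 3, "by_segment"))      # [1, 1, 1]
print(busy_tablespaces_per_step(4, 4, 3, "by_participant"))  # [4, 4, 4]
```

Under these (deliberately pessimistic) assumptions, segment-based
rotation leaves three of the four arrays idle at every step, which is
the phasing effect at issue; per-participant assignment keeps all four
busy, and can only idle arrays when there are fewer participants than
tablespaces.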
I was thinking about IO bandwidth balance rather than size. If you
rotate through tablespaces segment-by-segment, won't you be exposed to
phasing effects that could leave disk arrays idle for periods of time?
Whereas if you assign them to participants, you can only get idle
arrays if you have fewer participants than tablespaces.

This seems like a fairly complex subtopic and I don't have a strong
view on it. Clearly you could rotate through tablespaces on the basis
of participant, partition, segment, some combination, or something
else. Of the options I considered when I wrote it that way, doing it
by participant seemed to me to be the least prone to IO imbalance
caused by phasing effects (= segment based) or data distribution (=
partition based).

Like you, I also tend to suspect that people would be more likely to
use RAID-type technologies to stripe things like this for both
bandwidth and space reasons these days. Tablespaces seem to make more
sense as a way of separating different classes of storage
(fast/expensive, slow/cheap etc), not as an IO or space striping
technique. I may be way off base there, though.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers