On Thu, 25 Apr 2002, Bruce Momjian wrote:

> Actually, this brings up a different point. We use 8k blocks now
> because at the time PostgreSQL was developed, it used BSD file systems,
> and those prefer 8k blocks, and there was some concept that an 8k write
> was atomic, though with 512 byte disk blocks, that was incorrect. (We
> knew that at the time too, but we didn't have any options, so we just
> hoped.)
MS SQL Server has an interesting way of dealing with this. They have a
"torn" bit in each 512-byte chunk of a page, and this bit is set the
same for each chunk. When they are about to write out a page, they first
flip all of the torn bits and then do the write. If the write does not
complete, due to a system crash or whatever, this can be detected later
because the torn bits won't match across the entire page.

> Now, with larger RAM and disk sizes, it may be time to consider larger
> page sizes, like 32k pages. That reduces the granularity of the cache,
> but it may have other performance advantages that would be worth it.

It really depends on the block size your underlying layer is using.
Reading less than that is never useful, as you pay for the entire block
anyway. (E.g., on an FFS filesystem with 8K blocks, the OS always reads
8K even if you ask for only 4K.) On the other hand, reading more does
have a tangible cost, as you saw from the benchmark I posted: reading
16K on my system cost 20% more than reading 8K, and used twice the
buffer space. If I'm doing lots of really random reads, this would
result in a performance loss (due to doing more I/O, and having less
chance that the next item I want is in the buffer cache).

For some reason I thought we had the ability to change the block size
that postgres uses on a table-by-table basis, but I can't find anything
in the docs about that. Maybe it's just because I saw some support in
the code for it. This feature would be a nice addition for those cases
where a larger block size would help. But I think that 8K is a pretty
good default, and that 32K blocks would result in a quite noticeable
performance reduction for apps that do a lot of random I/O.

> What people are actually suggesting with the read-ahead for sequential
> scans is basically a larger block size for sequential scans than for
> index scans. While this makes sense, it may be better to just increase
> the block size overall.
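To make the torn-bit scheme above concrete, here's a toy simulation: each 512-byte chunk of an 8K page carries a torn bit (modeled here as the low bit of the chunk's last byte), a write flips every torn bit before writing chunk-by-chunk, and an incomplete write leaves the bits disagreeing. The chunk layout and helper names are illustrative only, not SQL Server's actual on-disk format.

```python
CHUNK = 512
PAGE = 8192
NCHUNKS = PAGE // CHUNK  # 16 chunks per 8K page

def write_page(disk, page, fail_after=None):
    """Write a page chunk-by-chunk, setting each chunk's torn bit to the
    flip of the previous write's bit. If fail_after is set, simulate a
    crash after that many chunks have reached disk."""
    new_bit = (disk[CHUNK - 1] & 1) ^ 1  # flip the current on-disk torn bit
    for i in range(NCHUNKS):
        if fail_after is not None and i >= fail_after:
            return  # simulated power failure: rest of the page never lands
        chunk = bytearray(page[i * CHUNK:(i + 1) * CHUNK])
        chunk[-1] = (chunk[-1] & 0xFE) | new_bit  # stamp this chunk's torn bit
        disk[i * CHUNK:(i + 1) * CHUNK] = chunk

def page_is_torn(disk):
    """A page is torn if its chunks disagree on the torn bit."""
    bits = {disk[(i + 1) * CHUNK - 1] & 1 for i in range(NCHUNKS)}
    return len(bits) > 1

disk = bytearray(PAGE)   # on-disk page, torn bits all 0 initially
page = bytes(PAGE)       # new page image to write out

write_page(disk, page)                 # complete write: every torn bit flips
assert not page_is_torn(disk)

write_page(disk, page, fail_after=5)   # crash mid-write: only 5 chunks flip
assert page_is_torn(disk)
```

Note that this catches a partial write without storing a per-page checksum: recovery only has to check that all sixteen bits agree.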
I don't think so, because the smaller block size is definitely better
for random I/O.

cjs
--
Curt Sampson <[EMAIL PROTECTED]> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC