On Thu, Oct 30, 2008 at 2:06 PM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:
> Thanks guys,
> Lots of info here that I didn't know about! Since I have one of the latest
> Opensolaris builds, I guess the write throttle feature is already in there.
> Sadly, the blog doesn't say what build has it included.
>

If I recall correctly, it went in at about build 89 or so (I think the
bottom of the link I provided has a comment to that effect). So it's in
there now, but not in OpenSolaris 2008.05.

>
> For writes, I do everything synchronized because we really need a
> consistent database on disk. We can see that during large inserts, the
> intent log is used a lot.
>

The DB synchronizes the WAL log automatically, and the table and index data
are written non-synchronously until the end of a checkpoint, at which point
sync is called on them. This keeps things consistent on disk. With ZFS, each
block written is always consistent, with a checksum kept in the parent block.
There are no partial page writes, ever. In theory, you can disable full page
writes on the WAL log if there is a bottleneck there, since ZFS guarantees a
fully transactional, consistent state of the file system even if you have a
RAID controller or hardware failure that causes a partial write. But WAL log
activity is probably not your bottleneck, so turning off full page writes on
the WAL log is not necessary. (A sketch of that setting is at the end of
this message.)

>
> What I'm going to be testing is a smaller shared_buffers value and a
> large ARC cache, and exactly the other way around.
>
> Another question: since we have huge tables with hundreds of millions of
> rows, we partitioned the database (it actually is creating the partitions
> dynamically now on inserts, with very good performance :D ), but the
> question is: is the size of the partitions important for the memory
> parameters in the config file? How can we determine the optimal size of
> the partitions? Obviously, when doing selects, you want those to need only
> a single partition, for speed. At the moment, that is the case for the
> majority of situations. But there might be some other things to think
> about...
>
> Kind regards,
>
> Christiaan
>
>
> On Oct 30, 2008, at 7:27 PM, Scott Carey wrote:
>
> On Thu, Oct 30, 2008 at 9:55 AM, Joshua D. Drake <[EMAIL PROTECTED]> wrote:
>
>> On Thu, 2008-10-30 at 09:46 -0700, Scott Carey wrote:
>>
>> > Remember that PostgreSQL doesn't cache anything on its own, so
>> > if you do want to hit disk it has to be in file cache.
>> >
>> > By my understanding, this is absolutely false. Postgres caches pages
>> > from tables/indexes in shared_buffers. You can make this very large if
>> > you wish.
>>
>> You can make it very large with a potentially serious performance hit.
>> It is very expensive to manage large amounts of shared buffers. It can
>> also nail your IO on checkpoint if you are not careful (even with
>> checkpoint smoothing). You are correct that I did not explain what I
>> meant very well, because shared buffers are exactly that, shared
>> buffers.
>
> You can slam your I/O by having too large of either OS file cache or
> shared_buffers, and you have to tune both.
> In the case of large shared_buffers you have to tune Postgres, and
> especially the background writer and checkpoints.
> In the case of a large OS cache, you have to tune parameters to limit the
> amount of dirty pages there and force writes out smoothly.
> Both layers attempt to delay writes for their own, often similar reasons,
> and suffer when a large sync comes along with lots of dirty data.
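To make that concrete, here is a rough sketch of the knobs involved on both
layers. Every value below is only an illustrative assumption, not a
recommendation for this hardware:

    # postgresql.conf (8.3) -- spread checkpoint and background writes out
    checkpoint_segments = 64
    checkpoint_timeout = 10min
    checkpoint_completion_target = 0.8   # the "checkpoint smoothing" mentioned above
    bgwriter_lru_maxpages = 500          # trickle dirty buffers out between checkpoints

    # Linux page cache equivalent (sysctl) -- limit dirty pages in the OS cache
    vm.dirty_background_ratio = 5        # start background writeback early
    vm.dirty_ratio = 10                  # cap before foreground writes are forced to flush

The idea is the same on both layers: push dirty data out early and steadily
so that a checkpoint or sync never finds a huge backlog waiting.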
>
> Recent ZFS changes have been made to limit this
> (http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle); in earlier
> ZFS versions, this is what usually killed databases -- ZFS in some
> situations would delay writes too long (even if "long" is 5 seconds) and
> get in trouble. This still has to be tuned well, combined with good
> checkpoint tuning in Postgres as you mention. For Linux, there are similar
> issues that have to be tuned on many kernels, or up to 40% of RAM can fill
> with dirty pages not written to disk.
>
> Letting the OS do it doesn't get rid of the problem; both levels of cache
> share very similar issues with large sizes and dirty pages followed by a
> sync.
>
> The buffer cache in shared_buffers is a lot more efficient for large
> scanning queries -- a select count(*) test will be CPU bound whether it
> comes from shared_buffers or the OS page cache, and in the former case I
> have seen it execute up to 50% faster than the latter, by avoiding calling
> out to the OS to get pages, purely as a result of less CPU used.
>
>> However that isn't the exact same thing as a "cache", at least as I was
>> trying to describe it. Shared buffers are used to keep track of pages
>> (as well as some other stuff) and their current status. That is not the
>> same as caching a relation.
>>
>> It is not possible to pin a relation to memory using PostgreSQL.
>> PostgreSQL relies on the operating system for that type of caching.
>>
>
> The OS can't pin a relation either; from its point of view it's all just a
> bunch of disk data blocks, not relations -- so it is all roughly
> equivalent. The OS can do a bit better job at data prefetch on sequential
> scans or other predictable seek sequences (ARC stands for Adaptive
> Replacement Cache) than Postgres currently does (no intelligent prefetch
> in Postgres AFAIK).
>
> So I apologize if I made it sound like Postgres cached the actual
> relation; it's just pages -- but it is basically the same thing as the OS
> cache, only kept in process, closer to the code that needs it. It's a
> cache that prevents disk reads.
>
> My suggestion for the OP is to try it both ways, and see what is better
> for his workload / OS / hardware combination.
>
>> Joshua D. Drake
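As a footnote to the full page writes and ARC discussion above, a minimal
sketch of the two settings mentioned -- again, the values and sizes are only
illustrative assumptions:

    # postgresql.conf -- only safe because ZFS never leaves a torn/partial
    # page on disk; leave this on for any filesystem or controller that can
    # partially write a block
    full_page_writes = off

    * /etc/system (Solaris) -- cap the ZFS ARC so it and shared_buffers do
    * not compete for the same RAM; 0x100000000 is 4 GB, purely an example
    set zfs:zfs_arc_max = 0x100000000

The /etc/system change takes effect at the next reboot; full_page_writes can
be changed with a config reload.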