On Fri, May 13, 2011 at 3:48 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> what's changed:
> *) as advertised, i'm no longer bothering to cache invalid bits. hint
> bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
> will work as they have always done.
> *) all the tuple visibility routines are now participating in the cache
> *) xmax commit bits are now being cached, not just xmin. this
> prevents hint bit i/o following deletes.
> *) juggled the cache interaction a bit so the changes to the
> visibility routines could be a lot more surgical. specifically, I
> reverted SetHintBits() to return void and put the cache adjustment
> inside 'SetHintBitsCache()' which sets the bits, dirties the page, and
> adjusts the cache. SetHintBitsNonDirty() only sets the bits without
> dirtying the page, and is called when you get a cache hit.
> *) various minor cleanups; decided to keep pageindex as type
> TransactionId since that's what clog does.
> *) made a proper init function, put cache data into
> CacheMemoryContext, and moved the cache data pointer references into
> the page structure
>
> performance testing done:
> *) select-only pgbench and standard i/o pgbench tests don't show a
> performance difference beyond statistical noise.
>
> The test I still need to do, a cache and clog thrashing test, I haven't
> found a clean way to put together yet. I'm really skeptical that any
> problems will show up there. That said, I am satisfied the patch does
> what it is supposed to do -- eliminates hint bit i/o for cases where it
> is a big headache, without penalizing other cases. Anyone who would
> like to comment on the technique or other points of the patch, please
> do so (otherwise it's time for the 'fest).
OK, so I read through this patch. I agree with Simon's comment on the other thread that it's probably not ready for commit right at the moment, but I also want to offer a few other thoughts. The patch fails to conform to our coding style in a variety of ways, but I don't want to get bogged down in that at this point. The first question we need to answer here is: What is this patch doing? And do we want that?

With regard to the first question, it seems to me that the patch is doing two separate but related things.

1. It allocates four 8kB pages in backend-local memory, each of which can store one bit for each XID in a 64K range. The bit is set if the XID is known to be committed. If we find this bit set, then we needn't consult CLOG. This reduces CLOG traffic, at the cost of potentially doing more work in the case where the cache misses. Every so often, we reconsider which XID ranges should be represented in our cache and consider replacing an existing page in the cache with some other one.

2. When we probe the cache and hit, we set the hint bit on the tuple *without dirtying the page*. Thus, the hint bit will only get written back to disk if the page is dirtied for some other reason. This will frequently reduce the amount of write I/O generated by large table scans, at the cost of possibly generating additional work for other backends that try to read this data later.

Now, the interesting thing about the patch is that the downsides of #2 are considerably mitigated by #1. If we assume, for the sake of argument, that the backend-local cache is really, really fast, and that the cache replacement policy is sound, then one is just as good as the other, and the only time we need the hint bits on disk is when the cache overflows -- which just so happens to be exactly when the patch forces them to be written out to disk. That's pretty clever. The exact way this is implemented is a little bit grotty, but I feel like it can work.
So I'm inclined to think that, at 10,000 feet, if we can convince ourselves that the basic idea of sticking a cache in here is OK, then the additional refinement of declining to dirty the page when the cache, rather than CLOG, tells us that the XID is committed is probably OK too.

With respect to the cache itself, the part that concerns me most is the cache replacement algorithm. It's obvious that straight LRU can't work here, since that could easily result in horrible thrashing behavior. But it's not clear to me that the actual algorithm, which considers evicting a page after every 100 misses, is all that great either. Each page stores 64K transaction IDs, and after being in cache for a while, you might have 25% or 50% of those bits set, which is quite an asset. Now you hit a run of 100 tuples with some XID not represented in that cache, and you chuck that page out the window and replace it with a page that has just that one bit set. Then you hit another run of 100 tuples with XIDs that would have hit the old page, and you flip back; but now you've lost all those valuable bits you had set. I'm not sure how likely that is to happen in real life, but it certainly seems like it could be painful if it did. The cache replacement algorithm is not cheap, either.

Rather than just fiddling with the thresholds, it would be nice to use some sort of algorithm here that would be inherently resistant to cache thrashing. For example, if we always just cached the four most recent pages of XIDs, and only ever replaced the oldest page we had in the cache with one newer than any page we had in the cache, that would make cache thrashing basically impossible. Fairly obviously, that particular algorithm is probably not a good idea, because it makes it impossible for us to adapt the cache contents to the data that is actually in the table, so we need something a bit smarter.
It might be worth doing a search of the academic literature on cache replacement policies to see if anyone else has come up with a good algorithm for this kind of situation. Just to be clear, I'm not saying your algorithm is definitely bad; just that it seems possible that it might be bad, and that it feels like there's not a lot of justification for the particular method you've chosen rather than anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers