On Wed, Nov 17, 2010 at 3:35 PM, Andres Freund <and...@anarazel.de> wrote:
>> The customer is always right, but the informed customer makes better
>> decisions than the uninformed customer. This idea, as proposed, does
>> not work. If you only include dirty buffers at the final checkpoint
>> before shutting down, you have no guarantee that any buffers that you
>> either didn't write or didn't fsync previously are actually on disk.
>> Therefore, you have no guarantee that the table data is not corrupted.
>> So you really have to decide between including the unlogged-table
>> buffers in EVERY checkpoint and not ever including them at all. Which
>> one is right depends on your use case.
> How can you get a buffer which was not written out *at all*? Do you
> want to force all such pages to stay in shared_buffers? That sounds
> quite a bit more complicated than what you proposed...
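(To put that trade-off in code form, here is a minimal sketch of a
checkpoint pass over the buffer pool. Every name and type in it is
invented for illustration; this is not actual PostgreSQL source. The
whole debate above is the value of include_unlogged: true means every
checkpoint pays I/O for disposable data, false means nothing ever
guarantees those pages reach disk.)

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Invented stand-in for a buffer-manager slot; not PostgreSQL's. */
    typedef struct
    {
        bool dirty;     /* page changed since it was last written */
        bool unlogged;  /* page belongs to an unlogged table */
    } buffer_t;

    /* Stub for "write the page out and remember to fsync its file". */
    static void
    flush_buffer(buffer_t *buf)
    {
        buf->dirty = false;
    }

    static void
    checkpoint_scan(buffer_t *bufs, size_t n, bool include_unlogged)
    {
        for (size_t i = 0; i < n; i++)
        {
            if (!bufs[i].dirty)
                continue;       /* clean pages need no checkpoint I/O */
            if (bufs[i].unlogged && !include_unlogged)
                continue;       /* skipped: may never reach disk */
            flush_buffer(&bufs[i]);
        }
    }

    int
    main(void)
    {
        buffer_t pool[] = {{true, false}, {true, true}, {false, true}};

        checkpoint_scan(pool, 3, false);    /* skips the unlogged page */
        printf("unlogged page still dirty: %d\n", pool[1].dirty);
        return 0;
    }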
Oh, you're right. We always have to write buffers before kicking them
out of shared_buffers, but if we don't fsync them we have no guarantee
they're actually on disk.

>> For example, consider the poster who said that, when this feature is
>> available, they plan to try ripping out their memcached instance and
>> replacing it with PostgreSQL running unlogged tables. Suppose this
>> poster (or someone else in a similar situation) has a 64 GB machine
>> and is currently running a 60 GB memcached instance on it, which is
>> not an unrealistic scenario for memcached. Suppose further that he
>> dirties 25% of that data each hour. memcached is currently doing no
>> writes to disk. When he switches to PostgreSQL and sets
>> checkpoint_segments to a gazillion and checkpoint_timeout to the
>> maximum, he's going to start writing 15 GB of data to disk every hour
>> - data which he clearly doesn't care about losing, or preserving
>> across restarts, because he's currently storing it in memcached. In
>> fact, with memcached, he'll not only lose data at shutdown - he'll
>> lose data on a regular basis when everything is running normally. We
>> can try to convince ourselves that someone in this situation will not
>> care about needing to get 15 GB of disposable data per hour from
>> memory to disk in order to have a feature that he doesn't need, but I
>> think it's going to be pretty hard to make that credible.
> To really support that use case we would first need to make
> shared_buffers properly scale to 64GB - which unfortunately, in my
> experience, is not yet the case.

Well, that's something to aspire to. :-)

> Also, see the issues in the previous paragraph - I have severe doubts
> you can support such a memcached scenario with pg. Either you spill to
> disk if your buffers overflow (fine with me) or you need to throw away
> data like memcached does. I doubt there is a sensible implementation
> of the latter in pg.
>
> So you will have to write to disk at some point...

I agree that there are difficulties, but again, doing checkpoint I/O
for data that the user was willing to throw away is going in the wrong
direction.

>> Third use case. Someone on pgsql-general mentioned that they want to
>> write logs to PG, and can abide losing them if a crash happens, but
>> not on a clean shutdown and restart. This person clearly shuts down
>> their production database a lot more often than I do, but that is OK.
>> By explicit stipulation, they want the survive-a-clean-shutdown
>> behavior. I have no problem supporting that use case, provided they
>> are willing to take the associated performance penalty at checkpoint
>> time, which we don't know because we haven't asked, but I'm fine with
>> assuming it's useful even though I probably wouldn't use it much
>> myself.
> Maybe I am missing something - but why does this imply we have to
> write data at checkpoints?
> Just fsyncing every file belonging to a persistently-unlogged (or
> whatever sensible name anyone can come up with) table is not
> prohibitively expensive - in fact, doing that on a local $PGDATA of
> approx 300GB with loads of tables takes less than 15s on a system with
> a hot inode/dentry cache and no dirty files.
> (Just `find $PGDATA -print0 | xargs -0 fsync_many_files`, with
> fsync_many_files being a tiny C program that does
> posix_fadvise(POSIX_FADV_DONTNEED) on all files and then fsyncs every
> one.)
> The assumption of a hot inode cache is realistic, I think.

Hmm.
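The source of fsync_many_files isn't shown, but a plausible
reconstruction of a program matching that description - everything
beyond the fadvise-then-fsync structure is guessed - might look like
this:

    #define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        /* First pass: hint the kernel to start writeback on each file. */
        for (int i = 1; i < argc; i++)
        {
            int fd = open(argv[i], O_RDONLY);

            if (fd < 0)
            {
                perror(argv[i]);
                continue;       /* skip entries we can't open */
            }
            /* posix_fadvise returns an errno value, not -1/errno */
            int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
            if (rc != 0)
                fprintf(stderr, "%s: posix_fadvise: %s\n",
                        argv[i], strerror(rc));
            close(fd);
        }

        /* Second pass: wait until each file's data is actually on disk. */
        for (int i = 1; i < argc; i++)
        {
            int fd = open(argv[i], O_RDONLY);

            if (fd < 0)
                continue;       /* already complained above */
            if (fsync(fd) != 0)
                perror(argv[i]);
            close(fd);
        }
        return 0;
    }

Presumably the point of doing the DONTNEED pass over all files first is
to get the kernel started on writeback asynchronously, so the blocking
fsync() calls in the second pass have less left to wait for.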
I don't really want to try to do it in this patch, because it's
complicated enough already; but if people don't mind the shutdown
sequence potentially being slowed down a bit, that might allow us to
have the best of both worlds without needing to invent multiple
durability levels. I was sort of assuming that people wouldn't want to
slow down the shutdown sequence to avoid losing data they've already
declared isn't that valuable, but evidently I underestimated the demand
for kinda-durable tables. If the overhead of doing this isn't too
severe, it might be the way to go.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company