I had a look at the UNDO patches at https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, and at the patch to use the UNDO logs to clean up orphaned files, from undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to review?

Thanks Thomas and Amit and others for working on this! Orphaned relfiles have been an ugly wart forever. It's a small thing, but really nice to fix that finally. This has been a long thread, and I haven't read it all, so please forgive me if I repeat stuff that's already been discussed.

There are similar issues in CREATE/DROP DATABASE code. If you crash in the middle of CREATE DATABASE, you can be left with orphaned files in the data directory, or if you crash in the middle of DROP DATABASE, the data might be gone already but the pg_database entry is still there. We should plug those holes too.

There's a lot of stuff in the patches that is not relevant to cleaning up orphaned files. I know this orphaned-file cleanup work is mainly a vehicle to get the UNDO log committed, so that's expected. If we only cared about orphaned files, I'm sure the patches wouldn't spend so much effort on concurrency, for example. Nevertheless, I think we should leave out some stuff that's clearly unused, for now. For example, a bunch of fields in the record format: uur_block, uur_offset, uur_tuple. You can add them later, as part of the patches that actually need them, but for now they just make the patch larger to review.

Some more thoughts on the record format:

I feel that the level of abstraction is not quite right. There are a bunch of fields, like uur_block, uur_offset, uur_tuple, that are probably useful for some UNDO resource managers (zheap I presume), but seem kind of arbitrary. How is uur_tuple different from uur_payload? Should they be named more generically as uur_payload1 and uur_payload2? And why two, why not three or four different payloads? In the WAL record format, there's a concept of "block id", which allows you to store any number of different payloads in the record; I think that would be a better approach. Or only have one payload, and let the resource manager code divide it as it sees fit.

Many of the fields support a primitive type of compression, where a field can be omitted if it has the same value as on the first record on an UNDO page. That's handy. But again I don't like the fact that the fields have been hard-coded into the UNDO record format. I can see the relation OID, for example, being useful for many AMs. But not all. And other AMs might well want to store and deduplicate other things, aside from the fields that are in the patch now. I'd like to move most of the fields to AM specific code, and somehow generalize the compression. One approach would be to let the AM store an arbitrary struct, and run it through a general-purpose compression algorithm, using the UNDO page's first record as the "dictionary". Or make the UNDO page's first record available in whole to the AM specific code, and let the AM do the deduplication. For cleaning up orphaned files, though, we don't really care about any of that, so I'd recommend just ripping it out for now. Compression/deduplication can be added later as a separate patch.

The orphaned-file cleanup patch doesn't actually use the uur_reloid field. It stores the RelFileNode instead, in the payload. I think that's further evidence that the hard-coded fields in the record format are not quite right.


I don't like the way UndoFetchRecord returns a palloc'd UnpackedUndoRecord. I would prefer something similar to the xlogreader API, where a new call to UndoFetchRecord invalidates the previous result. Partly on efficiency grounds, to avoid the palloc, but also for consistency with xlogreader.

In the UNDO page header, there are a bunch of fields like pd_lower/pd_upper/pd_special that are copied from the "standard" page header but unused. There's a FIXME comment about that too. Let's remove them, there's no need for UNDO pages to look like standard relation pages. The LSN needs to be at the beginning, to work with the buffer manager, but that's the only requirement.

Could we leave out the UNDO and discard worker processes for now? Execute all UNDO actions immediately at rollback, and after crash recovery. That would be fine for cleaning up orphaned files, and it would cut down the size of the patch to review.

Can this race condition happen: Transaction A creates a table and an UNDO record to remember it. The transaction is rolled back, and the file is removed. Another transaction, B, creates a different table, and chooses the same relfilenode. It loads the table with data, and commits. Then the system crashes. After crash recovery, the UNDO record for the first transaction is applied, and it removes the file that belongs to the second table, created by transaction B.

- Heikki

