Gregory Stark wrote:
"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

It seems doable, but it's not pretty. One possible scheme would be to
emit a record *after* choosing a name but *before* creating the file,
No, because the way you know the name is good is a successful
open(O_CREAT).
The idea was to log *twice*: once to say that we're about to create a file,
and a second time to say that we succeeded. That way, the filename shows up
in the log even if we crash immediately after physically creating the file,
which gives recovery at least a chance to clean up the mess.

It sounds like if the reason it fails is because someone else created the same
file name you'll delete the wrong file?

Careful bookkeeping during recovery should be able to eliminate that risk,
I think. I've thought a bit more about this, and came up with the following
idea that also takes checkpoints into account.

We keep a global table of (xid, filename) pairs in shared memory. File creation
becomes
  1) Generate a new filename
  2) Add (CurrentTransactionId, filename) to the list, emit an XLOG record
     saying we did this, and flush the log. If the filename is already on
     the list, start over at (1).
  3) Create the file. If this fails, delete the list entry and the file,
     and start over at (1).
  4) On (sub) transaction ABORT, we remove entries with the xids we abort,
     and delete the files.
  5) On top transaction COMMIT, we remove entries with the xids we commit,
     and keep the files.
  6) During top transaction PREPARE, we record the entries with matching xids
     in the 2PC state file.
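
To make steps (1) to (5) concrete, here is a rough Python sketch of the
bookkeeping. It is only an illustration, not backend code: the class
PendingFileTable, the "FILE_INTENT" record name, and the in-memory list that
stands in for a flushed XLOG are all made up for the example. It also
simplifies step (3): on a failed create it only drops the list entry and
leaves any already-existing file alone.

```python
import os
import tempfile
import threading
import uuid


class PendingFileTable:
    """Toy stand-in for the shared-memory table of (xid, filename) pairs."""

    def __init__(self, directory):
        self.directory = directory
        self.lock = threading.Lock()
        self.pending = {}  # filename -> xid of the creating transaction
        self.wal = []      # stand-in for XLOG records that were flushed

    def create_file(self, xid):
        while True:
            name = uuid.uuid4().hex                          # step 1
            with self.lock:
                if name in self.pending:                     # step 2: name taken,
                    continue                                 # start over
                self.pending[name] = xid
                # Log the intent *before* the file exists on disk.
                self.wal.append(("FILE_INTENT", xid, name))
            path = os.path.join(self.directory, name)
            try:
                # step 3: O_EXCL makes creation fail if the name is in use.
                fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                with self.lock:
                    del self.pending[name]
                continue
            os.close(fd)
            return name

    def _remove(self, xids):
        """Drop all entries owned by xids and return their filenames."""
        with self.lock:
            names = [n for n, x in self.pending.items() if x in xids]
            for n in names:
                del self.pending[n]
        return names

    def abort(self, xids):
        # step 4: forget the entries and delete the files.
        for name in self._remove(xids):
            os.unlink(os.path.join(self.directory, name))

    def commit(self, xids):
        # step 5: forget the entries, keep the files.
        self._remove(xids)


if __name__ == "__main__":
    table = PendingFileTable(tempfile.mkdtemp())
    kept = table.create_file(100)
    dropped = table.create_file(101)
    table.commit({100})
    table.abort({101})
    print(sorted(os.listdir(table.directory)) == [kept])
```

The point of the ordering is that the intent record reaches the log before
os.open() runs, so a crash at any later point leaves a name that recovery
can find.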

When creating a checkpoint, we include the global filelist in the checkpoint.
We might need some interlock to ensure that concurrent global filelist
updates don't get lost - but maybe doing things in the correct order is
sufficient to guarantee this.

During recovery, we track the fate of the files in a similar (but local) list.
 .) We initialize our local tracking list with the one found in the latest
    CHECKPOINT.
 .) When we encounter a COMMIT record, we remove all files with xids matching
    those in the COMMIT record without deleting them.
 .) When we encounter a PREPARE record, we remove all files with matching xids,
    and record them in the 2PC state file. They are deleted if the PREPARED
    transaction is aborted.
 .) When we encounter an ABORT record, we remove all files with matching xids
    from the list, and delete them.
 .) When we encounter a runtime CHECKPOINT, its list should match our tracking
    list.
 .) When we encounter a shutdown CHECKPOINT, we remove all files from our local
    list that are not in the checkpoint's list, and delete those files.
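
The replay rules above could be sketched roughly like this. Again this is
only an illustration in Python with made-up record shapes (tuples tagged
"FILE_INTENT", "COMMIT", "PREPARE", "ABORT", "SHUTDOWN_CHECKPOINT"). It
assumes that the record emitted at file creation adds the file to the local
list during replay, which the rules imply but do not spell out. Runtime
checkpoints are ignored here, and whatever is still on the list at the end is
simply handed back to the caller.

```python
import os


def replay(directory, checkpoint_list, records):
    """Replay hypothetical log records and track files of open transactions.

    checkpoint_list: dict filename -> xid taken from the latest checkpoint.
    records: iterable of tuples such as
        ("FILE_INTENT", xid, filename)
        ("COMMIT", {xid, ...}), ("ABORT", {xid, ...}), ("PREPARE", {xid, ...})
        ("SHUTDOWN_CHECKPOINT", {filename: xid, ...})
    Returns (tracking, prepared): files still undecided, and files that now
    belong to prepared transactions.
    """
    tracking = dict(checkpoint_list)  # start from the checkpoint's list
    prepared = {}

    def take(xids):
        # Remove and return every tracked file owned by one of the xids.
        names = [n for n, x in tracking.items() if x in xids]
        return {n: tracking.pop(n) for n in names}

    def delete(name):
        try:
            os.unlink(os.path.join(directory, name))
        except FileNotFoundError:
            pass  # crashed after logging the intent but before creating it

    for rec in records:
        kind = rec[0]
        if kind == "FILE_INTENT":
            _, xid, name = rec
            tracking[name] = xid
        elif kind == "COMMIT":
            take(rec[1])                      # keep the files
        elif kind == "PREPARE":
            prepared.update(take(rec[1]))     # fate decided later by 2PC
        elif kind == "ABORT":
            for name in take(rec[1]):
                delete(name)
        elif kind == "SHUTDOWN_CHECKPOINT":
            # Anything we track that the checkpoint does not list is garbage.
            for name in [n for n in tracking if n not in rec[1]]:
                del tracking[name]
                delete(name)
    return tracking, prepared
```

Tolerating a missing file in delete() matters because the intent record is
written before the file is created, so the log can name a file that never
made it to disk.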

The XLOG flush in step (2) is pretty nasty, but I think any solution that
guarantees to prevent leaks will have to flush something to disk at that
point. The global table isn't too appealing either, because it
will limit how many concurrent transactions will be able to create files. It
could be replaced by some on-disk thing, though.

This solution sounds rather heavy-weight, but I thought I'd share the idea.

Back to work on lazy xid assignment now ;-)

greetings, Florian Pflug
