On Mon, Mar 27, 2017 at 9:41 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> > SharedBufFile allows temporary files to be created by one backend and
> > then exported for read-only access by other backends, with clean-up
> > managed by reference counting associated with a DSM segment.  This
> > includes changes to fd.c and buffile.c to support new kinds of
> > temporary file.
>
> diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
> index 4ca0ea4..a509c05 100644
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
>
> I think the new facilities should be explained in the file's header.
Will do.

> @@ -68,9 +71,10 @@ struct BufFile
>   * avoid making redundant FileSeek calls.
>   */
>
> -     bool            isTemp;         /* can only add files if this is TRUE */
> +     bool            isSegmented;    /* can only add files if this is TRUE */
>
> That's a bit of a weird and uncommented upon change.

I was trying to cut down on the number of places we use the word
'temporary' to activate various different behaviours.  In this case, the
only thing it controls is whether the BufFile is backed by one single
fd.c File or by many segments, so I figured it should be renamed.  As
Peter and you have pointed out, there may be a case for removing it
altogether.

> @@ -79,6 +83,8 @@ struct BufFile
>   */
>      ResourceOwner resowner;
>
> +     BufFileTag      tag;            /* for discoverability between backends */
>
> Not perfectly happy with the name tag here, the name is a bit too
> similar to BufferTag - something quite different.

Yeah, will rename.

> +static void
> +make_tagged_path(char *tempdirpath, char *tempfilepath,
> +                 const BufFileTag *tag, int segment)
> +{
> +     if (tag->tablespace == DEFAULTTABLESPACE_OID ||
> +             tag->tablespace == GLOBALTABLESPACE_OID)
> +             snprintf(tempdirpath, MAXPGPATH, "base/%s", PG_TEMP_FILES_DIR);
> +     else
> +     {
> +             snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s",
> +                      tag->tablespace, TABLESPACE_VERSION_DIRECTORY,
> +                      PG_TEMP_FILES_DIR);
> +     }
> +
> +     snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath,
> +              PG_TEMP_FILE_PREFIX,
> +              tag->creator_pid, tag->set, tag->partition, tag->participant,
> +              segment);
>
> Is there a risk that this ends up running afoul of filename length
> limits on some platforms?

Hmm.  I didn't think so.  Do we have a project guideline on maximum path
lengths based on some kind of survey?  There are various limits involved
(filesystem and OS per-path-component limits, total limits, and the
confusing PATH_MAX, MAX_PATH etc macros), but I was under the impression
that these numbers were always at least 255.
This scheme seems capable of producing ~50 bytes in the final component
(admittedly more if int is 64 bits), and then nowhere near enough to
reach a limit of that order in the earlier components.

> +}
> +
> +static File
> +make_tagged_segment(const BufFileTag *tag, int segment)
> +{
> +     File            file;
> +     char            tempdirpath[MAXPGPATH];
> +     char            tempfilepath[MAXPGPATH];
> +
> +     /*
> +      * There is a remote chance that disk files with this (pid, set) pair
> +      * already exist after a crash-restart.  Since the presence of
> +      * consecutively numbered segment files is used by BufFileOpenShared to
> +      * determine the total size of a shared BufFile, we'll defend against
> +      * confusion by unlinking segment 1 (if it exists) before creating
> +      * segment 0.
> +      */
>
> Gah.  Why on earth aren't we removing temp files when restarting, not
> just on the initial start?  That seems completely wrong?

See the comment above RemovePgTempFiles in fd.c.  From comments on this
list I understand that this is a subject on which Robert and Tom don't
agree.  I don't mind either way, but as long as RemovePgTempFiles works
that way and my patch uses the existence of files to know how many files
there are, I have to defend against that danger by making sure that I
don't accidentally identify files from before a crash/restart as active.

> If we do decide not to change this: Why is that sufficient?  Doesn't the
> same problem exist for segments later than the first?

It does exist and it is handled.  The comment really should say
"unlinking segment N + 1 (if it exists) before creating segment N".
Will update.

> +/*
> + * Open a file that was previously created in another backend with
> + * BufFileCreateShared.
> + */
> +BufFile *
> +BufFileOpenTagged(const BufFileTag *tag)
> +{
> +     BufFile    *file = (BufFile *) palloc(sizeof(BufFile));
> +     char            tempdirpath[MAXPGPATH];
> +     char            tempfilepath[MAXPGPATH];
> +     Size            capacity = 1024;
> +     File       *files = palloc(sizeof(File) * capacity);
> +     int             nfiles = 0;
> +
> +     /*
> +      * We don't know how many segments there are, so we'll probe the
> +      * filesystem to find out.
> +      */
> +     for (;;)
> +     {
> +             /* See if we need to expand our file space. */
> +             if (nfiles + 1 > capacity)
> +             {
> +                     capacity *= 2;
> +                     files = repalloc(files, sizeof(File) * capacity);
> +             }
> +             /* Try to load a segment. */
> +             make_tagged_path(tempdirpath, tempfilepath, tag, nfiles);
> +             files[nfiles] = PathNameOpenTemporaryFile(tempfilepath);
> +             if (files[nfiles] <= 0)
> +                     break;
>
> Isn't 0 a theoretically valid return value from
> PathNameOpenTemporaryFile?

I was confused by that too, because it isn't the way normal OS fds work.
But existing code dealing with Postgres vfd return values treats 0 as an
error.  See for example OpenTemporaryFile and
OpenTemporaryFileInTablespace.

> +/*
> + * Delete a BufFile that was created by BufFileCreateTagged.  Return true if
> + * at least one segment was deleted; false indicates that no segment was
> + * found, or an error occurred while trying to delete.  Errors are logged but
> + * the function returns normally because this is assumed to run in a clean-up
> + * path that might already involve an error.
> + */
> +bool
> +BufFileDeleteTagged(const BufFileTag *tag)
> +{
> +     char            tempdirpath[MAXPGPATH];
> +     char            tempfilepath[MAXPGPATH];
> +     int             segment = 0;
> +     bool            found = false;
> +
> +     /*
> +      * We don't know if the BufFile really exists, because SharedBufFile
> +      * tracks only the range of file numbers.  If it does exist, we don't
> +      * know how many 1GB segments it has, so we'll delete until we hit
> +      * ENOENT or an IO error.
> +      */
> +     for (;;)
> +     {
> +             make_tagged_path(tempdirpath, tempfilepath, tag, segment);
> +             if (!PathNameDeleteTemporaryFile(tempfilepath, false))
> +                     break;
> +             found = true;
> +             ++segment;
> +     }
> +
> +     return found;
> +}
>
> If we crash in the middle of this, we'll leave the later files
> abandoned, no?

Yes.  In general, there are places where we can crash or unplug the
server etc and leave files behind.  In that case, RemovePgTempFiles
cleans up (or declines to do so deliberately to support debugging, as
discussed).

> +/*
> + * BufFileSetReadOnly --- flush and make read-only, in preparation for sharing
> + */
> +void
> +BufFileSetReadOnly(BufFile *file)
> +{
> +     BufFileFlush(file);
> +     file->readOnly = true;
> +}
>
> That flag is unused, right?

It's used for an assertion in BufFileWrite.  Maybe it could be
elog(ERROR, ...) instead, but either way it's a debugging aid to report
misuse.

> + * PathNameCreateTemporaryFile, PathNameOpenTemporaryFile and
> + * PathNameDeleteTemporaryFile are used for temporary files that may be shared
> + * between backends.  A File created or opened with these functions is not
> + * automatically deleted when the file is closed, but it is automatically
> + * closed at end of transaction and counts against the temporary file limit of
> + * the backend that created it.  Any File created this way must be explicitly
> + * deleted with PathNameDeleteTemporaryFile.  Automatic file deletion is not
> + * provided because this interface is designed for use by buffile.c and
> + * indirectly by sharedbuffile.c to implement temporary files with shared
> + * ownership and cleanup.
>
> Hm.  Those names are pretty easy to misunderstand, no?  s/Temp/Shared/?

Hmm.  Yeah, these may be better.  Will think about that.

> /*
> + * Called whenever a temporary file is deleted to report its size.
> + */
> +static void
> +ReportTemporaryFileUsage(const char *path, off_t size)
> +{
> +     pgstat_report_tempfile(size);
> +
> +     if (log_temp_files >= 0)
> +     {
> +             if ((size / 1024) >= log_temp_files)
> +                     ereport(LOG,
> +                             (errmsg("temporary file: path \"%s\", size %lu",
> +                                     path, (unsigned long) size)));
> +     }
> +}
>
> Man, the code for this sucks (not your fault).  Shouldn't this properly
> be at the buffile.c level, where we could implement limits above 1GB
> properly?

+1

> +/*
> + * Open a file that was created with PathNameCreateTemporaryFile in another
> + * backend.  Files opened this way don't count against the temp_file_limit of
> + * the caller, are read-only and are automatically closed at the end of the
> + * transaction but are not deleted on close.
> + */
>
> This really reinforces my issues with the naming scheme.  This ain't a
> normal tempfile.

It sort of makes sense if you consider that a 'named' temporary file is
different... but yeah, point taken.

> +File
> +PathNameOpenTemporaryFile(char *tempfilepath)
> +{
> +     File            file;
> +
> +     /*
> +      * Open the file.  Note: we don't use O_EXCL, in case there is an
> +      * orphaned temp file that can be reused.
> +      */
> +     file = PathNameOpenFile(tempfilepath, O_RDONLY | PG_BINARY, 0);
>
> If so, wouldn't we need to truncate the file?

Yes, this lacks O_TRUNC.  Thanks.

> + * A single SharedBufFileSet can manage any number of 'tagged' BufFiles that
> + * are shared between a fixed number of participating backends.  Each shared
> + * BufFile can be written to by a single participant but can be read by any
> + * backend after it has been 'exported'.  Once a given BufFile is exported, it
> + * becomes read-only and cannot be extended.  To create a new shared BufFile,
> + * a participant needs its own distinct participant number, and needs to
> + * specify an arbitrary partition number for the file.
> + * To make it available to other backends, it must be explicitly exported,
> + * which flushes internal buffers and renders it read-only.  To open a file
> + * that has been shared, a backend needs to know the number of the
> + * participant that created the file, and the partition number.  It is the
> + * responsibility of calling code to ensure that files are not accessed
> + * before they have been shared.
>
> Hm.  One way to make this safer would be to rename files when exporting.
> Should be sufficient to do this to the first segment, I guess.

Interesting idea.  Will think about that.  That comment isn't great and
repeats itself.  Will improve.

> + * Each file is identified by a partition number and a participant number, so
> + * that a SharedBufFileSet can be viewed as a 2D table of individual files.
>
> I think using "files" as a term here is a bit dangerous - they're
> individually segmented again, right?

True.  It's a 2D matrix of BufFiles.  The word "file" is super
overloaded here.  Will fix.

> +/*
> + * The number of bytes of shared memory required to construct a
> + * SharedBufFileSet.
> + */
> +Size
> +SharedBufFileSetSize(int participants)
> +{
> +     return offsetof(SharedBufFileSet, participants) +
> +             sizeof(SharedBufFileParticipant) * participants;
> +}
>
> The function name sounds a bit like a function actually setting some
> size... s/Size/DetermineSize/?

Hmm, yeah: "set" as verb vs "set" as noun.  I think "estimate" is the
established word for this sort of thing (even though that seems strange
because it sounds like it doesn't have to be exactly right: clearly in
all these shmem-space-reservation functions it has to be exactly right).
Will change.

> +/*
> + * Create a new file suitable for sharing.  Each backend that calls this
> + * must use a distinct participant number.  Behavior is undefined if a
> + * participant calls this more than once for the same partition number.
> + * Partitions should ideally be numbered consecutively or in as small a
> + * range as possible, because file cleanup will scan the range of known
> + * partitions looking for files.
> + */
>
> Wonder if we shouldn't just create a directory for all such files.

Hmm.  Yes, that could work well.  Will try that.

> I'm a bit unhappy with the partition terminology around this.  It's
> getting a bit confusing.  We have partitions, participants and
> segments.  Most of them could be understood as something entirely
> different than the meaning you have here...

OK.  Let me try to explain and defend them and see if we can come up
with something better.

1.  Segments are what buffile.c already calls the individual
capped-at-1GB files that it manages.  They are an implementation detail
that is not part of buffile.c's user interface.  There seems to be no
reason to change that.  My understanding is that this was done to
support pre-large-file filesystems/OSes which limited files to 2^31 or
2^32 bytes, and we decided to cap the segments at 1GB (maybe some
ancient OS had a 2^30 limit, or maybe it was just a conservative choice
that's easy for humans to think about).  We could perhaps get rid of
that entirely today without anyone complaining and just use one big
file, though I don't know that and I'm not suggesting it.

(One argument against that is that the parallel CREATE INDEX patch
actually makes use of the segmented nature of BufFiles to splice them
together, to 'unify' a bunch of worker LogicalTapeSets to create one
LogicalTapeSet.  That's off topic here but it's in the back of my mind
as a potential client of this code.  I'll have more to say about that
over on the parallel CREATE INDEX thread shortly.)

2.  Partitions here = 'batches'.  The 'batches' used by the hash join
code are formally partitions in all the literature on hash joins, and I
bet that anyone else doing parallel work that involves sharing temporary
disk files will run into the need for partitioning.
I think you are complaining that we now have a database object called a
PARTITION, and that may be a problem because we're overloading the term.
But it's the same name because it's mathematically the same thing.  We
don't complain about the existence of 'lock tables' or 'hash tables'
just because there is a database object called a TABLE.  I considered
other names for this, like "file number", but it was confusing and
vague.  I keep coming back to "partition" for this, because
fundamentally this is for partitioning temporary data.  I could maybe
call it "file_partition" or something?

3.  Participants are what I have taken to calling the processes involved
in parallel query, when I mean the larger set that includes workers +
leader.  It may seem a little odd that such a thing appears in an API
that deals with temporary files.  But the basic idea here is that each
participant gets to write out its own partial results, for each
partition.  Stepping back a bit, that means that there are two kinds of
partitioning going on at the same time: partitioning the keyspace into
batch numbers, and then the arbitrary partitioning that comes from each
participant processing partial plans.  This is how SharedBufFileSet
finishes up managing a 2D matrix of BufFiles.

You might argue that buffile.c shouldn't know about partitions and
participants.  The only thing I really need here is for BufFileTag (to
be renamed) to be able to name things differently.  Perhaps it should
just include a char[] buffer for a name fragment, and the
SharedBufFileSet should encode the partition and participant numbers
into it, rather than exposing these rather higher level concepts to
buffile.c.  I will think about that.  (Perhaps SharedBufFileSet should
be called PartitionedBufFileSet?)
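To make the name-fragment idea concrete, here is a minimal sketch of what
I have in mind.  Everything here (the struct layout, the fragment size,
and the function name shared_buf_file_make_name) is invented for
illustration and is not in the patch: the point is only that
SharedBufFileSet would flatten partition and participant into an opaque
string, and buffile.c would splice that string into the path without
knowing what it means.

```c
#include <stdio.h>
#include <string.h>

#define NAME_FRAGMENT_LEN 64	/* illustrative size, not from the patch */

/* Hypothetical replacement for BufFileTag: the name is opaque to buffile.c. */
typedef struct BufFileTag
{
	unsigned int tablespace;			/* stands in for Oid */
	int			creator_pid;
	char		name[NAME_FRAGMENT_LEN];	/* built by higher-level code */
} BufFileTag;

/*
 * SharedBufFileSet would build the fragment, encoding the higher-level
 * concepts; buffile.c would just append tag->name to the temp file prefix.
 */
static void
shared_buf_file_make_name(BufFileTag *tag, int partition, int participant)
{
	snprintf(tag->name, sizeof(tag->name), "%d.%d", partition, participant);
}
```

With this shape, a PartitionedBufFileSet (or whatever it ends up being
called) owns the partition/participant vocabulary, and buffile.c only
ever sees an uninterpreted suffix.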
> +static void
> +shared_buf_file_on_dsm_detach(dsm_segment *segment, Datum datum)
> +{
> +     bool            unlink_files = false;
> +     SharedBufFileSet *set = (SharedBufFileSet *) DatumGetPointer(datum);
> +
> +     SpinLockAcquire(&set->mutex);
> +     Assert(set->refcount > 0);
> +     if (--set->refcount == 0)
> +             unlink_files = true;
> +     SpinLockRelease(&set->mutex);
>
> I'm a bit uncomfortable with releasing a refcount, and then still using
> the memory from the set...  I don't think there's a concrete danger
> here as the code stands, but it's a fairly dangerous pattern.

Will fix.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers