On 01.08.2019 6:10, Craig Ringer wrote:
3. It is not possible to use temporary tables at replica.
For physical replicas, yes.
Yes, definitely logical replicas (for example our PgPro-EE multimaster
based on logical replication) do not suffer from this problem.
But in case of multimaster we have another problem related with
temporary tables: we have to use 2PC for each transaction and using
temporary tables in prepared transaction is now prohibited.
This was the motivation of the patch proposed by Stas Kelvich which
allows to use temporary tables in prepared transactions under some
conditions.
5. Inefficient memory usage and possible memory overflow: each backend
maintains its own local buffers for work with temporary tables.
Is there any reason that would change with global temp tables? We'd
still be creating a backend-local relfilenode for each backend that
actually writes to the temp table, and I don't see how it'd be useful
or practical to keep those in shared_buffers.
Yes, my implementation of global temp tables is using shared buffers.
It was not strictly needed as far as data is local. It is possible to
have shared metadata and private data accessed through local buffers.
But I have done it for three reasons:
1, Make it possible to use parallel plans for temp tables.
2. Eliminate memory overflow problem.
3. Make in possible to reschedule session to other backens (connection
pooler).
Using local buffers has big advantages too. It saves shared_buffers
space for data where there's actually some possibility of getting
cache hits, or for where we can benefit from lazy/async writeback and
write combining. I wouldn't want to keep temp data there if I had the
option.
Definitely local buffers have some advantages:
- do not require synchronization
- avoid flushing data from shared buffers
But global temp tables are not excluding use of original (local) temp
tables.
So you will have a choice: either to use local temp tables which can be
easily created on demand and accessed through local buffers,
either create global temp tables, which eliminate catalog bloating,
allow parallel queries and which data is controlled by the same cache
replacement discipline as for normal tables...
Default size of temporary buffers is 8Mb. It seems to be too small
for
modern servers having hundreds of gigabytes of RAM, causing extra
copying of data between OS cache and local buffers. But if there are
thousands of backends, each executing queries with temporary tables,
then total amount of memory used for temporary buffers can exceed
several tens of gigabytes.
Right. But what solution do you propose for this? Putting that in
shared_buffers will do nothing except deprive shared_buffers of space
that can be used for other more useful things. A server-wide temp
buffer would add IPC and locking overheads and AFAICS little benefit.
One of the big appeals of temp tables is that we don't need any of that.
I do not think that parallel execution and efficient connection pooling
are "little benefit".
If you want to improve server-wide temp buffer memory accounting and
management that makes sense. I can see it being useful to have things
like a server-wide DSM/DSA pool of temp buffers that backends borrow
from and return to based on memory pressure on a LRU-ish basis, maybe.
But I can also see how that'd be complex and hard to get right. It'd
also be prone to priority inversion problems where an idle/inactive
backend must be woken up to release memory or release locks, depriving
an actively executing backend of runtime. And it'd be as likely to
create inefficiencies with copying and eviction as solve them since
backends could easily land up taking turns kicking each other out of
memory and re-reading their own data.
I don't think this is something that should be tackled as part of work
on global temp tables personally.
My assumptions are the following: temporary tables are mostly used in
OLAP queries. And OLAP workload means that there are few concurrent
queries which are working with large datasets.
So size of produced temporary tables can be quite big. For OLAP it seems
to be very important to be able to use parallel query execution and use
the same cache eviction rule both for persistent and temp tables
(otherwise you either cause swapping, either extra copying of data
between OS and Postgres caches).
6. Connection pooler can not reschedule session which has created
temporary tables to some other backend because it's data is stored
in local buffers.
Yeah, if you're using transaction-associative pooling. That's just
part of a more general problem though, there are piles of related
issues with temp tables, session GUCs, session advisory locks and more.
I don't see how global temp tables will do you the slightest bit of
good here as the data in them will still be backend-local. If it isn't
then you should just be using unlogged tables.
You can not use the same unlogged table to save intermediate query
results in two parallel sessions.
Definition of this table (metadata) is shared by all backends but
data
is private to the backend. After session termination data is
obviously lost.
+1 that's what a global temp table should be, and it's IIRC pretty
much how the SQL standard specifies temp tables.
I suspect I'm overlooking some complexities here, because to me it
seems like we could implement these fairly simply. A new relkind would
identify it as a global temp table and the relfilenode would be 0.
Same for indexes on temp tables. We'd extend the relfilenode mapper to
support a backend-local non-persistent relfilenode map that's used to
track temp table and index relfilenodes. If no relfilenode is defined
for the table, the mapper would allocate one. We already happily
create missing relfilenodes on write so we don't even have to
pre-create the actual file. We'd register the relfilenode as a
tempfile and use existing tempfile cleanup mechanisms, and we'd use
the temp tablespace to store it.
I must be missing something important because it doesn't seem hard.
As I already wrote, I tried to kill two bird with one stone: eliminate
catalog bloating and allow access to temp tables from multiple backends
(to be able to perform parallel queries and connection pooling).
This is why I have to use shared buffers for global temp tables.
May be it was not so good idea. But it was one of my primary intention
of publishing this patch to know opinion of other people.
In PG-Pro some of my colleagues think that the most critical problem is
inability to use temporary tables at replica.
Other think that it is not a problem at all if you are using logical
replication.
From my point of view the most critical problem is inability to use
parallel plans for temporary tables.
But looks like you don't think so.
I see three different activities related with temporary tables:
1. Shared metadata
2. Shared buffers
3. Alternative concurrency control & reducing tuple header size
(specialized table access method for temporary tables)
In my proposal I combined 1 and 2, leaving 3 for next step.
I will be interested to know other suggestions.
One more thing - 1 and 2 are really independent: you can share metadata
without sharing buffers.
But introducing yet another kind of temporary tables seems to be really
overkill:
- local temp tables (private namespace and lcoal buffers)
- tables with shared metadata but local bufferes
- tables with shared metadata and bufferes
The drawback of such approach is that it will be necessary to
reimplement large bulk of heapam code.
But this approach allows to eliminate visibility check for temporary
table tuples and decrease size of tuple header.
That sounds potentially cool, but perhaps a "next step" thing? Allow
the creation of global temp tables to specify reloptions, and you can
add it as a reloption later. You can't actually eliminate visibility
checks anyway because they're still MVCC heaps.
Sorry?
I mean elimination of MVCC overhead (visibility checks) for temp tables
only.
I am not sure that we can really fully eliminate it if we support use of
temp tables in prepared transactions and autonomous transactions (yet
another awful feature we have in PgPro-EE).
Also looks like we need to have some analogue of CID to be able to
correctly executed queries like "insert into T (select from T ...)"
where T is global temp table.
I didn't think much about it, but I really considering new table access
method API for reducing per-tuple storage overhead for temporary and
append-only tables.
Savepoints can create invisible tuples even if you're using temp
tables that are cleared on commit, and of course so can DELETEs or
UPDATEs. So I'm not sure how much use it'd really be in practice.
Yehh, subtransactions can be also a problem for eliminating xmin/xmax
for temp tables. Thanks for noticing it.
I noticed that I have not patched some extension - fixed and rebased
version of the patch is attached.
Also you can find this version in our github repository:
https://github.com/postgrespro/postgresql.builtin_pool.git
branch global_temp.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 1bd579f..2d93f6f 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -153,9 +153,9 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
buf_state = LockBufHdr(bufHdr);
fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
- fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode;
- fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
+ fctx->record[i].relfilenode = bufHdr->tag.rnode.node.relNode;
+ fctx->record[i].reltablespace = bufHdr->tag.rnode.node.spcNode;
+ fctx->record[i].reldatabase = bufHdr->tag.rnode.node.dbNode;
fctx->record[i].forknum = bufHdr->tag.forkNum;
fctx->record[i].blocknum = bufHdr->tag.blockNum;
fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 38ae240..8a04954 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -608,9 +608,9 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
if (buf_state & BM_TAG_VALID &&
((buf_state & BM_PERMANENT) || dump_unlogged))
{
- block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
- block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
- block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+ block_info_array[num_blocks].database = bufHdr->tag.rnode.node.dbNode;
+ block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.node.spcNode;
+ block_info_array[num_blocks].filenode = bufHdr->tag.rnode.node.relNode;
block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
++num_blocks;
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index c945b28..14d4e48 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -95,13 +95,13 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda
if (PageAddItem(page, (Item) itup, IndexTupleSize(itup), offset, false, false) == InvalidOffsetNumber)
{
- RelFileNode node;
+ RelFileNodeBackend rnode;
ForkNumber forknum;
BlockNumber blknum;
- BufferGetTag(buffer, &node, &forknum, &blknum);
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
elog(ERROR, "failed to add item to index page in %u/%u/%u",
- node.spcNode, node.dbNode, node.relNode);
+ rnode.node.spcNode, rnode.node.dbNode, rnode.node.relNode);
}
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe..c60effd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -671,6 +671,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
* init fork of an unlogged relation.
*/
if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+ rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
forkNum == INIT_FORKNUM))
log_smgrcreate(newrnode, forkNum);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5962126..60f4696 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -763,7 +763,11 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
/* Read an existing block of the relation */
buf = ReadBuffer(rel, blkno);
LockBuffer(buf, access);
- _bt_checkpage(rel, buf);
+ /* Session temporary relation may be not yet initialized for this backend. */
+ if (blkno == BTREE_METAPAGE && PageIsNew(BufferGetPage(buf)) && IsSessionRelationBackendId(rel->rd_backend))
+ _bt_initmetapage(BufferGetPage(buf), P_NONE, 0);
+ else
+ _bt_checkpage(rel, buf);
}
else
{
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3ec67d4..edec8ca 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -213,6 +213,7 @@ void
XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
{
registered_buffer *regbuf;
+ RelFileNodeBackend rnode;
/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
@@ -227,7 +228,8 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
regbuf = ®istered_buffers[block_id];
- BufferGetTag(buffer, ®buf->rnode, ®buf->forkno, ®buf->block);
+ BufferGetTag(buffer, &rnode, ®buf->forkno, ®buf->block);
+ regbuf->rnode = rnode.node;
regbuf->page = BufferGetPage(buffer);
regbuf->flags = flags;
regbuf->rdata_tail = (XLogRecData *) ®buf->rdata_head;
@@ -919,7 +921,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
int flags;
PGAlignedBlock copied_buffer;
char *origdata = (char *) BufferGetBlock(buffer);
- RelFileNode rnode;
+ RelFileNodeBackend rnode;
ForkNumber forkno;
BlockNumber blkno;
@@ -948,7 +950,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
flags |= REGBUF_STANDARD;
BufferGetTag(buffer, &rnode, &forkno, &blkno);
- XLogRegisterBlock(0, &rnode, forkno, blkno, copied_buffer.data, flags);
+ XLogRegisterBlock(0, &rnode.node, forkno, blkno, copied_buffer.data, flags);
recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI_FOR_HINT);
}
@@ -1009,7 +1011,7 @@ XLogRecPtr
log_newpage_buffer(Buffer buffer, bool page_std)
{
Page page = BufferGetPage(buffer);
- RelFileNode rnode;
+ RelFileNodeBackend rnode;
ForkNumber forkNum;
BlockNumber blkno;
@@ -1018,7 +1020,7 @@ log_newpage_buffer(Buffer buffer, bool page_std)
BufferGetTag(buffer, &rnode, &forkNum, &blkno);
- return log_newpage(&rnode, forkNum, blkno, page, page_std);
+ return log_newpage(&rnode.node, forkNum, blkno, page, page_std);
}
/*
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index a065419..8814afb 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -409,6 +409,9 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence)
case RELPERSISTENCE_TEMP:
backend = BackendIdForTempRelations();
break;
+ case RELPERSISTENCE_SESSION:
+ backend = BackendIdForSessionRelations();
+ break;
case RELPERSISTENCE_UNLOGGED:
case RELPERSISTENCE_PERMANENT:
backend = InvalidBackendId;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 99ae159..24b2438 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3612,7 +3612,7 @@ reindex_relation(Oid relid, int flags, int options)
if (flags & REINDEX_REL_FORCE_INDEXES_UNLOGGED)
persistence = RELPERSISTENCE_UNLOGGED;
else if (flags & REINDEX_REL_FORCE_INDEXES_PERMANENT)
- persistence = RELPERSISTENCE_PERMANENT;
+ persistence = rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ? RELPERSISTENCE_SESSION : RELPERSISTENCE_PERMANENT;
else
persistence = rel->rd_rel->relpersistence;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f..a111ddc 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -93,6 +93,10 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
backend = InvalidBackendId;
needs_wal = false;
break;
+ case RELPERSISTENCE_SESSION:
+ backend = BackendIdForSessionRelations();
+ needs_wal = false;
+ break;
case RELPERSISTENCE_PERMANENT:
backend = InvalidBackendId;
needs_wal = true;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index cedb4ee..d11c5b3 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1400,7 +1400,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
*/
if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
- else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+ else if (newrelpersistence != RELPERSISTENCE_TEMP)
reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
/* Report that we are now reindexing relations */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fb2be10..372c9a5 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7678,6 +7678,12 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
errmsg("constraints on unlogged tables may reference only permanent or unlogged tables")));
break;
+ case RELPERSISTENCE_SESSION:
+ if (pkrel->rd_rel->relpersistence != RELPERSISTENCE_SESSION)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("constraints on session tables may reference only session tables")));
+ break;
case RELPERSISTENCE_TEMP:
if (pkrel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
ereport(ERROR,
@@ -14082,6 +14088,13 @@ ATPrepChangePersistence(Relation rel, bool toLogged)
RelationGetRelationName(rel)),
errtable(rel)));
break;
+ case RELPERSISTENCE_SESSION:
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("cannot change logged status of session table \"%s\"",
+ RelationGetRelationName(rel)),
+ errtable(rel)));
+ break;
case RELPERSISTENCE_PERMANENT:
if (toLogged)
/* nothing to do */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c97bb36..f9b2000 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3265,20 +3265,11 @@ OptTemp: TEMPORARY { $$ = RELPERSISTENCE_TEMP; }
| TEMP { $$ = RELPERSISTENCE_TEMP; }
| LOCAL TEMPORARY { $$ = RELPERSISTENCE_TEMP; }
| LOCAL TEMP { $$ = RELPERSISTENCE_TEMP; }
- | GLOBAL TEMPORARY
- {
- ereport(WARNING,
- (errmsg("GLOBAL is deprecated in temporary table creation"),
- parser_errposition(@1)));
- $$ = RELPERSISTENCE_TEMP;
- }
- | GLOBAL TEMP
- {
- ereport(WARNING,
- (errmsg("GLOBAL is deprecated in temporary table creation"),
- parser_errposition(@1)));
- $$ = RELPERSISTENCE_TEMP;
- }
+ | GLOBAL TEMPORARY { $$ = RELPERSISTENCE_SESSION; }
+ | GLOBAL TEMP { $$ = RELPERSISTENCE_SESSION; }
+ | SESSION { $$ = RELPERSISTENCE_SESSION; }
+ | SESSION TEMPORARY { $$ = RELPERSISTENCE_SESSION; }
+ | SESSION TEMP { $$ = RELPERSISTENCE_SESSION; }
| UNLOGGED { $$ = RELPERSISTENCE_UNLOGGED; }
| /*EMPTY*/ { $$ = RELPERSISTENCE_PERMANENT; }
;
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e8ffa04..2004d2f 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3483,6 +3483,7 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
{
ReorderBufferTupleCidKey key;
ReorderBufferTupleCidEnt *ent;
+ RelFileNodeBackend rnode;
ForkNumber forkno;
BlockNumber blockno;
bool updated_mapping = false;
@@ -3496,7 +3497,8 @@ ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
* get relfilenode from the buffer, no convenient way to access it other
* than that.
*/
- BufferGetTag(buffer, &key.relnode, &forkno, &blockno);
+ BufferGetTag(buffer, &rnode, &forkno, &blockno);
+ key.relnode = rnode.node;
/* tuples can only be in the main fork */
Assert(forkno == MAIN_FORKNUM);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402..76ce953 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -556,7 +556,7 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
int buf_id;
/* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode,
forkNum, blockNum);
/* determine its hash code and partition lock ID */
@@ -710,7 +710,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
Block bufBlock;
bool found;
bool isExtend;
- bool isLocalBuf = SmgrIsTemp(smgr);
+ bool isLocalBuf = SmgrIsTemp(smgr) && relpersistence == RELPERSISTENCE_TEMP;
*hit = false;
@@ -1010,7 +1010,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
uint32 buf_state;
/* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
+ INIT_BUFFERTAG(newTag, smgr->smgr_rnode, forkNum, blockNum);
/* determine its hash code and partition lock ID */
newHash = BufTableHashCode(&newTag);
@@ -1532,7 +1532,8 @@ ReleaseAndReadBuffer(Buffer buffer,
{
bufHdr = GetLocalBufferDescriptor(-buffer - 1);
if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
+ RelFileNodeEquals(bufHdr->tag.rnode.node, relation->rd_node) &&
+ bufHdr->tag.rnode.backend == relation->rd_backend &&
bufHdr->tag.forkNum == forkNum)
return buffer;
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
@@ -1543,7 +1544,8 @@ ReleaseAndReadBuffer(Buffer buffer,
bufHdr = GetBufferDescriptor(buffer - 1);
/* we have pin, so it's ok to examine tag without spinlock */
if (bufHdr->tag.blockNum == blockNum &&
- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) &&
+ RelFileNodeEquals(bufHdr->tag.rnode.node, relation->rd_node) &&
+ bufHdr->tag.rnode.backend == relation->rd_backend &&
bufHdr->tag.forkNum == forkNum)
return buffer;
UnpinBuffer(bufHdr, true);
@@ -1845,8 +1847,8 @@ BufferSync(int flags)
item = &CkptBufferIds[num_to_scan++];
item->buf_id = buf_id;
- item->tsId = bufHdr->tag.rnode.spcNode;
- item->relNode = bufHdr->tag.rnode.relNode;
+ item->tsId = bufHdr->tag.rnode.node.spcNode;
+ item->relNode = bufHdr->tag.rnode.node.relNode;
item->forkNum = bufHdr->tag.forkNum;
item->blockNum = bufHdr->tag.blockNum;
}
@@ -2559,7 +2561,7 @@ PrintBufferLeakWarning(Buffer buffer)
}
/* theoretically we should lock the bufhdr here */
- path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
+ path = relpathbackend(buf->tag.rnode.node, backend, buf->tag.forkNum);
buf_state = pg_atomic_read_u32(&buf->state);
elog(WARNING,
"buffer refcount leak: [%03d] "
@@ -2631,7 +2633,7 @@ BufferGetBlockNumber(Buffer buffer)
* a buffer.
*/
void
-BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
+BufferGetTag(Buffer buffer, RelFileNodeBackend *rnode, ForkNumber *forknum,
BlockNumber *blknum)
{
BufferDesc *bufHdr;
@@ -2696,7 +2698,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/* Find smgr relation for buffer */
if (reln == NULL)
- reln = smgropen(buf->tag.rnode, InvalidBackendId);
+ reln = smgropen(buf->tag.rnode.node, buf->tag.rnode.backend);
TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
buf->tag.blockNum,
@@ -2930,7 +2932,7 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
int i;
/* If it's a local relation, it's localbuf.c's problem. */
- if (RelFileNodeBackendIsTemp(rnode))
+ if (RelFileNodeBackendIsLocalTemp(rnode))
{
if (rnode.backend == MyBackendId)
DropRelFileNodeLocalBuffers(rnode.node, forkNum, firstDelBlock);
@@ -2958,11 +2960,11 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
* We could check forkNum and blockNum as well as the rnode, but the
* incremental win from doing so seems small.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ if (!RelFileNodeBackendEquals(bufHdr->tag.rnode, rnode))
continue;
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ if (RelFileNodeBackendEquals(bufHdr->tag.rnode, rnode) &&
bufHdr->tag.forkNum == forkNum &&
bufHdr->tag.blockNum >= firstDelBlock)
InvalidateBuffer(bufHdr); /* releases spinlock */
@@ -2985,24 +2987,24 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
{
int i,
n = 0;
- RelFileNode *nodes;
+ RelFileNodeBackend *nodes;
bool use_bsearch;
if (nnodes == 0)
return;
- nodes = palloc(sizeof(RelFileNode) * nnodes); /* non-local relations */
+ nodes = palloc(sizeof(RelFileNodeBackend) * nnodes); /* non-local relations */
/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nnodes; i++)
{
- if (RelFileNodeBackendIsTemp(rnodes[i]))
+ if (RelFileNodeBackendIsLocalTemp(rnodes[i]))
{
if (rnodes[i].backend == MyBackendId)
DropRelFileNodeAllLocalBuffers(rnodes[i].node);
}
else
- nodes[n++] = rnodes[i].node;
+ nodes[n++] = rnodes[i];
}
/*
@@ -3025,11 +3027,11 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
/* sort the list of rnodes if necessary */
if (use_bsearch)
- pg_qsort(nodes, n, sizeof(RelFileNode), rnode_comparator);
+ pg_qsort(nodes, n, sizeof(RelFileNodeBackend), rnode_comparator);
for (i = 0; i < NBuffers; i++)
{
- RelFileNode *rnode = NULL;
+ RelFileNodeBackend *rnode = NULL;
BufferDesc *bufHdr = GetBufferDescriptor(i);
uint32 buf_state;
@@ -3044,7 +3046,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
for (j = 0; j < n; j++)
{
- if (RelFileNodeEquals(bufHdr->tag.rnode, nodes[j]))
+ if (RelFileNodeBackendEquals(bufHdr->tag.rnode, nodes[j]))
{
rnode = &nodes[j];
break;
@@ -3054,7 +3056,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
else
{
rnode = bsearch((const void *) &(bufHdr->tag.rnode),
- nodes, n, sizeof(RelFileNode),
+ nodes, n, sizeof(RelFileNodeBackend),
rnode_comparator);
}
@@ -3063,7 +3065,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
continue;
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, (*rnode)))
+ if (RelFileNodeBackendEquals(bufHdr->tag.rnode, (*rnode)))
InvalidateBuffer(bufHdr); /* releases spinlock */
else
UnlockBufHdr(bufHdr, buf_state);
@@ -3102,11 +3104,11 @@ DropDatabaseBuffers(Oid dbid)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (bufHdr->tag.rnode.dbNode != dbid)
+ if (bufHdr->tag.rnode.node.dbNode != dbid)
continue;
buf_state = LockBufHdr(bufHdr);
- if (bufHdr->tag.rnode.dbNode == dbid)
+ if (bufHdr->tag.rnode.node.dbNode == dbid)
InvalidateBuffer(bufHdr); /* releases spinlock */
else
UnlockBufHdr(bufHdr, buf_state);
@@ -3136,7 +3138,7 @@ PrintBufferDescs(void)
"[%02d] (freeNext=%d, rel=%s, "
"blockNum=%u, flags=0x%x, refcount=%u %d)",
i, buf->freeNext,
- relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum),
+ relpath(buf->tag.rnode, buf->tag.forkNum),
buf->tag.blockNum, buf->flags,
buf->refcount, GetPrivateRefCount(b));
}
@@ -3204,7 +3206,8 @@ FlushRelationBuffers(Relation rel)
uint32 buf_state;
bufHdr = GetLocalBufferDescriptor(i);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode.node, rel->rd_node) &&
+ bufHdr->tag.rnode.backend == rel->rd_backend &&
((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
(BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
@@ -3251,13 +3254,15 @@ FlushRelationBuffers(Relation rel)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+ if (!RelFileNodeEquals(bufHdr->tag.rnode.node, rel->rd_node)
+ || bufHdr->tag.rnode.backend != rel->rd_backend)
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+ if (RelFileNodeEquals(bufHdr->tag.rnode.node, rel->rd_node) &&
+ bufHdr->tag.rnode.backend == rel->rd_backend &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
@@ -3305,13 +3310,13 @@ FlushDatabaseBuffers(Oid dbid)
* As in DropRelFileNodeBuffers, an unlocked precheck should be safe
* and saves some cycles.
*/
- if (bufHdr->tag.rnode.dbNode != dbid)
+ if (bufHdr->tag.rnode.node.dbNode != dbid)
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (bufHdr->tag.rnode.dbNode == dbid &&
+ if (bufHdr->tag.rnode.node.dbNode == dbid &&
(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
{
PinBuffer_Locked(bufHdr);
@@ -4051,7 +4056,7 @@ AbortBufferIO(void)
/* Buffer is pinned, so we can read tag without spinlock */
char *path;
- path = relpathperm(buf->tag.rnode, buf->tag.forkNum);
+ path = relpath(buf->tag.rnode, buf->tag.forkNum);
ereport(WARNING,
(errcode(ERRCODE_IO_ERROR),
errmsg("could not write block %u of %s",
@@ -4075,7 +4080,7 @@ shared_buffer_write_error_callback(void *arg)
/* Buffer is pinned, so we can read the tag without locking the spinlock */
if (bufHdr != NULL)
{
- char *path = relpathperm(bufHdr->tag.rnode, bufHdr->tag.forkNum);
+ char *path = relpath(bufHdr->tag.rnode, bufHdr->tag.forkNum);
errcontext("writing block %u of relation %s",
bufHdr->tag.blockNum, path);
@@ -4093,7 +4098,7 @@ local_buffer_write_error_callback(void *arg)
if (bufHdr != NULL)
{
- char *path = relpathbackend(bufHdr->tag.rnode, MyBackendId,
+ char *path = relpathbackend(bufHdr->tag.rnode.node, MyBackendId,
bufHdr->tag.forkNum);
errcontext("writing block %u of relation %s",
@@ -4108,22 +4113,27 @@ local_buffer_write_error_callback(void *arg)
static int
rnode_comparator(const void *p1, const void *p2)
{
- RelFileNode n1 = *(const RelFileNode *) p1;
- RelFileNode n2 = *(const RelFileNode *) p2;
+ RelFileNodeBackend n1 = *(const RelFileNodeBackend *) p1;
+ RelFileNodeBackend n2 = *(const RelFileNodeBackend *) p2;
- if (n1.relNode < n2.relNode)
+ if (n1.node.relNode < n2.node.relNode)
return -1;
- else if (n1.relNode > n2.relNode)
+ else if (n1.node.relNode > n2.node.relNode)
return 1;
- if (n1.dbNode < n2.dbNode)
+ if (n1.node.dbNode < n2.node.dbNode)
return -1;
- else if (n1.dbNode > n2.dbNode)
+ else if (n1.node.dbNode > n2.node.dbNode)
return 1;
- if (n1.spcNode < n2.spcNode)
+ if (n1.node.spcNode < n2.node.spcNode)
return -1;
- else if (n1.spcNode > n2.spcNode)
+ else if (n1.node.spcNode > n2.node.spcNode)
+ return 1;
+
+ if (n1.backend < n2.backend)
+ return -1;
+ else if (n1.backend > n2.backend)
return 1;
else
return 0;
@@ -4359,7 +4369,7 @@ IssuePendingWritebacks(WritebackContext *context)
next = &context->pending_writebacks[i + ahead + 1];
/* different file, stop */
- if (!RelFileNodeEquals(cur->tag.rnode, next->tag.rnode) ||
+ if (!RelFileNodeBackendEquals(cur->tag.rnode, next->tag.rnode) ||
cur->tag.forkNum != next->tag.forkNum)
break;
@@ -4378,7 +4388,7 @@ IssuePendingWritebacks(WritebackContext *context)
i += ahead;
/* and finally tell the kernel to write the data to storage */
- reln = smgropen(tag.rnode, InvalidBackendId);
+ reln = smgropen(tag.rnode.node, tag.rnode.backend);
smgrwriteback(reln, tag.forkNum, tag.blockNum, nblocks);
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index f5f6a29..6bd5ecb 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -68,7 +68,7 @@ LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
- INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
+ INIT_BUFFERTAG(newTag, smgr->smgr_rnode, forkNum, blockNum);
/* Initialize local buffers if first request in this session */
if (LocalBufHash == NULL)
@@ -111,7 +111,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
bool found;
uint32 buf_state;
- INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
+ INIT_BUFFERTAG(newTag, smgr->smgr_rnode, forkNum, blockNum);
/* Initialize local buffers if first request in this session */
if (LocalBufHash == NULL)
@@ -209,7 +209,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
Page localpage = (char *) LocalBufHdrGetBlock(bufHdr);
/* Find smgr relation for buffer */
- oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
+ oreln = smgropen(bufHdr->tag.rnode.node, MyBackendId);
PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
@@ -331,14 +331,14 @@ DropRelFileNodeLocalBuffers(RelFileNode rnode, ForkNumber forkNum,
buf_state = pg_atomic_read_u32(&bufHdr->state);
if ((buf_state & BM_TAG_VALID) &&
- RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ RelFileNodeEquals(bufHdr->tag.rnode.node, rnode) &&
bufHdr->tag.forkNum == forkNum &&
bufHdr->tag.blockNum >= firstDelBlock)
{
if (LocalRefCount[i] != 0)
elog(ERROR, "block %u of %s is still referenced (local %u)",
bufHdr->tag.blockNum,
- relpathbackend(bufHdr->tag.rnode, MyBackendId,
+ relpathbackend(bufHdr->tag.rnode.node, MyBackendId,
bufHdr->tag.forkNum),
LocalRefCount[i]);
/* Remove entry from hashtable */
@@ -377,12 +377,12 @@ DropRelFileNodeAllLocalBuffers(RelFileNode rnode)
buf_state = pg_atomic_read_u32(&bufHdr->state);
if ((buf_state & BM_TAG_VALID) &&
- RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+ RelFileNodeEquals(bufHdr->tag.rnode.node, rnode))
{
if (LocalRefCount[i] != 0)
elog(ERROR, "block %u of %s is still referenced (local %u)",
bufHdr->tag.blockNum,
- relpathbackend(bufHdr->tag.rnode, MyBackendId,
+ relpathbackend(bufHdr->tag.rnode.node, MyBackendId,
bufHdr->tag.forkNum),
LocalRefCount[i]);
/* Remove entry from hashtable */
diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
index cf7f03f..65eb422 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -268,13 +268,13 @@ restart:
*
* Fix the corruption and restart.
*/
- RelFileNode rnode;
+ RelFileNodeBackend rnode;
ForkNumber forknum;
BlockNumber blknum;
BufferGetTag(buf, &rnode, &forknum, &blknum);
elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u",
- blknum, rnode.spcNode, rnode.dbNode, rnode.relNode);
+ blknum, rnode.node.spcNode, rnode.node.dbNode, rnode.node.relNode);
/* make sure we hold an exclusive lock */
if (!exclusive_lock_held)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93..204c4cb 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -33,6 +33,7 @@
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/ipc.h"
#include "storage/md.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
@@ -87,6 +88,18 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/*
+ * Structure used to collect information created by this backend.
+ * Data of this related should be deleted on backend exit.
+ */
+typedef struct SessionRelation
+{
+ RelFileNodeBackend rnode;
+ struct SessionRelation* next;
+} SessionRelation;
+
+
+static SessionRelation* SessionRelations;
/* Populate a file tag describing an md.c segment file. */
#define INIT_MD_FILETAG(a,xx_rnode,xx_forknum,xx_segno) \
@@ -152,6 +165,48 @@ mdinit(void)
ALLOCSET_DEFAULT_SIZES);
}
+
+/*
+ * Delete all data of session relations and remove their pages from shared buffers.
+ * This function is called on backend exit.
+ */
+static void
+TruncateSessionRelations(int code, Datum arg)
+{
+ SessionRelation* rel;
+ for (rel = SessionRelations; rel != NULL; rel = rel->next)
+ {
+ /* Remove relation pages from shared buffers */
+ DropRelFileNodesAllBuffers(&rel->rnode, 1);
+
+ /* Delete relation files */
+ mdunlink(rel->rnode, InvalidForkNumber, false);
+ }
+}
+
+/*
+ * Maintain information about session relations accessed by this backend.
+ * This list is needed to perform cleanup on backend exit.
+ * Session relation is linked in this list when this relation is created or opened and file doesn't exist.
+ * Such procedure guarantee that each relation is linked into list only once.
+ */
+static void
+RegisterSessionRelation(SMgrRelation reln)
+{
+ SessionRelation* rel = (SessionRelation*)MemoryContextAlloc(TopMemoryContext, sizeof(SessionRelation));
+
+ /*
+ * Perform session relation cleanup on backend exit. We are using shared memory hook, because
+ * cleanup should be performed before backend is disconnected from shared memory.
+ */
+ if (SessionRelations == NULL)
+ on_shmem_exit(TruncateSessionRelations, 0);
+
+ rel->rnode = reln->smgr_rnode;
+ rel->next = SessionRelations;
+ SessionRelations = rel;
+}
+
/*
* mdexists() -- Does the physical file exist?
*
@@ -218,6 +273,8 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
errmsg("could not create file \"%s\": %m", path)));
}
}
+ if (RelFileNodeBackendIsGlobalTemp(reln->smgr_rnode))
+ RegisterSessionRelation(reln);
pfree(path);
@@ -465,6 +522,19 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
if (fd < 0)
{
+ /*
+ * In case of session relation access, there may be no yet files of this relation for this backend.
+ * If so, then create file and register session relation for truncation on backend exit.
+ */
+ if (RelFileNodeBackendIsGlobalTemp(reln->smgr_rnode))
+ {
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CREAT);
+ if (fd >= 0)
+ {
+ RegisterSessionRelation(reln);
+ goto NewSegment;
+ }
+ }
if ((behavior & EXTENSION_RETURN_NULL) &&
FILE_POSSIBLY_DELETED(errno))
{
@@ -476,6 +546,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
errmsg("could not open file \"%s\": %m", path)));
}
+ NewSegment:
pfree(path);
_fdvec_resize(reln, forknum, 1);
@@ -652,8 +723,13 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
* complaining. This allows, for example, the case of trying to
* update a block that was later truncated away.
*/
- if (zero_damaged_pages || InRecovery)
+ if (zero_damaged_pages || InRecovery || RelFileNodeBackendIsGlobalTemp(reln->smgr_rnode))
+ {
MemSet(buffer, 0, BLCKSZ);
+ /* In case of session relation we need to write zero page to provide correct result of subsequent mdnblocks */
+ if (RelFileNodeBackendIsGlobalTemp(reln->smgr_rnode))
+ mdwrite(reln, forknum, blocknum, buffer, true);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
@@ -738,12 +814,18 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
BlockNumber
mdnblocks(SMgrRelation reln, ForkNumber forknum)
{
- MdfdVec *v = mdopenfork(reln, forknum, EXTENSION_FAIL);
+ /*
+ * If we access session relation, there may be no files yet of this relation for this backend.
+ * Pass EXTENSION_RETURN_NULL to make mdopen return NULL in this case instead of reporting error.
+ */
+ MdfdVec *v = mdopenfork(reln, forknum, RelFileNodeBackendIsGlobalTemp(reln->smgr_rnode)
+ ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
BlockNumber nblocks;
BlockNumber segno = 0;
/* mdopen has opened the first segment */
- Assert(reln->md_num_open_segs[forknum] > 0);
+ if (reln->md_num_open_segs[forknum] == 0)
+ return 0;
/*
* Start from the last open segments, to avoid redundant seeks. We have
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index a87e721..2401361 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -994,6 +994,9 @@ pg_relation_filepath(PG_FUNCTION_ARGS)
/* Determine owning backend. */
switch (relform->relpersistence)
{
+ case RELPERSISTENCE_SESSION:
+ backend = BackendIdForSessionRelations();
+ break;
case RELPERSISTENCE_UNLOGGED:
case RELPERSISTENCE_PERMANENT:
backend = InvalidBackendId;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2488607..86e8fca 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1098,6 +1098,10 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
+ case RELPERSISTENCE_SESSION:
+ relation->rd_backend = BackendIdForSessionRelations();
+ relation->rd_islocaltemp = false;
+ break;
case RELPERSISTENCE_UNLOGGED:
case RELPERSISTENCE_PERMANENT:
relation->rd_backend = InvalidBackendId;
@@ -3301,6 +3305,10 @@ RelationBuildLocalRelation(const char *relname,
rel->rd_rel->relpersistence = relpersistence;
switch (relpersistence)
{
+ case RELPERSISTENCE_SESSION:
+ rel->rd_backend = BackendIdForSessionRelations();
+ rel->rd_islocaltemp = false;
+ break;
case RELPERSISTENCE_UNLOGGED:
case RELPERSISTENCE_PERMANENT:
rel->rd_backend = InvalidBackendId;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 0cc9ede..1dff0c8 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15593,8 +15593,8 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
tbinfo->dobj.catId.oid, false);
appendPQExpBuffer(q, "CREATE %s%s %s",
- tbinfo->relpersistence == RELPERSISTENCE_UNLOGGED ?
- "UNLOGGED " : "",
+ tbinfo->relpersistence == RELPERSISTENCE_UNLOGGED ? "UNLOGGED "
+ : tbinfo->relpersistence == RELPERSISTENCE_SESSION ? "SESSION " : "",
reltypename,
qualrelname);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 62b9553..cef99d2 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -166,7 +166,18 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
}
else
{
- if (forkNumber != MAIN_FORKNUM)
+ /*
+ * Session relations are distinguished from local temp relations by adding
+ * SessionRelFirstBackendId offset to backendId.
+ * These is no need to separate them at file system level, so just subtract SessionRelFirstBackendId
+ * to avoid too long file names.
+ * Segments of session relations have the same prefix (t%d_) as local temporary relations
+ * to make it possible to cleanup them in the same way as local temporary relation files.
+ */
+ if (backendId >= SessionRelFirstBackendId)
+ backendId -= SessionRelFirstBackendId;
+
+ if (forkNumber != MAIN_FORKNUM)
path = psprintf("base/%u/t%d_%u_%s",
dbNode, backendId, relNode,
forkNames[forkNumber]);
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index 090b6ba..6a39663 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -165,6 +165,7 @@ typedef FormData_pg_class *Form_pg_class;
#define RELPERSISTENCE_PERMANENT 'p' /* regular table */
#define RELPERSISTENCE_UNLOGGED 'u' /* unlogged permanent table */
#define RELPERSISTENCE_TEMP 't' /* temporary table */
+#define RELPERSISTENCE_SESSION 's' /* session table */
/* default selection for replica identity (primary key or nothing) */
#define REPLICA_IDENTITY_DEFAULT 'd'
diff --git a/src/include/storage/backendid.h b/src/include/storage/backendid.h
index 70ef8eb..f226e7c 100644
--- a/src/include/storage/backendid.h
+++ b/src/include/storage/backendid.h
@@ -22,6 +22,13 @@ typedef int BackendId; /* unique currently active backend identifier */
#define InvalidBackendId (-1)
+/*
+ * We need to distinguish local and global temporary relations by RelFileNodeBackend.
+ * The least invasive change is to add some special bias value to backend id (since
+ * maximal number of backed is limited by MaxBackends).
+ */
+#define SessionRelFirstBackendId (0x40000000)
+
extern PGDLLIMPORT BackendId MyBackendId; /* backend id of this backend */
/* backend id of our parallel session leader, or InvalidBackendId if none */
@@ -34,4 +41,10 @@ extern PGDLLIMPORT BackendId ParallelMasterBackendId;
#define BackendIdForTempRelations() \
(ParallelMasterBackendId == InvalidBackendId ? MyBackendId : ParallelMasterBackendId)
+
+#define BackendIdForSessionRelations() \
+ (BackendIdForTempRelations() + SessionRelFirstBackendId)
+
+#define IsSessionRelationBackendId(id) ((id) >= SessionRelFirstBackendId)
+
#endif /* BACKENDID_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index df2dda7..7adb96b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -90,16 +90,17 @@
*/
typedef struct buftag
{
- RelFileNode rnode; /* physical relation identifier */
+ RelFileNodeBackend rnode; /* physical relation identifier */
ForkNumber forkNum;
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;
#define CLEAR_BUFFERTAG(a) \
( \
- (a).rnode.spcNode = InvalidOid, \
- (a).rnode.dbNode = InvalidOid, \
- (a).rnode.relNode = InvalidOid, \
+ (a).rnode.node.spcNode = InvalidOid, \
+ (a).rnode.node.dbNode = InvalidOid, \
+ (a).rnode.node.relNode = InvalidOid, \
+ (a).rnode.backend = InvalidBackendId, \
(a).forkNum = InvalidForkNumber, \
(a).blockNum = InvalidBlockNumber \
)
@@ -113,7 +114,7 @@ typedef struct buftag
#define BUFFERTAGS_EQUAL(a,b) \
( \
- RelFileNodeEquals((a).rnode, (b).rnode) && \
+ RelFileNodeBackendEquals((a).rnode, (b).rnode) && \
(a).blockNum == (b).blockNum && \
(a).forkNum == (b).forkNum \
)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7..3315fa0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,7 +205,7 @@ extern XLogRecPtr BufferGetLSNAtomic(Buffer buffer);
extern void PrintPinnedBufs(void);
#endif
extern Size BufferShmemSize(void);
-extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
+extern void BufferGetTag(Buffer buffer, RelFileNodeBackend *rnode,
ForkNumber *forknum, BlockNumber *blknum);
extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
diff --git a/src/include/storage/relfilenode.h b/src/include/storage/relfilenode.h
index 586500a..20aec72 100644
--- a/src/include/storage/relfilenode.h
+++ b/src/include/storage/relfilenode.h
@@ -75,10 +75,25 @@ typedef struct RelFileNodeBackend
BackendId backend;
} RelFileNodeBackend;
+/*
+ * Check whether it is local or global temporary relation, which data belongs only to one backend.
+ */
#define RelFileNodeBackendIsTemp(rnode) \
((rnode).backend != InvalidBackendId)
/*
+ * Check whether it is global temporary relation which metadata is shared by all sessions,
+ * but data is private for the current session.
+ */
+#define RelFileNodeBackendIsGlobalTemp(rnode) IsSessionRelationBackendId((rnode).backend)
+
+/*
+ * Check whether it is local temporary relation which exists only in this backend.
+ */
+#define RelFileNodeBackendIsLocalTemp(rnode) \
+ (RelFileNodeBackendIsTemp(rnode) && !RelFileNodeBackendIsGlobalTemp(rnode))
+
+/*
* Note: RelFileNodeEquals and RelFileNodeBackendEquals compare relNode first
* since that is most likely to be different in two unequal RelFileNodes. It
* is probably redundant to compare spcNode if the other fields are found equal,
diff --git a/src/test/regress/expected/session_table.out b/src/test/regress/expected/session_table.out
new file mode 100644
index 0000000..1b9b3f4
--- /dev/null
+++ b/src/test/regress/expected/session_table.out
@@ -0,0 +1,64 @@
+create session table my_private_table(x integer primary key, y integer);
+insert into my_private_table values (generate_series(1,10000), generate_series(1,10000));
+select count(*) from my_private_table;
+ count
+-------
+ 10000
+(1 row)
+
+\c
+select count(*) from my_private_table;
+ count
+-------
+ 0
+(1 row)
+
+select * from my_private_table where x=10001;
+ x | y
+---+---
+(0 rows)
+
+insert into my_private_table values (generate_series(1,100000), generate_series(1,100000));
+create index on my_private_table(y);
+select * from my_private_table where x=10001;
+ x | y
+-------+-------
+ 10001 | 10001
+(1 row)
+
+select * from my_private_table where y=10001;
+ x | y
+-------+-------
+ 10001 | 10001
+(1 row)
+
+select count(*) from my_private_table;
+ count
+--------
+ 100000
+(1 row)
+
+\c
+select * from my_private_table where x=100001;
+ x | y
+---+---
+(0 rows)
+
+select * from my_private_table order by y desc limit 1;
+ x | y
+---+---
+(0 rows)
+
+insert into my_private_table values (generate_series(1,100000), generate_series(1,100000));
+select * from my_private_table where x=100001;
+ x | y
+---+---
+(0 rows)
+
+select * from my_private_table order by y desc limit 1;
+ x | y
+--------+--------
+ 100000 | 100000
+(1 row)
+
+drop table my_private_table;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fc0f141..3a6c78d 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -107,7 +107,7 @@ test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath
# NB: temp.sql does a reconnect which transiently uses 2 connections,
# so keep this parallel group to at most 19 tests
# ----------
-test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml
+test: plancache limit plpgsql copy2 temp session_table domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml
# ----------
# Another group of parallel tests
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 68ac56a..b7fcc99 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -172,6 +172,7 @@ test: limit
test: plpgsql
test: copy2
test: temp
+test: session_table
test: domain
test: rangefuncs
test: prepare
diff --git a/src/test/regress/sql/session_table.sql b/src/test/regress/sql/session_table.sql
new file mode 100644
index 0000000..c6663dc
--- /dev/null
+++ b/src/test/regress/sql/session_table.sql
@@ -0,0 +1,18 @@
+create session table my_private_table(x integer primary key, y integer);
+insert into my_private_table values (generate_series(1,10000), generate_series(1,10000));
+select count(*) from my_private_table;
+\c
+select count(*) from my_private_table;
+select * from my_private_table where x=10001;
+insert into my_private_table values (generate_series(1,100000), generate_series(1,100000));
+create index on my_private_table(y);
+select * from my_private_table where x=10001;
+select * from my_private_table where y=10001;
+select count(*) from my_private_table;
+\c
+select * from my_private_table where x=100001;
+select * from my_private_table order by y desc limit 1;
+insert into my_private_table values (generate_series(1,100000), generate_series(1,100000));
+select * from my_private_table where x=100001;
+select * from my_private_table order by y desc limit 1;
+drop table my_private_table;