Suggest two small improvements for PITR.

2024-01-11 Thread Yura Sokolov

Good day, hackers.

Here I am to suggest two small improvements to Point In Time Recovery.

The first is the ability to recover to recovery_target_time using the timestamp 
stored in XLOG_RESTORE_POINT. It looks like this ability historically existed 
and was removed unintentionally during refactoring in commit [1]

c945af80 "Refactor checking whether we've reached the recovery target."

The second is extending the XLOG_BACKUP_END record with a timestamp, so that 
the backup has its own timestamp as well. It is a backward-compatible 
change, since there was no record length check before.


Both changes help slightly on mostly idle systems, where no commits at all 
may happen between several backups, so there is no timestamp to 
recover to.
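
For reference, a minimal sketch of the extended record payload the attached 
patch implies (the xlog_internal.h hunk is not shown in the diff below, so the 
exact struct layout is an assumption, though the field names match the code):

/* hypothetical sketch of the extended XLOG_BACKUP_END payload */
typedef struct xl_backup_end
{
	XLogRecPtr	startpoint;	/* backup start LSN, as recorded today */
	TimestampTz	end_time;	/* new: timestamp taken in do_pg_backup_stop() */
} xl_backup_end;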


Attached sample patches are made in reverse order:
- XLOG_BACKUP_END first, then XLOG_RESTORE_POINT.
The second patch was made by a colleague based on my idea.
Publishing of both is permitted.

If the idea is accepted, patches for tests will be provided as well.

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=c945af80

---

Yura Sokolov.

From 173cfc3762a97c300b618f863fd23433909cdb81 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Wed, 3 May 2023 18:48:46 +0300
Subject: [PATCH] PGPRO-8083: record timestamp in XLOG_BACKUP_END for
 recovery_target_time

It is useful for pg_probackup to recover on backup end.

Tags: pg_probackup
---
 src/backend/access/rmgrdesc/xlogdesc.c| 15 +++--
 src/backend/access/transam/xlog.c |  6 ++--
 src/backend/access/transam/xlogrecovery.c | 39 +++
 src/include/access/xlog_internal.h|  7 
 4 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 3fd7185f217..e1114168239 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -86,10 +86,19 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	}
 	else if (info == XLOG_BACKUP_END)
 	{
-		XLogRecPtr	startpoint;
+		xl_backup_end	xlrec = {0, 0};
+		size_t			rec_len = XLogRecGetDataLen(record);
 
-		memcpy(&startpoint, rec, sizeof(XLogRecPtr));
-		appendStringInfo(buf, "%X/%X", LSN_FORMAT_ARGS(startpoint));
+		if (rec_len == sizeof(XLogRecPtr))
+			memcpy(&xlrec, rec, sizeof(XLogRecPtr));
+		else if (rec_len >= sizeof(xl_backup_end))
+			memcpy(&xlrec, rec, sizeof(xl_backup_end));
+
+		appendStringInfo(buf, "%X/%X", LSN_FORMAT_ARGS(xlrec.startpoint));
+
+		if (rec_len >= sizeof(xl_backup_end))
+			appendStringInfo(buf, " at %s",
+			 timestamptz_to_str(xlrec.end_time));
 	}
 	else if (info == XLOG_PARAMETER_CHANGE)
 	{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5ebb9783e2f..42cd06cd7d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8680,14 +8680,16 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	}
 	else
 	{
+		xl_backup_end	xlrec;
 		char	   *history_file;
 
 		/*
 		 * Write the backup-end xlog record
 		 */
+		xlrec.startpoint = state->startpoint;
+		xlrec.end_time = GetCurrentTimestamp();
 		XLogBeginInsert();
-		XLogRegisterData((char *) (&state->startpoint),
-		 sizeof(state->startpoint));
+		XLogRegisterData((char *) (&xlrec), sizeof(xlrec));
 		state->stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);
 
 		/*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 11747a6ff13..42d5b59ac25 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2312,6 +2312,13 @@ getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
 		*recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
 		return true;
 	}
+	if (rmid == RM_XLOG_ID && info == XLOG_BACKUP_END)
+	{
+		if (XLogRecGetDataLen(record) < sizeof(xl_backup_end))
+			return false;
+		*recordXtime = ((xl_backup_end *) XLogRecGetData(record))->end_time;
+		return true;
+	}
 	if (rmid == RM_XACT_ID && (xact_info == XLOG_XACT_COMMIT ||
 			   xact_info == XLOG_XACT_COMMIT_PREPARED))
 	{
@@ -2640,6 +2647,38 @@ recoveryStopsAfter(XLogReaderState *record)
 		}
 	}
 
+	if (recoveryTarget == RECOVERY_TARGET_TIME &&
+		rmid == RM_XLOG_ID && info == XLOG_BACKUP_END)
+	{
+		bool	stopsHere = false;
+
+		if (getRecordTimestamp(record, &recordXtime))
+		{
+			/*
+			 * Use same conditions as in recoveryStopsBefore for transaction
+			 * records to not override transactions time handling.
+			 */
+			if (recoveryTargetInclusive)
+				stopsHere = recordXtime > recoveryTargetTime;
+			else
+				stopsHere = recordXtime >= recoveryTargetTime;
+		}
+
+		if (stopsHere)
+		{
+			recoveryStopAfter = true;
+			recoveryStopXid = InvalidTransactionId;
+			recoveryStopLSN = InvalidXLogRecPtr;
+			recoveryStopTime = recordXtime;
+			recoveryStopName[0] = '\0';
+
+			ereport(LOG,
+	(errmsg("recovery stopping at b

Re: Vectorization of some functions and improving pg_list interface

2023-09-06 Thread Yura Sokolov

06.09.2023 13:24, Yura Sokolov wrote:

24.08.2023 17:07, Maxim Orlov wrote:

Hi!

Recently, I've been playing around with pg_lists and realized how 
annoying (maybe I was a bit tired) some stuff related to lists is.

For example, see this code:
List *l1 = list_make4(1, 2, 3, 4),
  *l2 = list_make4(5, 6, 7, 8),
  *l3 = list_make4(9, 0, 1, 2);
ListCell *lc1, *lc2, *lc3;

forthree(lc1, l1, lc2, l2, lc3, l3) {
...
}

list_free(l1);
list_free(l2);
list_free(l3);

There are several questions:
1) Why do I need to specify the number of elements in the list in the 
function name?

The compiler already knows how many arguments I use.
2) Why do I have to call free for every list? I don't know what to call it 
properly; for now I call it vectorization.

     Why not use a simple wrapper to "vectorize" function args?

So, my proposal is:
1) Add a simple macro to "vectorize" functions.
2) Use this macro to "vectorize" list_free and list_free_deep functions.
3) Use this macro to "vectorize" bms_free function.
4) "Vectorize" list_makeN functions.

For this V1 version, I do not remove all list_makeN calls in order to 
reduce the diff, but I'll address this in the future, if needed.

In my view, one thing still waiting to be improved is the foreach loop. 
It is not very handy to have a bunch of similar calls: foreach, forboth, 
forthree, etc. It would be ideal to have a single foreach interface, but 
I don't know how to do it without overhauling the loop interface.

Any opinions are very welcome!


Given that the use case doesn't assume "zero" arguments, it is possible to 
implement "lists_free" with just macro expansion (the following code is not 
checked, but close to valid):


#define VA_FOR_EACH(invoke, join, ...) \
 CppConcat(VA_FOR_EACH_, VA_ARGS_NARGS(__VA_ARGS__))( \
     invoke, join, __VA_ARGS__)
#define VA_FOR_EACH_1(invoke, join, a1) \
 invoke(a1)
#define VA_FOR_EACH_2(invoke, join, a1, a2) \
 invoke(a1) join() invoke(a2)
#define VA_FOR_EACH_3(invoke, join, a1, a2, a3) \
 invoke(a1) join() invoke(a2) join() invoke(a3)
... up to 63 args

#define VA_SEMICOLON() ;

#define lists_free(...) \
 VA_FOR_EACH(list_free, VA_SEMICOLON, __VA_ARGS__)

#define lists_free_deep(...) \
 VA_FOR_EACH(list_free_deep, VA_SEMICOLON, __VA_ARGS__)

There could be a couple of issues with MSVC, but they are solvable.
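
To show the trick works outside PostgreSQL too, here is a small self-contained 
C sketch of the same VA_FOR_EACH idea (the tiny argument counter and all names 
here are illustrative, not the pg_list API):

#include <stdio.h>
#include <stdlib.h>

/* count up to 3 arguments; the trailing 0 keeps the variadic part non-empty */
#define VA_NARGS(...) VA_NARGS_(__VA_ARGS__, 3, 2, 1, 0)
#define VA_NARGS_(a1, a2, a3, n, ...) n
#define CONCAT(a, b) CONCAT_(a, b)
#define CONCAT_(a, b) a##b

#define VA_FOR_EACH(invoke, join, ...) \
	CONCAT(VA_FOR_EACH_, VA_NARGS(__VA_ARGS__))(invoke, join, __VA_ARGS__)
#define VA_FOR_EACH_1(invoke, join, a1) invoke(a1)
#define VA_FOR_EACH_2(invoke, join, a1, a2) invoke(a1) join() invoke(a2)
#define VA_FOR_EACH_3(invoke, join, a1, a2, a3) \
	invoke(a1) join() invoke(a2) join() invoke(a3)

#define VA_SEMICOLON() ;

/* "vectorized" free(): frees(a, b, c) expands to free(a); free(b); free(c) */
#define frees(...) VA_FOR_EACH(free, VA_SEMICOLON, __VA_ARGS__)

int
main(void)
{
	char *a = malloc(8), *b = malloc(8), *c = malloc(8);

	frees(a, b, c);
	printf("freed three pointers with one call\n");
	return 0;
}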


Given we can use C99 compound literals, list construction could be 
implemented without C varargs functions as well:


List *
list_make_impl(NodeTag t, int n, ListCell *datums)
{
	List	   *list = new_list(t, n);
	memcpy(list->elements, datums, sizeof(ListCell)*n);
	return list;
}

#define VA_COMMA() ,

#define list_make__m(Tag, type, ...) \
   list_make_impl(Tag, VA_ARGS_NARGS(__VA_ARGS__), \
 ((ListCell[]){ \
   VA_FOR_EACH(list_make_##type##_cell, VA_COMMA, __VA_ARGS__) \
 }))


#define list_make(...) list_make__m(T_List, ptr, __VA_ARGS__)
#define list_make_int(...) list_make__m(T_IntList, int, __VA_ARGS__)
#define list_make_oid(...) list_make__m(T_OidList, oid, __VA_ARGS__)
#define list_make_xid(...) list_make__m(T_XidList, xid, __VA_ARGS__)

(code is not checked)

If zero arguments (no arguments at all) should be supported, it is trickier 
because of MSVC, but solvable.





Re: Vectorization of some functions and improving pg_list interface

2023-09-06 Thread Yura Sokolov

24.08.2023 17:07, Maxim Orlov wrote:

Hi!

Recently, I've been playing around with pg_lists and realized how 
annoying (maybe I was a bit tired) some stuff related to lists is.

For example, see this code:
List *l1 = list_make4(1, 2, 3, 4),
  *l2 = list_make4(5, 6, 7, 8),
  *l3 = list_make4(9, 0, 1, 2);
ListCell *lc1, *lc2, *lc3;

forthree(lc1, l1, lc2, l2, lc3, l3) {
...
}

list_free(l1);
list_free(l2);
list_free(l3);

There are several questions:
1) Why do I need to specify the number of elements in the list in the 
function name?

The compiler already knows how many arguments I use.
2) Why do I have to call free for every list? I don't know what to call it 
properly; for now I call it vectorization.

     Why not use a simple wrapper to "vectorize" function args?

So, my proposal is:
1) Add a simple macro to "vectorize" functions.
2) Use this macro to "vectorize" list_free and list_free_deep functions.
3) Use this macro to "vectorize" bms_free function.
4) "Vectorize" list_makeN functions.

For this V1 version, I do not remove all list_makeN calls in order to 
reduce the diff, but I'll address this in the future, if needed.

In my view, one thing still waiting to be improved is the foreach loop. 
It is not very handy to have a bunch of similar calls: foreach, forboth, 
forthree, etc. It would be ideal to have a single foreach interface, but 
I don't know how to do it without overhauling the loop interface.

Any opinions are very welcome!


Given that the use case doesn't assume "zero" arguments, it is possible to 
implement "lists_free" with just macro expansion (the following code is not 
checked, but close to valid):


#define VA_FOR_EACH(invoke, join, ...) \
CppConcat(VA_FOR_EACH_, VA_ARGS_NARGS(__VA_ARGS__))( \
invoke, join, __VA_ARGS__)
#define VA_FOR_EACH_1(invoke, join, a1) \
invoke(a1)
#define VA_FOR_EACH_2(invoke, join, a1, a2) \
invoke(a1) join() invoke(a2)
#define VA_FOR_EACH_3(invoke, join, a1, a2, a3) \
invoke(a1) join() invoke(a2) join() invoke(a3)
... up to 63 args

#define VA_SEMICOLON() ;

#define lists_free(...) \
VA_FOR_EACH(list_free, VA_SEMICOLON, __VA_ARGS__)

#define lists_free_deep(...) \
VA_FOR_EACH(list_free_deep, VA_SEMICOLON, __VA_ARGS__)

There could be a couple of issues with MSVC, but they are solvable.

--

Regards,
Yura




Re: When IMMUTABLE is not.

2023-06-15 Thread Yura Sokolov



On 15.06.2023 17:49, Tom Lane wrote:

"David G. Johnston"  writes:

The failure to find and execute the function code itself is not a failure
mode that these markers need be concerned with.  Assuming one can execute
the function an immutable function will give the same answer for the same
input for all time.

The viewpoint taken in the docs I mentioned is that an IMMUTABLE
marker is a promise from the user to the system about the behavior
of a function.  While the system does provide a few simple tools
to catch obvious errors and to make it easier to write functions
that obey such promises, it's mostly on the user to get it right.

In particular, we've never enforced that an immutable function can't
call non-immutable functions.  While that would seem like a good idea
in the abstract, we've intentionally not tried to do it.  (I'm pretty
sure there is more than one round of previous discussions of the point
in the archives, although locating relevant threads seems hard.)
One reason not to is that polymorphic functions have to be marked
with worst-case volatility labels.  There are plenty of examples of
functions that are stable for some input types and immutable for
others (array_to_string, for instance); but the marking system can't
represent that so we have to label them stable.  Enforcing that a
user-defined immutable function can't use such a function might
just break things for no gain.


"Stable vs Immutable" is a much lesser problem compared to "ReadOnly vs 
Volatile".

Executing a genuinely read-only function more times than necessary (or fewer 
times) doesn't modify data in an unexpected way.

But executing an immutable/stable function that occasionally modifies data 
could lead to various unexpected effects, because the optimizer decided to 
call it more or fewer times than the query assumes.

Some vulnerabilities existed because user-defined functions used in index 
definitions started to modify data. If "read-only" execution were forced 
in index operations, those issues couldn't happen.

> it's mostly on the user to get it right.

That is a really bad premise. Users do strange things and aren't expected 
to be professionals who really understand the whole of PostgreSQL's internals.

And it is strange to hear this at the same time as we don't allow users to 
use query hints because "the optimizer does better" :-D

OK, I'll go and cool down. Certainly I'm missing some point.





Re: When IMMUTABLE is not.

2023-06-15 Thread Yura Sokolov



On 15.06.2023 16:58, c...@anastigmatix.net wrote:

On 2023-06-15 09:21, Tom Lane wrote:

Yura Sokolov  writes:

not enough to be sure function doesn't manipulate data.


Of course not.  It is the user's responsibility to mark functions
properly.


And also, isn't it the case that IMMUTABLE should mark a function,
not merely that "doesn't manipulate data", but whose return value
doesn't depend in any way on data (outside its own arguments)?

The practice among PLs of choosing an SPI readonly flag based on
the IMMUTABLE/STABLE/VOLATILE declaration seems to be a sort of
peculiar heuristic, not something inherent in what that declaration
means to the optimizer. (And also influences what snapshot the
function is looking at, and therefore what it can see, which has
also struck me more as a tacked-on effect than something inherent
in the declaration's meaning.)


Documentation disagrees:

https://www.postgresql.org/docs/current/sql-createfunction.html#:~:text=IMMUTABLE%0ASTABLE%0AVOLATILE

> IMMUTABLE indicates that the function cannot modify the database and 
always returns the same result when given the same argument values

> STABLE indicates that the function cannot modify the database, and 
that within a single table scan it will consistently return the same 
result for the same argument values, but that its result could change 
across SQL statements.

> VOLATILE indicates that the function value can change even within a 
single table scan, so no optimizations can be made... But note that any 
function that has side-effects must be classified volatile, even if its 
result is quite predictable, to prevent calls from being optimized away






Re: When IMMUTABLE is not.

2023-06-15 Thread Yura Sokolov

15.06.2023 16:21, Tom Lane wrote:

Yura Sokolov  writes:

I found, than declaration of function as IMMUTABLE/STABLE is not enough to be 
sure
function doesn't manipulate data.

Of course not.  It is the user's responsibility to mark functions
properly.  Trying to enforce that completely is a fool's errand


https://github.com/postgres/postgres/commit/b2c4071299e02ed96d48d3c8e776de2fab36f88c.patch

https://github.com/postgres/postgres/commit/cdf8b56d5463815244467ea8f5ec6e72b6c65a6c.patch





Re: When IMMUTABLE is not.

2023-06-15 Thread Yura Sokolov

Sorry, the previous message was smashed for some reason.

I'll try to repeat it.

I found that declaring a function as IMMUTABLE/STABLE is not enough 
to be sure the function doesn't manipulate data.

In fact, SPI checks only the direct function's kind, but fails to check 
indirect calls.


Attached immutable_not.sql creates 3 functions:

- `immutable_direct` is IMMUTABLE and tries to insert into table directly.
  PostgreSQL correctly detects and forbids this action.

- `volatile_direct` is VOLATILE and inserts into table directly.
  It is allowed and executed well.

- `immutable_indirect` is IMMUTABLE and calls `volatile_direct`.
  PostgreSQL failed to detect and prevent this DML manipulation.
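
For readers without the attachment: a hypothetical reconstruction of what 
immutable_not.sql presumably contains, pieced together from the descriptions 
above and the error contexts below (the exact file is not included here):

create table xxx(i varchar);

create function volatile_direct(j varchar) returns varchar
language plpgsql volatile as $$
begin
    insert into xxx values(j);   -- allowed: the function is VOLATILE
    return j;
end $$;

create function immutable_direct(j varchar) returns varchar
language plpgsql immutable as $$
begin
    insert into xxx values(j);   -- correctly rejected by the SPI read-only check
    return j;
end $$;

create function immutable_indirect(j varchar) returns varchar
language plpgsql immutable as $$
begin
    perform volatile_direct(j);  -- slips through: only the directly called
                                 -- function's volatility is checked
    return j;
end $$;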

Output:

select immutable_direct('immutable_direct');
psql:immutable_not.sql:28: ERROR:  INSERT is not allowed in a non-volatile function
CONTEXT:  SQL statement "insert into xxx values(j)"
PL/pgSQL function immutable_direct(character varying) line 3 at SQL statement


select volatile_direct('volatile_direct');
 volatile_direct
-----------------
 volatile_direct
(1 row)

select immutable_indirect('immutable_indirect');
 immutable_indirect
--------------------
 immutable_indirect
(1 row)

select * from xxx;
         i
--------------------
 volatile_direct
 immutable_indirect
(2 rows)

The attached forbid-non-volatile-mutations.diff adds a check that a read-only 
function didn't perform data manipulation.

Output for patched version:

select immutable_indirect('immutable_indirect');
psql:immutable_not.sql:32: ERROR:  Damn2! Update were done in a non-volatile function
CONTEXT:  SQL statement "SELECT volatile_direct(j)"
PL/pgSQL function immutable_indirect(character varying) line 3 at PERFORM

I doubt the check should be done this way. The check is necessary, but it
should be FATAL instead of ERROR. And the ERROR should be generated at the
same place where it is generated for `immutable_direct`, but with a check of
the "read_only" status through the whole call stack instead of just the
direct function's kind.

-

regards,
Yura Sokolov
Postgres Professional






When IMMUTABLE is not.

2023-06-15 Thread Yura Sokolov

Good day, hackers.

I found that declaring a function as IMMUTABLE/STABLE is not enough to be sure
the function doesn't manipulate data.

In fact, SPI checks only the direct function's kind, but fails to check indirect calls.

Attached immutable_not.sql creates 3 functions:

- `immutable_direct` is IMMUTABLE and tries to insert into table directly.
  PostgreSQL correctly detects and forbids this action.

- `volatile_direct` is VOLATILE and inserts into table directly.
  It is allowed and executed well.

- `immutable_indirect` is IMMUTABLE and calls `volatile_direct`.
  PostgreSQL failed to detect and prevent this DML manipulation.

Output:

select immutable_direct('immutable_direct');
psql:immutable_not.sql:28: ERROR:  INSERT is not allowed in a non-volatile function
CONTEXT:  SQL statement "insert into xxx values(j)"
PL/pgSQL function immutable_direct(character varying) line 3 at SQL statement

select volatile_direct('volatile_direct');
 volatile_direct
-----------------
 volatile_direct
(1 row)

select immutable_indirect('immutable_indirect');
 immutable_indirect
--------------------
 immutable_indirect
(1 row)

select * from xxx;
         i
--------------------
 volatile_direct
 immutable_indirect
(2 rows)

The attached forbid-non-volatile-mutations.diff adds a check that a read-only
function didn't perform data manipulation.

Output for the patched version:

select immutable_indirect('immutable_indirect');
psql:immutable_not.sql:32: ERROR:  Damn2! Update were done in a non-volatile function
CONTEXT:  SQL statement "SELECT volatile_direct(j)"
PL/pgSQL function immutable_indirect(character varying) line 3 at PERFORM

I doubt the check should be done this way. The check is necessary, but it
should be FATAL instead of ERROR. And the ERROR should be generated at the
same place where it is generated for `immutable_direct`, but with a check of
the "read_only" status through the whole call stack instead of just the
direct function's kind.

-

regards,
Yura Sokolov
Postgres Professional


immutable_not.sql
Description: application/sql
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 33975687b38..82b6127d650 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1584,6 +1584,7 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
 	Portal		portal;
 	SPICallbackArg spicallbackarg;
 	ErrorContextCallback spierrcontext;
+	CommandId 	this_command = InvalidCommandId;
 
 	/*
 	 * Check that the plan is something the Portal code will special-case as
@@ -1730,6 +1731,7 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
 	if (read_only)
 	{
 		ListCell   *lc;
+		this_command = GetCurrentCommandId(false);
 
 		foreach(lc, stmt_list)
 		{
@@ -1778,6 +1780,14 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
 	/* Pop the SPI stack */
 	_SPI_end_call(true);
 
+	if (read_only && this_command != GetCurrentCommandId(false))
+	{
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		/* translator: %s is a SQL statement name */
+		errmsg("Damn1! Update were done in a non-volatile function")));
+	}
+
 	/* Return the created portal */
 	return portal;
 }
@@ -2404,6 +2414,7 @@ _SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
 	ErrorContextCallback spierrcontext;
 	CachedPlan *cplan = NULL;
 	ListCell   *lc1;
+	CommandId   this_command = InvalidCommandId;
 
 	/*
 	 * Setup error traceback support for ereport()
@@ -2473,6 +2484,11 @@ _SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("empty query does not return tuples")));
 
+	if (options->read_only)
+	{
+		this_command = GetCurrentCommandId(false);
+	}
+
 	foreach(lc1, plan->plancache_list)
 	{
 		CachedPlanSource *plansource = (CachedPlanSource *) lfirst(lc1);
@@ -2788,6 +2804,14 @@ _SPI_execute_plan(SPIPlanPtr plan, const SPIExecuteOptions *options,
 			CommandCounterIncrement();
 	}
 
+	if (options->read_only && this_command != GetCurrentCommandId(false))
+	{
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+		/* translator: %s is a SQL statement name */
+		errmsg("Damn2! Update were done in a non-volatile function")));
+	}
+
 fail:
 
 	/* Pop the snapshot off the stack if we pushed one */


Re: Let's make PostgreSQL multi-threaded

2023-06-07 Thread Yura Sokolov

07.06.2023 15:53, Robert Haas wrote:

Right now, if you need a bit
of additional session-local state, you just declare a variable and
you're all set. That's not a perfect system and does cause some
problems, but we can't go from there to a system where it's impossible
to add session-local state without hacking core.



or else it needs to
be design in some kind of extensible way that doesn't require it to
know the full details of every sort of object that's being used as
session-local state anywhere in the system.

And it is quite possible, although with indirection involved.

For example, say we want to add a session variable "my_hello_var".
We first need to declare an "offset variable",
then register it in the session,
and then use a function and/or macro to get the actual address:

/* session.h */
extern size_t RegisterSessionVar(size_t size);
extern void* CurSessionVar(size_t offset);


/* session.c */
typedef struct Session {
	char *vars;
} Session;

static _Thread_local Session* curSession;
static size_t sessionVarsSize = 0;

size_t
RegisterSessionVar(size_t size)
{
	size_t off = sessionVarsSize;
	sessionVarsSize += size;
	return off;
}

void*
CurSessionVar(size_t offset)
{
	return curSession->vars + offset;
}

/* module_internal.h */

typedef int my_hello_var_t;
extern size_t my_hello_var_offset;

/* access macros */
#define my_hello_var \
	(*(my_hello_var_t*)(CurSessionVar(my_hello_var_offset)))

/* module.c */
size_t my_hello_var_offset = 0;

void
PG_init(void)
{
	my_hello_var_offset = RegisterSessionVar(sizeof(my_hello_var_t));
}

For security reasons, offset could be mangled.
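
Module code would then access the variable like an ordinary global (a sketch
under the same assumptions as above, i.e. that curSession has been set up for
the current thread):

void
bump_hello(void)
{
	/* expands to (*(my_hello_var_t *) CurSessionVar(my_hello_var_offset)) += 1 */
	my_hello_var += 1;
}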

--

regards,
Yura Sokolov





Re: Bug in jsonb_in function (14 & 15 version are affected)

2023-03-15 Thread Yura Sokolov
On Mon, 13/03/2023 at 13:58 -0400, Tom Lane wrote:
> Nikolay Shaplov  writes:
> > I found a bug in jsonb_in function (it converts json from sting 
> > representation
> >  into jsonb internal representation).
> 
> Yeah.  Looks like json_lex_string is failing to honor the invariant
> that it needs to set token_terminator ... although the documentation
> of the function certainly isn't helping.  I think we need the attached.
> 
> A nice side benefit is that the error context reports get a lot more
> useful --- somebody should have inquired before as to why they were
> so bogus.
> 
> regards, tom lane
> 

Good day, Tom and all.

The merged patch looks like the start of a refactoring.

My colleague (Nikita Glukhov) proposes further refactoring of jsonapi.c:
- use of inline functions instead of macros,
- more uniform usage of them in token success and error reporting,
- simplifying json_lex_number and its usage a bit.
He also added tests for the fixed bug.


-

Regards,
Yura Sokolov.
From 757a314d5fa9c6ba8334762b4a080763f02244c5 Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Tue, 14 Mar 2023 12:15:48 +0300
Subject: [PATCH] Refactor JSON lexer error reporting

Also add special tests for already fixed bug and for multibyte symbols after surrogates.
---
 src/common/jsonapi.c  | 246 +++---
 src/test/regress/expected/json_encoding.out   |  25 ++
 src/test/regress/expected/json_encoding_1.out |  23 ++
 src/test/regress/sql/json_encoding.sql|   7 +
 4 files changed, 149 insertions(+), 152 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 2e86589cfd8..d4a9d8f0378 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -44,8 +44,7 @@ typedef enum	/* contexts of JSON parser */
 } JsonParseContext;
 
 static inline JsonParseErrorType json_lex_string(JsonLexContext *lex);
-static inline JsonParseErrorType json_lex_number(JsonLexContext *lex, char *s,
- bool *num_err, int *total_len);
+static inline JsonParseErrorType json_lex_number(JsonLexContext *lex, char *s);
 static inline JsonParseErrorType parse_scalar(JsonLexContext *lex, JsonSemAction *sem);
 static JsonParseErrorType parse_object_field(JsonLexContext *lex, JsonSemAction *sem);
 static JsonParseErrorType parse_object(JsonLexContext *lex, JsonSemAction *sem);
@@ -104,8 +103,6 @@ lex_expect(JsonParseContext ctx, JsonLexContext *lex, JsonTokenType token)
 bool
 IsValidJsonNumber(const char *str, int len)
 {
-	bool		numeric_error;
-	int			total_len;
 	JsonLexContext dummy_lex;
 
 	if (len <= 0)
@@ -128,9 +125,8 @@ IsValidJsonNumber(const char *str, int len)
 		dummy_lex.input_length = len;
 	}
 
-	json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error, &total_len);
-
-	return (!numeric_error) && (total_len == dummy_lex.input_length);
+	return json_lex_number(&dummy_lex, dummy_lex.input) == JSON_SUCCESS &&
+		dummy_lex.token_terminator == dummy_lex.input + dummy_lex.input_length;
 }
 
 /*
@@ -545,6 +541,37 @@ parse_array(JsonLexContext *lex, JsonSemAction *sem)
 	return JSON_SUCCESS;
 }
 
+/* Convenience inline functions for success/error exits from lexer */
+static inline JsonParseErrorType
+json_lex_success(JsonLexContext *lex, char *token_terminator,
+ JsonTokenType token_type)
+{
+	lex->prev_token_terminator = lex->token_terminator;
+	lex->token_terminator = token_terminator;
+	lex->token_type = token_type;
+
+	return JSON_SUCCESS;
+}
+
+static inline JsonParseErrorType
+json_lex_fail_at_char_start(JsonLexContext *lex, char *s,
+			JsonParseErrorType code)
+{
+	lex->token_terminator = s;
+
+	return code;
+}
+
+static inline JsonParseErrorType
+json_lex_fail_at_char_end(JsonLexContext *lex, char *s,
+		  JsonParseErrorType code)
+{
+	lex->token_terminator =
+		s + pg_encoding_mblen_bounded(lex->input_encoding, s);
+
+	return code;
+}
+
 /*
  * Lex one token from the input stream.
  */
@@ -553,7 +580,6 @@ json_lex(JsonLexContext *lex)
 {
 	char	   *s;
 	char	   *const end = lex->input + lex->input_length;
-	JsonParseErrorType result;
 
 	/* Skip leading whitespace. */
 	s = lex->token_terminator;
@@ -571,9 +597,7 @@ json_lex(JsonLexContext *lex)
 	if (s >= end)
 	{
 		lex->token_start = NULL;
-		lex->prev_token_terminator = lex->token_terminator;
-		lex->token_terminator = s;
-		lex->token_type = JSON_TOKEN_END;
+		return json_lex_success(lex, s, JSON_TOKEN_END);
 	}
 	else
 	{
@@ -581,49 +605,31 @@ json_lex(JsonLexContext *lex)
 		{
 /* Single-character token, some kind of punctuation mark. */
 			case '{':
-				lex->prev_token_terminator = lex->token_terminator;
-				lex->token_terminator = s + 1;
-				lex->token_type = JSON_TOKEN_OBJECT_START;
-				break;
+				return json_lex_success(lex, s + 1, JSON_TOKEN_OBJECT_START);
+
 			case '}':
-				lex->prev_token_terminator = lex->token_terminator;
-				lex->token_terminator 

Re: Reducing the chunk header sizes on all memory context types

2022-09-07 Thread Yura Sokolov
On Wed, 07/09/2022 at 16:13 +1200, David Rowley wrote:
> On Tue, 6 Sept 2022 at 01:41, David Rowley  wrote:
> > 
> > On Fri, 2 Sept 2022 at 20:11, David Rowley  wrote:
> > > 
> > > On Thu, 1 Sept 2022 at 12:46, Tom Lane  wrote:
> > > > 
> > > > David Rowley  writes:
> > > > > Maybe we should just consider always making room for a sentinel for
> > > > > chunks that are on dedicated blocks. At most that's an extra 8 bytes
> > > > > in some allocation that's either over 1024 or 8192 (depending on
> > > > > maxBlockSize).
> > > > 
> > > > Agreed, if we're not doing that already then we should.
> > > 
> > > Here's a patch to that effect.
> > 
> > If there are no objections, then I plan to push that patch soon.
> 
> I've now pushed the patch which adds the sentinel space in more cases.
> 
> The final analysis I did on the stats gathered during make
> installcheck show that we'll now allocate about 19MBs more over the
> entire installcheck run out of about 26GBs total allocations.
> 
> That analysis looks something like:
> 
> Before:
> 
> SELECT CASE
>  WHEN pow2_size > 0
>   AND pow2_size = size THEN 'No'
>  WHEN pow2_size = 0
>   AND size = maxalign_size THEN 'No'
>  ELSE 'Yes'
>    END    AS has_sentinel,
>    Count(*)   AS n_allocations,
>    Sum(CASE
>  WHEN pow2_size > 0 THEN pow2_size
>  ELSE maxalign_size
>    END) / 1024 / 1024 mega_bytes_alloc
> FROM   memstats
> GROUP  BY 1;
> has_sentinel | n_allocations | mega_bytes_alloc
> --+---+--
>  No   |  26445855 |    21556
>  Yes  |  37602052 | 5044
> 
> After:
> 
> SELECT CASE
>  WHEN pow2_size > 0
>   AND pow2_size = size THEN 'No'
>  WHEN pow2_size = 0
>   AND size = maxalign_size THEN 'Yes' -- this part changed
>  ELSE 'Yes'
>    END    AS has_sentinel,
>    Count(*)   AS n_allocations,
>    Sum(CASE
>  WHEN pow2_size > 0 THEN pow2_size
>  WHEN size = maxalign_size THEN maxalign_size + 8
>  ELSE maxalign_size
>    END) / 1024 / 1024 mega_bytes_alloc
> FROM   memstats
> GROUP  BY 1;
> has_sentinel | n_allocations | mega_bytes_alloc
> --+---+--
>  No   |  23980527 | 2177
>  Yes  |  40067380 |    24442
> 
> That amounts to previously having about 58.7% of allocations having a
> sentinel up to 62.6% currently, during the installcheck run.
> 
> It seems a pretty large portion of allocation request sizes are
> power-of-2 sized and use AllocSet.

19MB out of 26GB is almost nothing. If it is only for --enable-cassert
builds, then it is perfectly acceptable.

regards
Yura


plpython causes assertions with python debug build

2022-08-17 Thread Yura Sokolov
Hello, hackers.

I was investigating Valgrind issues with plpython. It turns out
python itself doesn't play well with Valgrind in default build.

Therefore I built python with valgrind related flags
--with-valgrind --without-pymalloc
and added debug flags just to be sure
--with-pydebug --with-assertions

This causes plpython's tests to fail on Python's internal
assertions.
Example backtrace (python version 3.7, postgresql master branch):

#8  0x7fbf02851662 in __GI___assert_fail "!PyErr_Occurred()"
at assert.c:101
#9  0x7fbef9060d31 in _PyType_Lookup
at Objects/typeobject.c:3117
 
#10 0x7fbef90461be in _PyObject_GenericGetAttrWithDict
at Objects/object.c:1231
#11 0x7fbef9046707 in PyObject_GenericGetAttr
at Objects/object.c:1309
 
#12 0x7fbef9043cdf in PyObject_GetAttr
at Objects/object.c:913
#13 0x7fbef90458d9 in PyObject_GetAttrString
at Objects/object.c:818
#14 0x7fbf02499636 in get_string_attr
at plpy_elog.c:569
#15 0x7fbf02498ea5 in PLy_get_error_data
at plpy_elog.c:420
#16 0x7fbf0249763b in PLy_elog_impl
at plpy_elog.c:77

Looks like there are several places where the code tries to get
attributes from error objects, and while the code is ready for the
attribute's absence, it doesn't clear the AttributeError exception
in that case.

The attached patch adds 3 calls to PyErr_Clear() in places where the
code reacts to attribute absence. With this patch the tests pass.

There were similar findings before. Calls to PyErr_Clear that were
close to, but not exactly at, the same places were removed in
  7e3bb08038 Fix access-to-already-freed-memory issue in plpython's error 
handling.
Then one of the PyErr_Clear calls was added back in
  1d2f9de38d Fix freshly-introduced PL/Python portability bug.
But it looks like there's a need for more.

PS. When Python is compiled with `--with-valgrind --without-pymalloc`,
Valgrind doesn't complain, so there are no memory-related
issues in plpython.

regards

------

Yura Sokolov
y.sokolov
diff --git a/src/pl/plpython/plpy_elog.c b/src/pl/plpython/plpy_elog.c
index 7c627eacfbf..aa1cf17b366 100644
--- a/src/pl/plpython/plpy_elog.c
+++ b/src/pl/plpython/plpy_elog.c
@@ -359,7 +359,10 @@ PLy_get_sqlerrcode(PyObject *exc, int *sqlerrcode)
 
 	sqlstate = PyObject_GetAttrString(exc, "sqlstate");
 	if (sqlstate == NULL)
+	{
+		PyErr_Clear();
 		return;
+	}
 
 	buffer = PLyUnicode_AsString(sqlstate);
 	if (strlen(buffer) == 5 &&
@@ -395,6 +398,7 @@ PLy_get_spi_error_data(PyObject *exc, int *sqlerrcode, char **detail,
 	}
 	else
 	{
+		PyErr_Clear();
 		/*
 		 * If there's no spidata, at least set the sqlerrcode. This can happen
 		 * if someone explicitly raises a SPI exception from Python code.
@@ -571,6 +575,10 @@ get_string_attr(PyObject *obj, char *attrname, char **str)
 	{
 		*str = pstrdup(PLyUnicode_AsString(val));
 	}
+	else if (val == NULL)
+	{
+		PyErr_Clear();
+	}
 	Py_XDECREF(val);
 }
 


Re: optimize lookups in snapshot [sub]xip arrays

2022-07-24 Thread Yura Sokolov
On Wed, 13/07/2022 at 10:09 -0700, Nathan Bossart wrote:
> Hi hackers,
> 
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60₋second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
> 
>  writers  HEAD  patch  diff
>     
>  16   659   664    +1%
>  32   645   663    +3%
>  64   659   692    +5%
>  128  641   716    +12%
>  256  619   610    -1%
>  512  530   702    +32%
>  768  469   582    +24%
>  1000 367   577    +57%
> 
> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
> 
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
> 
> Thoughts?
> 
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
> 

I'm glad my idea has been reborn.

Well, maybe simplehash is not a bad idea, though it certainly consumes
more memory and CPU instructions.

I'll try to review it.

regards,

Yura Sokolov




Re: SLRUs in the main buffer pool, redux

2022-07-21 Thread Yura Sokolov
Good day, Thomas

On Fri, 27/05/2022 at 23:24 +1200, Thomas Munro wrote:
> Rebased, debugged and fleshed out a tiny bit more, but still with
> plenty of TODO notes and questions.  I will talk about this idea at
> PGCon, so I figured it'd help to have a patch that actually applies,
> even if it doesn't work quite right yet.  It's quite a large patch but
> that's partly because it removes a lot of lines...

Looks like it has to be rebased again.






Re: MultiXact\SLRU buffers configuration

2022-07-21 Thread Yura Sokolov
Good day, all.

I benchmarked the patch on a 2-socket Xeon 5220 CPU @ 2.20GHz.
I used the "benchmark" we use to reproduce problems with SLRU on our
customers' setups.
In contrast to Shawn's tests, I concentrated on the bad case: a lot
of contention.

slru-funcs.sql - function definitions
  - the functions create a lot of subtransactions to stress subtrans
  - and select random rows FOR SHARE to stress multixacts

slru-call.sql - the function call for the benchmark

slru-ballast.sql - randomly selects 1000 consecutive rows
"for update skip locked" to stress multixacts

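The attachments themselves are not inlined here; purely as an illustration of
the shape of such a script, the ballast could look roughly like this
(hypothetical bounds, standard pgbench_accounts table from pgbench -i -s 1):

\set start random(1, 99000)
BEGIN;
SELECT count(*) FROM (
    SELECT aid
      FROM pgbench_accounts
     WHERE aid BETWEEN :start AND :start + 999
       FOR UPDATE SKIP LOCKED) s;
END;
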
patch1 - make SLRU buffers configurable
patch2 - make "8-associative banks"

Benchmark done by pgbench.
Inited with scale 1 to induce contention.
pgbench -i -s 1 testdb

Benchmark 1:
- low number of connections (50), 60% slru-call, 40% slru-ballast
pgbench -f slru-call.sql@60 -f slru-ballast.sql@40 -c 50 -j 75 -P 1 -T 30 
testdb

version | subtrans | multixact | tps
| buffers  | offs/memb | func+ballast
+--+---+--
master  | 32   | 8/16  | 184+119
patch1  | 32   | 8/16  | 184+119
patch1  | 1024 | 8/16  | 121+77
patch1  | 1024 | 512/1024  | 118+75
patch2  | 32   | 8/16  | 190+122
patch2  | 1024 | 8/16  | 190+125
patch2  | 1024 | 512/1024  | 190+127

As you can see, this test case degrades with a plain increase of
SLRU buffers. But using a "hash table" in the form of "associative
buckets" keeps performance stable.

Benchmark 2:
- high connection number (600), 98% slru-call, 2% slru-ballast
pgbench -f slru-call.sql@98 -f slru-ballast.sql@2 -c 600 -j 75 -P 1 -T 30 
testdb

I don't paste the "ballast" tps here since 2% makes them too small,
and they're very noisy.

version | subtrans | multixact | tps
| buffers  | offs/memb | func
+--+---+--
master  | 32   | 8/16  | 13
patch1  | 32   | 8/16  | 13
patch1  | 1024 | 8/16  | 31
patch1  | 1024 | 512/1024  | 53
patch2  | 32   | 8/16  | 12
patch2  | 1024 | 8/16  | 34
patch2  | 1024 | 512/1024  | 67

In this case a simple buffer increase does help. But the "buckets"
increase the performance gain further.

I didn't paste here the results for the third part of the patch ("Pack SLRU...")
because I didn't see any major performance gain from it, and it makes up
a large part of the patch diff.

Rebased versions of the first two patch parts are attached.

regards,

Yura Sokolov


slru-ballast.sql
Description: application/sql


slru-call.sql
Description: application/sql


slru-func.sql
Description: application/sql
From 41ec9d1c54184c515d53ecc8021c4a998813f2a9 Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Sun, 11 Apr 2021 21:18:10 +0300
Subject: [PATCH v21 2/2] Divide SLRU buffers into 8-associative banks

We want to eliminate linear search within SLRU buffers.
To do so we divide SLRU buffers into banks. Each bank holds
approximately 8 buffers. Each SLRU pageno may reside only in one bank.
Adjacent pagenos reside in different banks.
---
 src/backend/access/transam/slru.c | 43 ---
 src/include/access/slru.h |  2 ++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index b65cb49d7ff..abc534bbd06 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -134,7 +134,7 @@ typedef enum
 static SlruErrorCause slru_errcause;
 static int	slru_errno;
 
-
+static void SlruAdjustNSlots(int* nslots, int* banksize, int* bankoffset);
 static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
 static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
 static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
@@ -148,6 +148,30 @@ static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 	  int segpage, void *data);
 static void SlruInternalDeleteSegment(SlruCtl ctl, int segno);
 
+/*
+ * Pick bank size optimal for N-associative SLRU buffers.
+ * We expect bank number to be picked from lowest bits of requested pageno.
+ * Thus we want number of banks to be power of 2. This routine computes number
+ * of banks aiming to make each bank of size 8. So we can pack page number and
+ * statuses of each bank on one cacheline.
+ */
+static void SlruAdjustNSlots(int* nslots, int* banksize, int* bankoffset)
+{
+	int nbanks = 1;
+	*banksize = *nslots;
+	*bankoffset = 0;
+	while (*banksize > 15)
+	{
+		if ((*banksize & 1) != 0)
+			*banksize +=1;
+		*banksize /= 2;
+		nbanks *= 2;
+		*bankoffset += 1;
+	}
+	elog(DEBUG5, "nslots %d banksize %d nbanks %d ", *nslots, *banksize, nbanks);
+	*nslots = *banksize * nbanks;
+}
+
 /*
  * Initialization of shared memory
  */
@@ -156,6 +180,8 @@ Size
 SimpleLruShmemSize(int nslots, int nlsns)
 {
 	Size		sz;
+	int bankoffset, banksize;
+	SlruAdjustNSlots(&nslots, &banksize, &bankoffset);
 
 	/* we assume nslots isn't so large as to risk overflow */
 

Re: Reducing the chunk header sizes on all memory context types

2022-07-17 Thread Yura Sokolov
On Tue, 12/07/2022 at 22:41 -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-07-12 10:42:07 -0700, Andres Freund wrote:
> > On 2022-07-12 17:01:18 +1200, David Rowley wrote:
> > > There is at least one. It might be major; to reduce the AllocSet chunk
> > > header from 16 bytes down to 8 bytes I had to get rid of the freelist
> > > pointer that was reusing the "aset" field in the chunk header struct.
> > > This works now by storing that pointer in the actual palloc'd memory.
> > > This could lead to pretty hard-to-trace bugs if we have any code that
> > > accidentally writes to memory after pfree.
> > 
> > Can't we use the same trick for allcations in the freelist as we do for the
> > header in a live allocation? I.e. split the 8 byte header into two and use
> > part of it to point to the next element in the list using the offset from 
> > the
> > start of the block, and part of it to indicate the size?
> 
> So that doesn't work because the members in the freelist can be in different
> blocks and those can be further away from each other.
> 
> 
> Perhaps that could still made work somehow: To point to a block we don't
> actually need 64bit pointers, they're always at least of some certain size -
> assuming we can allocate them suitably aligned. And chunks are always 8 byte
> aligned. Unfortunately that doesn't quite get us far enough - assuming a 4kB
> minimum block size (larger than now, but potentially sensible as a common OS
> page size), we still only get to 2^12*8 = 32kB.

Well, we actually have freelists for 11 size classes.
It is just 11 pointers.
We could embed this 88 bytes in every ASet block and then link blocks.
And then in every block have 44 bytes for in-block free lists.
Total overhead is 132 bytes per-block.
Or 110 if we limit block size to 65k*8b=512kb.

With doubly-linked block lists (176 bytes per block + 44 bytes for the in-block
lists = 220 bytes), we could track block fullness and deallocate a block once it
doesn't contain any live allocation. That would make the "generation" and "slab"
allocators less useful.

But the CPU overhead will be noticeable.
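
To make the arithmetic above concrete, a purely illustrative per-block layout
(not an actual patch) could be:

#include <stdint.h>

typedef struct AllocBlockFreeLists
{
	/* per size class: next block with a free chunk of that class (11 * 8 = 88 bytes) */
	struct AllocBlockData *next_block_with_free[11];
	/* per size class: offset of the first free chunk inside this block (11 * 4 = 44 bytes) */
	uint32_t	first_free_chunk_off[11];
} AllocBlockFreeLists;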

> It'd easily work if we made each context have an array of allocated non-huge
> blocks, so that the blocks can be addressed with a small index. The overhead
> of that could be reduced in the common case by embedding a small constant
> sized array in the Aset.  That might actually be worth trying out.
> 
> Greetings,
> 
> Andres Freund
> 
> 





Re: Reducing the chunk header sizes on all memory context types

2022-07-12 Thread Yura Sokolov
Good day, David.

On Tue, 12/07/2022 at 17:01 +1200, David Rowley wrote:
> Over on [1], I highlighted that 40af10b57 (Use Generation memory
> contexts to store tuples in sorts) could cause some performance
> regressions for sorts when the size of the tuple is exactly a power of
> 2.  The reason for this is that the chunk header for generation.c
> contexts is 8 bytes larger (on 64-bit machines) than the aset.c chunk
> header. This means that we can store fewer tuples per work_mem during
> the sort and that results in more batches being required.
> 
> As I write this, this regression is still marked as an open item for
> PG15 in [2]. So I've been working on this to try to assist the
> discussion about if we need to do anything about that for PG15.
> 
> Over on [3], I mentioned that Andres came up with an idea and a
> prototype patch to reduce the chunk header size across the board by
> storing the context type in the 3 least significant bits in a uint64
> header.
> 
> I've taken Andres' patch and made some quite significant changes to
> it. In the patch's current state, the sort performance regression in
> PG15 vs PG14 is fixed. The generation.c context chunk header has been
> reduced to 8 bytes from the previous 24 bytes as it is in master.
> aset.c context chunk header is now 8 bytes instead of 16 bytes.
> 
> We can use this 8-byte chunk header by using the remaining 61-bits of
> the uint64 header to encode 2 30-bit values to store the chunk size
> and also the number of bytes we must subtract from the given pointer
> to find the block that the chunk is stored on.  Once we have the
> block, we can find the owning context by having a pointer back to the
> context from the block.  For memory allocations that are larger than
> what can be stored in 30 bits, the chunk header gets an additional two
> Size fields to store the chunk_size and the block offset.  We can tell
> the difference between the 2 chunk sizes by looking at the spare 1-bit
> the uint64 portion of the header.

I don't get why a "large chunk" needs additional fields for size and
offset.
Large allocation sizes are certainly rounded to the page size.
And allocations that don't fit in 1GB we could easily round to 1MB.
Then we could simply store `size>>20`.
That would limit MaxAllocHugeSize to `(1<<(30+20))-1`, i.e. 1PB. I doubt we
will deal with such huge allocations in the near future.

And to limit the block offset, we just have to limit maxBlockSize to 1GB,
which is quite a reasonable limitation.
Chunks larger than maxBlockSize go to separate blocks anyway,
so they have a small block offset.
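
A rough sketch of such a size encoding (illustrative only, all names hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define HUGE_SIZE_SHIFT 20			/* huge chunks rounded up to 1MB */

/* store a chunk size in the 30-bit header field */
static inline uint32_t
encode_chunk_size(uint64_t size, bool is_huge)
{
	return (uint32_t) (is_huge ? (size >> HUGE_SIZE_SHIFT) : size);
}

/* recover the (rounded) size from the 30-bit field */
static inline uint64_t
decode_chunk_size(uint32_t encoded, bool is_huge)
{
	return is_huge ? ((uint64_t) encoded << HUGE_SIZE_SHIFT) : (uint64_t) encoded;
}

/* 30 bits of 1MB units give the (1<<(30+20))-1 = 1PB ceiling mentioned above */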

> Aside from speeding up the sort case, this also seems to have a good
> positive performance impact on pgbench read-only workload with -M
> simple. I'm seeing about a 17% performance increase on my AMD
> Threadripper machine.
> 
> master + Reduced Memory Context Chunk Overhead
> drowley@amd3990x:~$ pgbench -S -T 60 -j 156 -c 156 -M simple postgres
> tps = 1876841.953096 (without initial connection time)
> tps = 1919605.408254 (without initial connection time)
> tps = 1891734.480141 (without initial connection time)
> 
> Master
> drowley@amd3990x:~$ pgbench -S -T 60 -j 156 -c 156 -M simple postgres
> tps = 1632248.236962 (without initial connection time)
> tps = 1615663.151604 (without initial connection time)
> tps = 1602004.146259 (without initial connection time)

The trick with the 3-bit context type is great.

> The attached .png file shows the same results for PG14 and PG15 as I
> showed in the blog [4] where I discovered the regression and adds the
> results from current master + the attached patch. See bars in orange.
> You can see that the regression at 64MB work_mem is fixed. Adding some
> tracing to the sort shows that we're now doing 671745 tuples per batch
> instead of 576845 tuples. This reduces the number of batches from 245
> down to 210.
> 
> Drawbacks:
> 
> There is at least one. It might be major; to reduce the AllocSet chunk
> header from 16 bytes down to 8 bytes I had to get rid of the freelist
> pointer that was reusing the "aset" field in the chunk header struct.
> This works now by storing that pointer in the actual palloc'd memory.
> This could lead to pretty hard-to-trace bugs if we have any code that
> accidentally writes to memory after pfree.  The slab.c context already
> does this, but that's far less commonly used.  If we decided this was
> unacceptable then it does not change anything for the generation.c
> context.  The chunk header will still be 8 bytes instead of 24 there.
> So the sort performance regression will still be fixed.

At least we can still mark the free list pointer with VALGRIND_MAKE_MEM_NOACCESS
and do VALGRIND_MAKE_MEM_DEFINED when fetching from the free list, can't we?

> To improve this situation, we might be able code it up so that
> MEMORY_CONTEXT_CHECKING builds add an additional freelist pointer to
> the header and also write it to the palloc'd memory then verify
> they're set to the same thing when we reuse a chunk from the 

Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

2022-07-08 Thread Yura Sokolov
On Sat, 09/07/2022 at 02:32 +0300, Yura Sokolov wrote:
> On Fri, 08/07/2022 at 11:04 -0400, Robert Haas wrote:
> > On Fri, Jul 8, 2022 at 10:11 AM Yura Sokolov  
> > wrote:
> > > I see analogy with Bus Stop:
> > > - there is bus stop
> > > - there is a schedule of bus arriving this top
> > > - there are passengers, who every day travel with this bus
> > > 
> > > Bus occasionally comes later... Well, it comes later quite often...
> > > 
> > > Which way Major (or other responsible person) should act?
> > 
> > I do not think that is a good analogy, because a bus schedule is an
> > implicit promise - or at least a strong suggestion - that the bus will
> > arrive at the scheduled time.
> 
> There is implicit promise: those data are written in single row.
> If you want to notice they are NOT related to each other, return them
> in different rows or even in different view tables.
> 
> > In this case, who made such a promise? The original post presents it
> > as fact that these systems should give compatible answers at all
> > times, but there's nothing in the code or documentation to suggest
> > that this is true.
> > 
> > IMHO, a better analogy would be if you noticed that the 7:03am bus was
> > normally blue and you took that one because you have a small child who
> > likes the color blue and it makes them happy to take a blue bus. And
> > then one day the bus at that time is a red bus and your child is upset
> > and you call the major (or other responsible person) to complain.
> > They're probably not going to handle that situation by trying to send
> > a blue bus at 7:03am as often as possible. They're going to tell you
> > that they only promised you a bus at 7:03am, not what color it would
> > be.
> > 
> > Perhaps that's not an ideal analogy either, because the reported wait
> > event and the reported activity are more closely related than the time
> > of a bus is to the color of the bus. But I think it's still true that
> > nobody ever promised that those values would be compatible with each
> > other, and that's not really fixable, and that there are lots of other
> > cases just like this one which can't be fixed either.
> > 
> > I think that the more we try to pretend like it is possible to make
> > these values seem like they are synchronized, the more unhappy people
> > will be in the unavoidable cases where they aren't, and the more
> > pressure there will be to try to tighten it up even further. That's
> > likely to result in code that is more complex and slower, which I do
> > not want, and especially not for the sake of avoiding a harmless
> > reporting discrepancy.
> 
> Then just don't return them together, right?

Well, I'm a bit more hot-tempered than needed. I apologize for that.

Let's look at the situation from a compromise point of view:
- We are saying: we could make this view more synchronous (and faster).
- You are saying: it will never be totally synchronous, and it was a
  mistake not to mention the issue in the documentation.

Why not do both?
Why can't we make it more synchronous (and faster) AND mention in the
documentation that it is not totally synchronous and never will be?



regards

Yura





Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

2022-07-08 Thread Yura Sokolov
On Fri, 08/07/2022 at 11:04 -0400, Robert Haas wrote:
> On Fri, Jul 8, 2022 at 10:11 AM Yura Sokolov  wrote:
> > I see analogy with Bus Stop:
> > - there is bus stop
> > - there is a schedule of bus arriving this top
> > - there are passengers, who every day travel with this bus
> > 
> > Bus occasionally comes later... Well, it comes later quite often...
> > 
> > Which way Major (or other responsible person) should act?
> 
> I do not think that is a good analogy, because a bus schedule is an
> implicit promise - or at least a strong suggestion - that the bus will
> arrive at the scheduled time.

There is an implicit promise: those data are written in a single row.
If you want to signal that they are NOT related to each other, return them
in different rows or even in different views.

> In this case, who made such a promise? The original post presents it
> as fact that these systems should give compatible answers at all
> times, but there's nothing in the code or documentation to suggest
> that this is true.
> 
> IMHO, a better analogy would be if you noticed that the 7:03am bus was
> normally blue and you took that one because you have a small child who
> likes the color blue and it makes them happy to take a blue bus. And
> then one day the bus at that time is a red bus and your child is upset
> and you call the major (or other responsible person) to complain.
> They're probably not going to handle that situation by trying to send
> a blue bus at 7:03am as often as possible. They're going to tell you
> that they only promised you a bus at 7:03am, not what color it would
> be.
> 
> Perhaps that's not an ideal analogy either, because the reported wait
> event and the reported activity are more closely related than the time
> of a bus is to the color of the bus. But I think it's still true that
> nobody ever promised that those values would be compatible with each
> other, and that's not really fixable, and that there are lots of other
> cases just like this one which can't be fixed either.
> 
> I think that the more we try to pretend like it is possible to make
> these values seem like they are synchronized, the more unhappy people
> will be in the unavoidable cases where they aren't, and the more
> pressure there will be to try to tighten it up even further. That's
> likely to result in code that is more complex and slower, which I do
> not want, and especially not for the sake of avoiding a harmless
> reporting discrepancy.

Then just don't return them together, right?





Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

2022-07-08 Thread Yura Sokolov
On Fri, 08/07/2022 at 09:44 -0400, Robert Haas wrote:
> On Thu, Jul 7, 2022 at 10:39 PM Kyotaro Horiguchi
>  wrote:
> > At Thu, 7 Jul 2022 13:58:06 -0500, Justin Pryzby  
> > wrote in
> > > I agree that this is a bug, since it can (and did) cause false positives 
> > > in a
> > > monitoring system.
> > 
> > I'm not this is undoubtfully a bug but agree about the rest.
> 
> I don't agree that this is a bug, and even if it were, I don't think
> this patch can fix it.
> 
> Let's start with the second point first: pgstat_report_wait_start()
> and pgstat_report_wait_end() change the advertised wait event for a
> process, while the backend state is changed by
> pgstat_report_activity(). Since those function calls are in different
> places, those changes are bound to happen at different times, and
> therefore you can observe drift between the two values. Now perhaps
> there are some one-directional guarantees: I think we probably always
> set the state to idle before we start reading from the client, and
> always finish reading from the client before the state ceases to be
> idle. But I don't really see how that helps anything, because when you
> read those values, you must read one and then the other. If you read
> the activity before the wait event, you might see the state before it
> goes idle and then the wait event after it's reached ClientRead. If
> you read the wait event before the activity, you might see the wait
> event as ClientRead, and then by the time you check the activity the
> backend might have gotten some data from the client and no longer be
> idle. The very best a patch like this can hope to do is narrow the
> race condition enough that the discrepancies are observed less
> frequently in practice.
> 
> And that's why I think this is not a bug fix, or even a good idea.
> It's just encouraging people to rely on something which can never be
> fully reliable in the way that the original poster is hoping. There
> was never any intention of having wait events synchronized with the
> pgstat_report_activity() stuff, and I think that's perfectly fine.
> Both systems are trying to provide visibility into states that can
> change very quickly, and therefore they need to be low-overhead, and
> therefore they use very lightweight synchronization, which means that
> ephemeral discrepancies are possible by nature. There are plenty of
> other examples of that as well. You can't for example query pg_locks
> and pg_stat_activity in the same query and expect that all and only
> those backends that are apparently waiting for a lock in
> pg_stat_activity will have an ungranted lock in pg_locks. It just
> doesn't work like that, and there's a very good reason for that:
> trying to make all of these introspection facilities behave in
> MVCC-like ways would be painful to code and probably end up slowing
> the system down substantially.
> 
> I think the right fix here is to change nothing in the code, and stop
> expecting these things to be perfectly consistent with each other.

I see an analogy with a bus stop:
- there is a bus stop
- there is a schedule of buses arriving at this stop
- there are passengers who travel on this bus every day

The bus occasionally comes late... Well, it comes late quite often...

How should the mayor (or another responsible person) act?

First possibility: do everything possible so the bus arrives on schedule. And
although there will be no 100% guarantee, reliability will rise from 90%
to 99%.

Second possibility: tell the passengers "you should not rely on the bus
schedule, and we will not do anything to make it more reliable".

If I were a passenger, I'd prefer the first choice.


regards

Yura





Re: BufferAlloc: don't take two simultaneous locks

2022-06-28 Thread Yura Sokolov
On Tue, 28/06/2022 at 14:26 +0300, Yura Sokolov wrote:
> В Вт, 28/06/2022 в 14:13 +0300, Yura Sokolov пишет:
> 
> > Tests:
> > - tests done on 2 socket Xeon 5220 2.20GHz with turbo bust disabled
> >   (ie max frequency is 2.20GHz)
> 
> Forgot to mention:
> - this time it was Centos7.9.2009 (Core) with Linux mn10 
> 3.10.0-1160.el7.x86_64
> 
> Perhaps older kernel describes poor master's performance on 2 sockets
> compared to my previous results (when this server had Linux 5.10.103-1 
> Debian).
> 
> Or there is degradation in PostgreSQL's master branch between.
> I'll try to check today.

No, the old master commit (7e12256b47, Sat Mar 12 14:21:40 2022) behaves the same.
So it is clearly an old-kernel issue. Perhaps futexes were much slower in those
days.





Re: BufferAlloc: don't take two simultaneous locks

2022-06-28 Thread Yura Sokolov
On Tue, 28/06/2022 at 14:13 +0300, Yura Sokolov wrote:

> Tests:
> - tests done on 2 socket Xeon 5220 2.20GHz with turbo bust disabled
>   (ie max frequency is 2.20GHz)

Forgot to mention:
- this time it was Centos7.9.2009 (Core) with Linux mn10 3.10.0-1160.el7.x86_64

Perhaps the older kernel explains master's poor performance on 2 sockets
compared to my previous results (when this server had Linux 5.10.103-1 Debian).

Or there is a degradation in PostgreSQL's master branch in between.
I'll try to check today.

regards

---

Yura Sokolov





Re: PG15 beta1 sort performance regression due to Generation context change

2022-06-02 Thread Yura Sokolov
On Fri, 27/05/2022 at 10:51 -0400, Tom Lane writes:
> Yura Sokolov  writes:
> > В Вт, 24/05/2022 в 17:39 -0700, Andres Freund пишет:
> > > A variation on your patch would be to only store the offset to the block
> > > header - that should always fit into 32bit (huge allocations being their 
> > > own
> > > block, which is why this wouldn't work for storing an offset to the
> > > context).
> > I'm +1 for this.
> 
> Given David's results in the preceding message, I don't think I am.

But David did the opposite: he removed the pointer to the block and kept the
pointer to the context. Then the code has to do a bsearch to find the actual block.

> A scheme like this would add more arithmetic and at least one more
> indirection to GetMemoryChunkContext(), and we already know that
> adding even a test-and-branch there has measurable cost.  (I wonder
> if using unlikely() on the test would help?  But it's not unlikely
> in a generation-context-heavy use case.)

Well, it should be tested.

> There would also be a good
> deal of complication and ensuing slowdown created by the need for
> oversize chunks to be a completely different kind of animal with a
> different header.

Why? encoded_size could handle both small sizes and large sizes,
given that the actual (not requested) allocation size is rounded to the page size.
There's no need for a different chunk header.

> I'm also not very happy about this:
>
> > And with this change every memory context kind can have same header:
> 
> IMO that's a bug not a feature.  It puts significant constraints on how
> context types can be designed.

Nothing prevents adding additional data before the common header.


regards

Yura





Re: PG15 beta1 sort performance regression due to Generation context change

2022-05-27 Thread Yura Sokolov
On Tue, 24/05/2022 at 17:39 -0700, Andres Freund wrote:
> 
> A variation on your patch would be to only store the offset to the block
> header - that should always fit into 32bit (huge allocations being their own
> block, which is why this wouldn't work for storing an offset to the
> context). With a bit of care that'd allow aset.c to half it's overhead, by
> using 4 bytes of space for all non-huge allocations.  Of course, it'd increase
> the cost of pfree() of small allocations, because AllocSetFree() currently
> doesn't need to access the block for those. But I'd guess that'd be outweighed
> by the reduced memory usage.

I'm +1 for this.

And with this change every memory context kind can have the same header:

typedef struct MemoryChunk {
#ifdef MEMORY_CONTEXT_CHECKING
Size   requested_size;
#endif
uint32 encoded_size;/* encoded allocation size */
uint32 offset_to_block; /* backward offset to block header */
} MemoryChunk;

The allocated size could always be encoded into a uint32 since it is rounded
for large allocations (I believe large allocations are certainly rounded
to at least 4096 bytes):

encoded_size = size < (1u<<31) ? size : (1u<<31)|(size>>12);
/* and reverse */
size = (encoded_size >> 31) ? ((Size)(encoded_size & 0x7fffffff) << 12) :
   (Size)encoded_size;

There is a glitch with aset since it currently reuses the `aset` pointer
for the freelist link. With such a change this link would have to be encoded
in the chunk body itself instead of the header. I was confused by this, since
there are valgrind hooks, and I was not sure how to change it (I'm
not good at valgrind hooks). But after thinking more about it, I believe
it is doable.
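
For illustration, here is a tiny standalone sketch (not PostgreSQL code) that
round-trips the encoding above. It assumes a 64-bit Size and that large
allocations are multiples of 4096 bytes, as stated:

/* Standalone sketch of the encoded_size scheme; compile with any C99 compiler. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Size;			/* stand-in for PostgreSQL's Size */

static uint32_t
encode_size(Size size)
{
	return size < ((Size) 1 << 31) ? (uint32_t) size
		: (uint32_t) ((1u << 31) | (size >> 12));
}

static Size
decode_size(uint32_t encoded)
{
	return (encoded >> 31) ? ((Size) (encoded & 0x7fffffffu) << 12)
		: (Size) encoded;
}

int
main(void)
{
	Size		sizes[] = {64, 8192, ((Size) 1 << 31) - 1,
						   (Size) 1 << 31, (Size) 5 << 32};

	for (int i = 0; i < 5; i++)
		assert(decode_size(encode_size(sizes[i])) == sizes[i]);
	printf("round-trip ok\n");
	return 0;
}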


regards

---

Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-05-10 Thread Yura Sokolov
On Fri, 06/05/2022 at 10:26 -0400, Robert Haas wrote:
> On Thu, Apr 21, 2022 at 6:58 PM Yura Sokolov  wrote:
> > At the master state:
> > - SharedBufHash is not declared as HASH_FIXED_SIZE
> > - get_hash_entry falls back to element_alloc too fast (just if it doesn't
> >   found free entry in current freelist partition).
> > - get_hash_entry has races.
> > - if there are small number of spare items (and NUM_BUFFER_PARTITIONS is
> >   small number) and HASH_FIXED_SIZE is set, it becomes contended and
> >   therefore slow.
> > 
> > HASH_REUSE solves (for shared buffers) most of this issues. Free list
> > became rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
> > to performance hit. And with fair number of spare items, get_hash_entry
> > will find free entry despite its races.
> 
> Hmm, I see. The idea of trying to arrange to reuse entries rather than
> pushing them onto a freelist and immediately trying to take them off
> again is an interesting one, and I kind of like it. But I can't
> imagine that anyone would commit this patch the way you have it. It's
> way too much action at a distance. If any ereport(ERROR,...) could
> happen between the HASH_REUSE operation and the subsequent HASH_ENTER,
> it would be disastrous, and those things are separated by multiple
> levels of call stack across different modules, so mistakes would be
> easy to make. If this could be made into something dynahash takes care
> of internally without requiring extensive cooperation with the calling
> code, I think it would very possibly be accepted.
> 
> One approach would be to have a hash_replace() call that takes two
> const void * arguments, one to delete and one to insert. Then maybe
> you propagate that idea upward and have, similarly, a BufTableReplace
> operation that uses that, and then the bufmgr code calls
> BufTableReplace instead of BufTableDelete. Maybe there are other
> better ideas out there...

No.

While HASH_REUSE is a good addition to the overall performance improvement
of the patch, it is not required for the major gain.

The major gain comes from not taking two partition locks simultaneously.

hash_replace would require two locks, so it is not an option.

regards

-

Yura





Re: Multi-Master Logical Replication

2022-04-28 Thread Yura Sokolov
On Thu, 28/04/2022 at 17:37 +0530, vignesh C wrote:
> On Thu, Apr 28, 2022 at 4:24 PM Yura Sokolov  wrote:
> > В Чт, 28/04/2022 в 09:49 +1000, Peter Smith пишет:
> > 
> > > 1.1 ADVANTAGES OF MMLR
> > > 
> > > - Increases write scalability (e.g., all nodes can write arbitrary data).
> > 
> > I've never heard how transactional-aware multimaster increases
> > write scalability. More over, usually even non-transactional
> > multimaster doesn't increase write scalability. At the best it
> > doesn't decrease.
> > 
> > That is because all hosts have to write all changes anyway. But
> > side cost increases due to increased network interchange and
> > interlocking (for transaction-aware MM) and increased latency.
> 
> I agree it won't increase in all cases, but it will be better in a few
> cases when the user works on different geographical regions operating
> on independent schemas in asynchronous mode. Since the write node is
> closer to the geographical zone, the performance will be better in a
> few cases.

From the EnterpriseDB BDR page [1]:

> Adding more master nodes to a BDR Group does not result in
> significant write throughput increase when most tables are
> replicated because BDR has to replay all the writes on all nodes.
> Because BDR writes are in general more effective than writes coming
> from Postgres clients via SQL, some performance increase can be
> achieved. Read throughput generally scales linearly with the number
> of nodes.

And I'm sure EnterpriseDB does their best.

> > В Чт, 28/04/2022 в 08:34 +, kuroda.hay...@fujitsu.com пишет:
> > > Dear Laurenz,
> > > 
> > > Thank you for your interest in our works!
> > > 
> > > > I am missing a discussion how replication conflicts are handled to
> > > > prevent replication from breaking
> > > 
> > > Actually we don't have plans for developing the feature that avoids 
> > > conflict.
> > > We think that it should be done as core PUB/SUB feature, and
> > > this module will just use that.
> > 
> > If you really want to have some proper isolation levels (
> > Read Committed? Repeatable Read?) and/or want to have
> > same data on each "master", there is no easy way. If you
> > think it will be "easy", you are already wrong.
> 
> The synchronous_commit and synchronous_standby_names configuration
> parameters will help in getting the same data across the nodes. Can
> you give an example for the scenario where it will be difficult?

So, synchronous or asynchronous?
Synchronous commit on every master, on every alive master, or on a quorum
of masters?

And it is not about synchronicity. It is about determinism at
conflicts.

If you have fully deterministic conflict resolution that works
exactly the same way on each host, then it is possible to have the same
data on each host. (But it will not be transactional.) And it seems EDB BDR
achieved this.

Or if you have fully and correctly implemented one of the distributed
transaction protocols.

[1]  
https://www.enterprisedb.com/docs/bdr/latest/overview/#characterising-bdr-performance

regards

--

Yura Sokolov





Re: Multi-Master Logical Replication

2022-04-28 Thread Yura Sokolov
On Thu, 28/04/2022 at 09:49 +1000, Peter Smith wrote:

> 1.1 ADVANTAGES OF MMLR
> 
> - Increases write scalability (e.g., all nodes can write arbitrary data).

I've never heard how transaction-aware multimaster increases
write scalability. Moreover, usually even non-transactional
multimaster doesn't increase write scalability. At best it
doesn't decrease it.

That is because all hosts have to write all changes anyway. But the
side cost increases due to increased network interchange,
interlocking (for transaction-aware MM) and increased latency.

On Thu, 28/04/2022 at 08:34 +, kuroda.hay...@fujitsu.com wrote:
> Dear Laurenz,
> 
> Thank you for your interest in our works!
> 
> > I am missing a discussion how replication conflicts are handled to
> > prevent replication from breaking
> 
> Actually we don't have plans for developing the feature that avoids conflict.
> We think that it should be done as core PUB/SUB feature, and
> this module will just use that.

If you really want to have some proper isolation levels
(Read Committed? Repeatable Read?) and/or want to have the
same data on each "master", there is no easy way. If you
think it will be "easy", you are already wrong.

Our company has a MultiMaster which is built on top of
logical replication. It is even partially open source
( https://github.com/postgrespro/mmts ), although some of the
core patches required for it are not up to date.

And it is the second iteration of MM. The first iteration was
not "simple" or "easy" either. But even that version had
a hidden bug: a rare but accumulating data difference
between nodes. The attempt to fix this bug led to an almost
full rewrite of the multi-master.

(Disclaimer: I had no relation to either MM version,
I just work in the same firm).


regards

-

Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-04-22 Thread Yura Sokolov
Btw, I've run tests on an EPYC (80 cores).

1 key per select
  conns | master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+--------------
  1 |  29053 |  28959 |  26715 |  25631 
  2 |  53714 |  53002 |  55211 |  53699 
  3 |  69796 |  72100 |  72355 |  71164 
  5 | 118045 | 112066 | 122182 | 119825 
  7 | 151933 | 156298 | 162001 | 160834 
 17 | 344594 | 347809 | 390103 | 386676 
 27 | 497656 | 527313 | 587806 | 598450 
 53 | 732524 | 853831 | 906569 | 947050 
 83 | 823203 | 991415 |1056884 |1222530 
107 | 812730 | 930175 |1004765 |1232307 
139 | 781757 | 938718 | 995326 |1196653 
163 | 758991 | 969781 | 990644 |1143724 
191 | 774137 | 977633 | 996763 |1210899 
211 | 771856 | 973361 |1024798 |1187824 
239 | 756925 | 940808 | 954326 |1165303 
271 | 756220 | 940508 | 970254 |1198773 
307 | 746784 | 941038 | 940369 |1159446 
353 | 710578 | 928296 | 923437 |1189575 
397 | 715352 | 915931 | 911638 |1180688 

3 keys per select

  conns | master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+--------------
  1 |  17448 |  17104 |  18359 |  19077 
  2 |  30888 |  31650 |  35074 |  35861 
  3 |  44653 |  43371 |  47814 |  47360 
  5 |  69632 |  64454 |  76695 |  76208 
  7 |  96385 |  92526 | 107587 | 107930 
 17 | 195157 | 205156 | 253440 | 239740 
 27 | 302343 | 316768 | 386748 | 335148 
 53 | 334321 | 396359 | 402506 | 486341 
 83 | 300439 | 374483 | 408694 | 452731 
107 | 302768 | 369207 | 390599 | 453817 
139 | 294783 | 364885 | 379332 | 459884 
163 | 272646 | 344643 | 376629 | 460839 
191 | 282307 | 334016 | 363322 | 449928 
211 | 275123 | 321337 | 371023 | 445246 
239 | 263072 | 341064 | 356720 | 441250 
271 | 271506 | 333066 | 373994 | 436481 
307 | 261545 | 333489 | 348569 | 466673 
353 | 255700 | 331344 | 333792 | 455430 
397 | 247745 | 325712 | 326680 | 439245 




Re: BufferAlloc: don't take two simultaneous locks

2022-04-21 Thread Yura Sokolov
On Thu, 21/04/2022 at 16:24 -0400, Robert Haas wrote:
> On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov  wrote:
> >   $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 16512
> > 
> >   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> >   ...
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 20439
> > 
> >   $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> >   ...
> >   $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> > 
> >   $1 = 20541
> > 
> > It stabilizes at 20541
> 
> Hmm. So is the existing comment incorrect?

It is correct and incorrect at the same time. Logically it is correct.
And it is correct in practice if HASH_FIXED_SIZE is set for SharedBufHash
(which it is not currently). But setting HASH_FIXED_SIZE hurts performance
with a low number of spare items.

> Remember, I was complaining
> about this change:
> 
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
> @@ -481,10 +481,10 @@ StrategyInitialize(bool init)
>   *
>   * Since we can't tolerate running out of lookup table entries, we must be
>   * sure to specify an adequate table size here.  The maximum steady-state
> - * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> - * a new entry before deleting the old.  In principle this could be
> - * happening in each partition concurrently, so we could need as many as
> - * NBuffers + NUM_BUFFER_PARTITIONS entries.
> + * usage is of course NBuffers entries. But due to concurrent
> + * access to numerous free lists in dynahash we can miss free entry that
> + * moved between free lists. So it is better to have some spare free entries
> + * to reduce probability of entry allocations after server start.
>   */
>   InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> 
> Pre-patch, the comment claims that the maximum number of buffer
> entries that can be simultaneously used is limited to NBuffers +
> NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
> size. The idea is that we normally need more than 1 entry per buffer,
> but sometimes we might have 2 entries for the same buffer if we're in
> the process of changing the buffer tag, because we make the new entry
> before removing the old one. To change the buffer tag, we need the
> buffer mapping lock for the old partition and the new one, but if both
> are the same, we need only one buffer mapping lock. That means that in
> the worst case, you could have a number of processes equal to
> NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
> between values that both fall into the same partition, and thus each
> using 2 entries. Then you could have every other buffer in use and
> thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
> entries. Now I think you're saying we go far beyond that number, and
> what I wonder is how that's possible. If the system doesn't work the
> way the comment says it does, maybe we ought to start by talking about
> what to do about that.

In the current master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too fast (just if it doesn't
  find a free entry in the current freelist partition).
- get_hash_entry has races.
- if there is a small number of spare items (and NUM_BUFFER_PARTITIONS is
  a small number) and HASH_FIXED_SIZE is set, it becomes contended and
  therefore slow.

HASH_REUSE solves (for shared buffers) most of these issues. The free list
becomes a rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
to a performance hit. And with a fair number of spare items, get_hash_entry
will find a free entry despite its races.
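
To illustrate for readers following the thread, here is a rough sketch of how
the two dynahash calls are meant to pair up. HASH_REUSE is the action proposed
in this patch set (it is not in core), SharedBufHash and the partition locks
are the usual buf_table.c names, and this is not a literal excerpt from the
patch:

/* Under the old mapping partition lock: stash the victim's entry instead of
 * putting it on a freelist (HASH_REUSE is the proposed dynahash action). */
hash_search_with_hash_value(SharedBufHash, (void *) &oldTag, oldHash,
							HASH_REUSE, NULL);
LWLockRelease(oldPartitionLock);

/* Later, under the new mapping partition lock: HASH_ENTER picks up the
 * stashed element, so the freelists are not touched on the common path. */
LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
hash_search_with_hash_value(SharedBufHash, (void *) &newTag, newHash,
							HASH_ENTER, &found);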

> I am a bit confused by your description of having done "p
> SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
> HTAB and HTAB's hctl member is of type HASHHDR, which has no field
> called "allocated".

The previous letter contains links to the small patches I used for
experiments. The link that adds "allocated" is https://pastebin.com/c5z0d5mz

>  I thought maybe my analysis here was somehow
> mistaken, so I tried the debugger, which took the same view of it that
> I did:
> 
> (lldb) p SharedBufHash->hctl->allocated.value
> error: :1:22: no member named 'allocated' in 'HASHHDR'
> SharedBufHash->hctl->allocated.value
> ~~~  ^


-

regards

Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-04-21 Thread Yura Sokolov
(It really could be changed to PANIC.)

Could LWLockRelease elog(ERROR|FATAL)?
(elog(ERROR, "lock is not held") could not be triggered since we
certainly hold the lock).

Could LWLockAcquire elog(ERROR|FATAL)?
Well, there is `elog(ERROR, "too many LWLocks taken");`
It is not possible because we just did LWLockRelease.

Could BufTableInsert elog(ERROR|FATAL)?
There is "out of shared memory" which is avoidable with get_hash_entry
modifications or with HASH_FIXED_SIZE + some spare items.

Could CHECK_FOR_INTERRUPTS raise something?
No: there is single line between LWLockRelease and LWLockAcquire, and
it doesn't contain CHECK_FOR_INTERRUPTS.

Therefore there is a single fixable case of "out of shared memory" (by
HASH_FIXED_SIZE or improvements to "get_hash_entry").


Maybe I'm not quite right at some point. I'd be glad to learn.

-

regards

Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-04-13 Thread Yura Sokolov
On Fri, 08/04/2022 at 16:46 +0900, Kyotaro Horiguchi wrote:
> At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov  
> wrote in 
> > В Чт, 07/04/2022 в 16:55 +0900, Kyotaro Horiguchi пишет:
> > > Hi, Yura.
> > > 
> > > At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov 
> > >  wrot
> > > e in 
> > > > Ok, I got access to stronger server, did the benchmark, found weird
> > > > things, and so here is new version :-)
> > > 
> > > Thanks for the new version and benchmarking.
> > > 
> > > > First I found if table size is strictly limited to NBuffers and FIXED,
> > > > then under high concurrency get_hash_entry may not find free entry
> > > > despite it must be there. It seems while process scans free lists, other
> > > > concurrent processes "moves etry around", ie one concurrent process
> > > > fetched it from one free list, other process put new entry in other
> > > > freelist, and unfortunate process missed it since it tests freelists
> > > > only once.
> > > 
> > > StrategyGetBuffer believes that entries don't move across freelists
> > > and it was true before this patch.
> > 
> > StrategyGetBuffer knows nothing about dynahash's freelist.
> > It knows about buffer manager's freelist, which is not partitioned.
> 
> Yeah, right. I meant get_hash_entry.

But entries don't move.
One backend takes some entry from one freelist, another backend puts
another entry onto another freelist.

> > > I don't think that causes significant performance hit, but I don't
> > > understand how it improves freelist hit ratio other than by accident.
> > > Could you have some reasoning for it?
> > 
> > Since free_reused_entry returns entry into random free_list, this
> > probability is quite high. In tests, I see stabilisa
> 
> Maybe.  Doesn't it improve the efficiency if we prioritize emptied
> freelist on returning an element?  I tried it with an atomic_u32 to
> remember empty freelist. On the uin32, each bit represents a freelist
> index.  I saw it eliminated calls to element_alloc.  I tried to
> remember a single freelist index in an atomic but there was a case
> where two freelists are emptied at once and that lead to element_alloc
> call.

I thought about a bitmask too.
But doesn't it bring back the contention that having many freelists was meant to avoid?
Well, in case there are enough entries to keep it almost always "all
set", it would be effectively immutable.

> > > By the way the change of get_hash_entry looks something wrong.
> > > 
> > > If I understand it correctly, it visits num_freelists/4 freelists at
> > > once, then tries element_alloc. If element_alloc() fails (that must
> > > happen), it only tries freeList[freelist_idx] and gives up, even
> > > though there must be an element in other 3/4 freelists.
> > 
> > No. If element_alloc fails, it tries all NUM_FREELISTS again.
> > - condition: `ntries || !allocFailed`. `!allocFailed` become true,
> >   so `ntries` remains.
> > - `ntries = num_freelists;` regardless of `allocFailed`.
> > Therefore, all `NUM_FREELISTS` are retried for partitioned table.
> 
> Ah, okay. ntries is set to num_freelists after calling element_alloc.
> I think we (I?) need more comments.
> 
> By the way, why it is num_freelists / 4 + 1?

Well, num_freelists could be 1 or 32.
If num_freelists is 1 then num_freelists / 4 == 0 - not good :-) 

--

regards

Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-04-07 Thread Yura Sokolov
On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> Hi, Yura.
> 
> At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov  
> wrot
> e in 
> > Ok, I got access to stronger server, did the benchmark, found weird
> > things, and so here is new version :-)
> 
> Thanks for the new version and benchmarking.
> 
> > First I found if table size is strictly limited to NBuffers and FIXED,
> > then under high concurrency get_hash_entry may not find free entry
> > despite it must be there. It seems while process scans free lists, other
> > concurrent processes "moves etry around", ie one concurrent process
> > fetched it from one free list, other process put new entry in other
> > freelist, and unfortunate process missed it since it tests freelists
> > only once.
> 
> StrategyGetBuffer believes that entries don't move across freelists
> and it was true before this patch.

StrategyGetBuffer knows nothing about dynahash's freelists.
It knows about the buffer manager's freelist, which is not partitioned.

> 
> > Second, I confirm there is problem with freelist spreading.
> > If I keep entry's freelist_idx, then one freelist is crowded.
> > If I use new entry's freelist_idx, then one freelist is emptified
> > constantly.
> 
> Perhaps it is what I saw before.  I'm not sure about the details of
> how that happens, though.
> 
> > Third, I found increased concurrency could harm. When popular block
> > is evicted for some reason, then thundering herd effect occures:
> > many backends wants to read same block, they evict many other
> > buffers, but only one is inserted. Other goes to freelist. Evicted
> > buffers by itself reduce cache hit ratio and provocates more
> > work. Old version resists this effect by not removing old buffer
> > before new entry is successfully inserted.
> 
> Nice finding.
> 
> > To fix this issues I made following:
> > 
> > # Concurrency
> > 
> > First, I limit concurrency by introducing other lwlocks tranche -
> > BufferEvict. It is 8 times larger than BufferMapping tranche (1024 vs
> > 128).
> > If backend doesn't find buffer in buffer table and wants to introduce
> > it, it first calls
> > LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > If lock were acquired, then it goes to eviction and replace process.
> > Otherwise, it waits lock to be released and repeats search.
> >
> > This greately improve performance for > 400 clients in pgbench.
> 
> So the performance difference between the existing code and v11 is the
> latter has a collision cross section eight times smaller than the
> former?

No. Acquiring EvictPartitionLock:
1. doesn't block readers, since readers don't acquire EvictPartitionLock
2. doesn't form a "tree of lock dependency", since EvictPartitionLock is
  independent from PartitionLock.

Problem with the existing code:
1. Process A locks P1 and P2
2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
At this moment partitions P1, P2, P3, P4 and P5 are all locked, waiting
on Process A.
And readers can't read from those five partitions.

With the new code:
1. Process A locks E1 (evict partition) and locks P2,
   then releases P2 and locks P1.
2. Process B tries to lock E1, waits and retries the search.
3. Process C locks E4, locks P1, then releases P1 and locks P4
4. Process D locks E5, locks P4, then releases P4 and locks P5
So, there is no network of locks.
Process A doesn't block Process D at any moment:
- either A blocks C, but C doesn't block D at this moment
- or A doesn't block C.
And readers don't see five simultaneously locked partitions all
depending on a single Process A.
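
To make the ordering concrete, here is a rough sketch of that flow (it is not
the actual patch): EvictPartitionLock() is a hypothetical accessor for the
proposed BufferEvict tranche, victim selection and buffer-header handling are
elided, and only the lock order matters here.

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

static int
sketch_lookup_or_evict(BufferTag *newTag, BufferTag *oldTag, int victim_buf_id)
{
	uint32		newHash = BufTableHashCode(newTag);
	uint32		oldHash = BufTableHashCode(oldTag);
	LWLock	   *newPartitionLock = BufMappingPartitionLock(newHash);
	LWLock	   *oldPartitionLock = BufMappingPartitionLock(oldHash);
	LWLock	   *newEvictLock = EvictPartitionLock(newHash);	/* hypothetical */
	int			buf_id;

retry:
	LWLockAcquire(newPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(newTag, newHash);
	LWLockRelease(newPartitionLock);
	if (buf_id >= 0)
		return buf_id;			/* already mapped, nothing to evict */

	/* Only one backend per evict partition proceeds to evict. */
	if (!LWLockAcquireOrWait(newEvictLock, LW_EXCLUSIVE))
		goto retry;				/* someone else is inserting it: re-lookup */

	/* ... pick the victim buffer here, which gives oldTag ... */

	/* Never hold two mapping partition locks at the same time. */
	LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
	BufTableDelete(oldTag, oldHash);
	LWLockRelease(oldPartitionLock);

	LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
	BufTableInsert(newTag, newHash, victim_buf_id);
	LWLockRelease(newPartitionLock);

	LWLockRelease(newEvictLock);
	return victim_buf_id;
}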

> +* Prevent "thundering herd" problem and limit concurrency.
> 
> this is something like pressing accelerator and break pedals at the
> same time.  If it improves performance, just increasing the number of
> buffer partition seems to work?

To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
does improve the average case.
But it is better to cure the problem than to anesthetize it.
Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
weight of the lock network, but doesn't eliminate it.

> It's also not great that follower backends runs a busy loop on the
> lock until the top-runner backend inserts the new buffer to the
> buftable then releases the newParititionLock.
> 
> > I tried other variant as well:
> > - first insert entry with dummy buffer index into buffer table.
> > - if such entry were already here, then wait it to be filled.
> > - otherwise find victim buffer and replace dummy index with new one.
> > Wait w

Re: BufferAlloc: don't take two simultaneous locks

2022-04-06 Thread Yura Sokolov
Good day, Kyotaro-san.
Good day, hackers.

On Sun, 20/03/2022 at 12:38 +0300, Yura Sokolov wrote:
> В Чт, 17/03/2022 в 12:02 +0900, Kyotaro Horiguchi пишет:
> > At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov  
> > wrote in 
> > > В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет:
> > > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov 
> > > >  wrote in 
> > > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > > > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > > > returns the stashed)" case the stashed element is returned to its
> > > > original partition.  But it is not what I mentioned.
> > > > 
> > > > On the other hand, once the stahsed element is reused by HASH_ENTER,
> > > > it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> > > > from old partition) case.  I suspect that ththat the frequent freelist
> > > > starvation comes from the latter case.
> > > 
> > > Doubtfully. Due to probabilty theory, single partition doubdfully
> > > will be too overflowed. Therefore, freelist.
> > 
> > Yeah.  I think so generally.
> > 
> > > But! With 128kb shared buffers there is just 32 buffers. 32 entry for
> > > 32 freelist partition - certainly some freelist partition will certainly
> > > have 0 entry even if all entries are in freelists. 
> > 
> > Anyway, it's an extreme condition and the starvation happens only at a
> > neglegible ratio.
> > 
> > > > RETURNED: 2
> > > > ALLOCED: 0
> > > > BORROWED: 435
> > > > REUSED: 495444
> > > > ASSIGNED: 495467 (-23)
> > > > 
> > > > Now "BORROWED" happens 0.8% of REUSED
> > > 
> > > 0.08% actually :)
> > 
> > Mmm.  Doesn't matter:p
> > 
> > > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > > > ...
> > > > > > > Strange thing: both master and patched version has higher
> > > > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > > > than in first october version [1]. But lower tps at higher
> > > > > > > connections number (>= 191 clients).
> > > > > > > I'll try to bisect on master this unfortunate change.
> > ...
> > > I've checked. Looks like something had changed on the server, since
> > > old master commit behaves now same to new one (and differently to
> > > how it behaved in October).
> > > I remember maintainance downtime of the server in november/december.
> > > Probably, kernel were upgraded or some system settings were changed.
> > 
> > One thing I have a little concern is that numbers shows 1-2% of
> > degradation steadily for connection numbers < 17.
> > 
> > I think there are two possible cause of the degradation.
> > 
> > 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER.
> >   This might cause degradation for memory-contended use.
> > 
> > 2. nallocs operation might cause degradation on non-shared dynahasyes?
> >   I believe doesn't but I'm not sure.
> > 
> >   On a simple benchmarking with pgbench on a laptop, dynahash
> >   allocation (including shared and non-shared) happend about at 50
> >   times per second with 10 processes and 200 with 100 processes.
> > 
> > > > I don't think nalloced needs to be the same width to long.  For the
> > > > platforms with 32-bit long, anyway the possible degradation if any by
> > > > 64-bit atomic there doesn't matter.  So don't we always define the
> > > > atomic as 64bit and use the pg_atomic_* functions directly?
> > > 
> > > Some 32bit platforms has no native 64bit atomics. Then they are
> > > emulated with locks.
> > > 
> > > Well, and for 32bit platform long is just enough. Why spend other
> > > 4 bytes per each dynahash?
> > 
> > I don't think additional bytes doesn't matter, but emulated atomic
> > operations can matter. However I'm not sure which platform uses that
> > fallback implementations.  (x86 seems to have __sync_fetch_and_add()
> > since P4).
> > 
> > My opinion in the previous mail is that if that level of degradation
> > caued by emulated atomic operations matters, we shouldn't use atomic
> > there at all since atomic operations on t

Re: Speed up transaction completion faster after many relations are accessed in a transaction

2022-04-05 Thread Yura Sokolov
Good day, David.

I'm looking at the patch and don't understand a few points.

`GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into
`locallock->locallockowners`. It links it even though `owner` could be
NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`.
Why shouldn't that fail?

`GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`.
But there is a single `pfree(locallockowner)` in `LockReassignOwner`.
It looks like there should be more `pfree` calls. Shouldn't there be?

`GrantLockLocal` does `dlist_push_tail`, but isn't it better to
do `dlist_push_head`? Resource owners usually form a stack, so usually
when an owner searches for itself it is the last one added to the list.
Then `dlist_foreach` will find it sooner if it were added to the head.
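
A tiny illustration of that point (this is not from the patch; the struct is a
made-up stand-in for LOCALLOCKOWNER, and only the list API is real):

#include "postgres.h"
#include "lib/ilist.h"

typedef struct LocalLockOwnerSketch
{
	dlist_node	node;
	void	   *owner;			/* stand-in for a ResourceOwner */
} LocalLockOwnerSketch;

/* With dlist_push_head the most recently registered owner is found first. */
static LocalLockOwnerSketch *
find_owner(dlist_head *owners, void *owner)
{
	dlist_iter	iter;

	dlist_foreach(iter, owners)
	{
		LocalLockOwnerSketch *lo =
			dlist_container(LocalLockOwnerSketch, node, iter.cur);

		if (lo->owner == owner)
			return lo;
	}
	return NULL;
}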

regards

-

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com






Jumble Query with COERCE_SQL_SYNTAX

2022-03-29 Thread Yura Sokolov
Good day.

v14 introduced a way to get the original text for some kinds of expressions
using the new 'funcformat' - COERCE_SQL_SYNTAX:
- EXTRACT(part from timestamp)
- (text IS [form] NORMALIZED)
and others.

The mentioned EXTRACT and NORMALIZED expressions have parts that are not
usual arguments but some kind of syntax. At least, there is no way to:

PREPARE a(text) as select extract($1 from now());

But JumbleExpr doesn't distinguish this and marks the argument as a
variable constant, i.e. remembers it in 'clocations'.

I believe such a "non-variable constant" should not be jumbled as a
replaceable thing.

In our case (extended pg_stat_statements), we attempt to generalize the
plan and then explain the generalized plan. But using the constant list from
JumbleState we mistakenly replace the first argument of the EXTRACT expression
with a parameter. And then 'get_func_sql_syntax' fails on the assertion "first
argument is a text constant".

Sure, we could work around this in our plan mutator by skipping such a first
argument. But I wonder, is it correct at all not to count it as a
non-modifiable syntax part in JumbleExpr?

--

regards,

Sokolov Yura
y.soko...@postgrespro.ru
funny.fal...@gmail.com





Re: BufferAlloc: don't take two simultaneous locks

2022-03-20 Thread Yura Sokolov
On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov  
> wrote in 
> > В Ср, 16/03/2022 в 12:07 +0900, Kyotaro Horiguchi пишет:
> > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov 
> > >  wrote in 
> > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > > time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> > > returns the stashed)" case the stashed element is returned to its
> > > original partition.  But it is not what I mentioned.
> > > 
> > > On the other hand, once the stahsed element is reused by HASH_ENTER,
> > > it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> > > from old partition) case.  I suspect that ththat the frequent freelist
> > > starvation comes from the latter case.
> > 
> > Doubtfully. Due to probabilty theory, single partition doubdfully
> > will be too overflowed. Therefore, freelist.
> 
> Yeah.  I think so generally.
> 
> > But! With 128kb shared buffers there is just 32 buffers. 32 entry for
> > 32 freelist partition - certainly some freelist partition will certainly
> > have 0 entry even if all entries are in freelists. 
> 
> Anyway, it's an extreme condition and the starvation happens only at a
> neglegible ratio.
> 
> > > RETURNED: 2
> > > ALLOCED: 0
> > > BORROWED: 435
> > > REUSED: 495444
> > > ASSIGNED: 495467 (-23)
> > > 
> > > Now "BORROWED" happens 0.8% of REUSED
> > 
> > 0.08% actually :)
> 
> Mmm.  Doesn't matter:p
> 
> > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > > ...
> > > > > > Strange thing: both master and patched version has higher
> > > > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > > > than in first october version [1]. But lower tps at higher
> > > > > > connections number (>= 191 clients).
> > > > > > I'll try to bisect on master this unfortunate change.
> ...
> > I've checked. Looks like something had changed on the server, since
> > old master commit behaves now same to new one (and differently to
> > how it behaved in October).
> > I remember maintainance downtime of the server in november/december.
> > Probably, kernel were upgraded or some system settings were changed.
> 
> One thing I have a little concern is that numbers shows 1-2% of
> degradation steadily for connection numbers < 17.
> 
> I think there are two possible cause of the degradation.
> 
> 1. Additional branch by consolidating HASH_ASSIGN into HASH_ENTER.
>   This might cause degradation for memory-contended use.
> 
> 2. nallocs operation might cause degradation on non-shared dynahasyes?
>   I believe doesn't but I'm not sure.
> 
>   On a simple benchmarking with pgbench on a laptop, dynahash
>   allocation (including shared and non-shared) happend about at 50
>   times per second with 10 processes and 200 with 100 processes.
> 
> > > I don't think nalloced needs to be the same width to long.  For the
> > > platforms with 32-bit long, anyway the possible degradation if any by
> > > 64-bit atomic there doesn't matter.  So don't we always define the
> > > atomic as 64bit and use the pg_atomic_* functions directly?
> > 
> > Some 32bit platforms has no native 64bit atomics. Then they are
> > emulated with locks.
> > 
> > Well, and for 32bit platform long is just enough. Why spend other
> > 4 bytes per each dynahash?
> 
> I don't think additional bytes doesn't matter, but emulated atomic
> operations can matter. However I'm not sure which platform uses that
> fallback implementations.  (x86 seems to have __sync_fetch_and_add()
> since P4).
> 
> My opinion in the previous mail is that if that level of degradation
> caued by emulated atomic operations matters, we shouldn't use atomic
> there at all since atomic operations on the modern platforms are not
> also free.
> 
> In relation to 2 above, if we observe that the degradation disappears
> by (tentatively) use non-atomic operations for nalloced, we should go
> back to the previous per-freelist nalloced.

Here is a version with nalloced being a union of the appropriate atomic and
long.

--

regards
Yura Sokolov
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] [PGPRO-5616] bufmgr: do not acquire t

Declare PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY for aarch64

2022-03-16 Thread Yura Sokolov
Good day, hackers.

Architecture Reference Manual for ARMv8 B2.2.1 [1] states:

  For explicit memory effects generated from an Exception level the
  following rules apply:
  - A read that is generated by a load instruction that loads a single
  general-purpose register and is aligned to the size of the read in the
  instruction is single-copy atomic.
  - A write that is generated by a store instruction that stores a single
  general-purpose register and is aligned to the size of the write in the
  instruction is single-copy atomic.

So I believe it is safe to define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
for aarch64

[1] https://documentation-service.arm.com/static/61fbe8f4fa8173727a1b734e
https://developer.arm.com/documentation/ddi0487/latest

---

regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From b61b065acba570f1f935ff6ca17dc687c45db2f2 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Wed, 16 Mar 2022 15:12:07 +0300
Subject: [PATCH v0] Declare aarch64 has single copy atomicity for 8 byte
 values.

Architecture Reference Manual for ARMv8 B2.2.1 [1]

For explicit memory effects generated from an Exception level the
following rules apply:
- A read that is generated by a load instruction that loads a single
general-purpose register and is aligned to the size of the read in the
instruction is single-copy atomic.
- A write that is generated by a store instruction that stores a single
general-purpose register and is aligned to the size of the write in the
instruction is single-copy atomic.

[1] https://documentation-service.arm.com/static/61fbe8f4fa8173727a1b734e
https://developer.arm.com/documentation/ddi0487/latest
---
 src/include/port/atomics/arch-arm.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/include/port/atomics/arch-arm.h b/src/include/port/atomics/arch-arm.h
index 2083e3230db..9fe8f1b95f7 100644
--- a/src/include/port/atomics/arch-arm.h
+++ b/src/include/port/atomics/arch-arm.h
@@ -23,4 +23,10 @@
  */
 #if !defined(__aarch64__) && !defined(__aarch64)
 #define PG_DISABLE_64_BIT_ATOMICS
+#else
+/*
+ * Architecture Reference Manual for ARMv8 states aligned read/write to/from
+ * general purpose register is atomic.
+ */
+#define PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
 #endif  /* __aarch64__ || __aarch64 */
-- 
2.35.1



Re: BufferAlloc: don't take two simultaneous locks

2022-03-16 Thread Yura Sokolov
On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov  
> wrote in 
> > В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > > Hmm. v8 returns stashed element with original patition index when the
> > > element is *not* reused.  But what I saw in the previous test runs is
> > > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > > behaving the same way (or somehow even worse) with the previous
> > > version.
> > 
> > v8 doesn't differ in REMOVE case neither from master nor from
> > previous version. It differs in RETURNED case only.
> > Or I didn't understand what you mean :(
> 
> In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> the freelist_idx of the new key.  v8 uses that of the old key (at the
> time of HASH_REUSE).  So in the case "REUSE->ENTER(elem exists and
> returns the stashed)" case the stashed element is returned to its
> original partition.  But it is not what I mentioned.
> 
> On the other hand, once the stahsed element is reused by HASH_ENTER,
> it gives the same resulting state with HASH_REMOVE->HASH_ENTER(borrow
> from old partition) case.  I suspect that ththat the frequent freelist
> starvation comes from the latter case.

Doubtfully. Due to probability theory, a single partition will doubtfully
be too overflowed. Hence the freelist.

But! With 128kB shared buffers there are just 32 buffers. 32 entries for
32 freelist partitions - certainly some freelist partition will
have 0 entries even if all entries are in freelists.
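
A quick standalone illustration of that claim (not PostgreSQL code; it assumes
entries land in freelists roughly uniformly at random, which is approximately
what hashing gives): with 32 entries over 32 freelists, on average about
32*(31/32)^32 ~= 11.6 freelists end up empty.

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int			trials = 100000;
	long		empty_total = 0;

	srand(12345);
	for (int t = 0; t < trials; t++)
	{
		int			count[32] = {0};

		/* assign 32 entries to 32 freelists at random */
		for (int e = 0; e < 32; e++)
			count[rand() % 32]++;
		/* count how many freelists got nothing */
		for (int f = 0; f < 32; f++)
			if (count[f] == 0)
				empty_total++;
	}
	printf("average empty freelists: %.2f of 32\n",
		   (double) empty_total / trials);
	return 0;
}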

> > > get_hash_entry continuously suffer lack of freelist
> > > entry. (FWIW, attached are the test-output diff for both master and
> > > patched)
> > > 
> > > master finally allocated 31 fresh elements for a 100s run.
> > > 
> > > > ALLOCED: 31;; freshly allocated
> > > 
> > > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > > allocated (ah, this version changes that..)
> > > Finally v8 results in:
> > > 
> > > > RETURNED: 50806;; returned stashed elements
> > > > BORROWED: 33620;; borrowed from another freelist
> > > > REUSED: 1812664;; stashed
> > > > ASSIGNED: 1762377  ;; reused
> > > >(ALLOCED: 0);; freshly allocated
> 
> (I misunderstand that v8 modified get_hash_entry's preference between
> allocation and borrowing.)
> 
> I re-ran the same check for v7 and it showed different result.
> 
> RETURNED: 1
> ALLOCED: 15
> BORROWED: 0
> REUSED: 505435
> ASSIGNED: 505462 (-27)  ## the counters are not locked.
> 
> > Is there any measurable performance hit cause of borrowing?
> > Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> > shared buffers that is extremely small. (Or it was 128MB?)
> 
> It is intentional set small to get extremely frequent buffer
> replacements.  The point here was the patch actually can induce
> frequent freelist starvation.  And as you do, I also doubt the
> significance of the performance hit by that.  Just I was not usre.
>
> I re-ran the same for v8 and got a result largely different from the
> previous trial on the same v8.
> 
> RETURNED: 2
> ALLOCED: 0
> BORROWED: 435
> REUSED: 495444
> ASSIGNED: 495467 (-23)
> 
> Now "BORROWED" happens 0.8% of REUSED

0.08% actually :)

> 
> > Well, I think some spare entries could reduce borrowing if there is
> > a need. I'll test on 128MB with spare entries. If there is profit,
> > I'll return some, but will keep SharedBufHash fixed.
> 
> I don't doubt the benefit of this patch.  And now convinced by myself
> that the downside is negligible than the benefit.
> 
> > Master branch does less freelist manipulations since it  tries to
> > insert first and if there is collision it doesn't delete victim
> > buffer.
> > 
> > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > ...
> > > > Strange thing: both master and patched version has higher
> > > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > > than in first october version [1]. But lower tps at higher
> > > > connections number (>= 191 clients).
> > > > I'll try to bisect on master this unfortunate change.
> > > 
> > > The reversing of the preference order between freshly-allocation and
> > > borrow-from-another-freelist might affect.
> > 
> > `master` changed its behaviour as well.
> > It is not problem of the patch

Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Yura Sokolov
On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > Thanks for the new version.
> > 
> > At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov  
> > wrote in 
> > > В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > > > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov 
> > > > >  wrote in 
> > > > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > > > !element_alloc() path and always allocate a fresh entry, then
> > > > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > > > run.
> > > > > 
> > > > > Then, I tried the same with the patch, and I am surprized to see that
> > > > > the rise of the number of newly allocated elements didn't stop and
> > > > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > > > concern was valid.  The change in dynahash actually
> > > > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > > > sure how much the impact given on performance if we change
> > > > > get_hash_entry to prefer other freelists, though.
> > > > 
> > > > Well, it is quite strange SharedBufHash is not allocated as
> > > > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > > > I'll try as well.
> > > > 
> > > > Other way to reduce observed case is to remember freelist_idx for
> > > > reused entry. I didn't believe it matters much since entries migrated
> > > > netherless, but probably due to some hot buffers there are tention to
> > > > crowd particular freelist.
> > > 
> > > Well, I did both. Everything looks ok.
> > 
> > Hmm. v8 returns stashed element with original patition index when the
> > element is *not* reused.  But what I saw in the previous test runs is
> > the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> > behaving the same way (or somehow even worse) with the previous
> > version.
> 
> v8 doesn't differ in REMOVE case neither from master nor from
> previous version. It differs in RETURNED case only.
> Or I didn't understand what you mean :(
> 
> > get_hash_entry continuously suffer lack of freelist
> > entry. (FWIW, attached are the test-output diff for both master and
> > patched)
> > 
> > master finally allocated 31 fresh elements for a 100s run.
> > 
> > > ALLOCED: 31;; freshly allocated
> > 
> > v8 finally borrowed 33620 times from another freelist and 0 freshly
> > allocated (ah, this version changes that..)
> > Finally v8 results in:
> > 
> > > RETURNED: 50806;; returned stashed elements
> > > BORROWED: 33620;; borrowed from another freelist
> > > REUSED: 1812664;; stashed
> > > ASSIGNED: 1762377  ;; reused
> > > (ALLOCED: 0);; freshly allocated
> > 
> > It contains a huge degradation by frequent elog's so they cannot be
> > naively relied on, but it should show what is happening sufficiently.
> 
> Is there any measurable performance hit cause of borrowing?
> Looks like "borrowed" happened in 1.5% of time. And it is on 128kb
> shared buffers that is extremely small. (Or it was 128MB?)
> 
> Well, I think some spare entries could reduce borrowing if there is
> a need. I'll test on 128MB with spare entries. If there is profit,
> I'll return some, but will keep SharedBufHash fixed.

Well, I added GetMaxBackends spare items, but I don't see a clear
profit. It is probably a bit better at 128MB shared buffers and
probably a bit worse at 1GB shared buffers (select_only on scale 100).

But that is on the old Xeon X5675. Probably things will change on more
capable hardware. I just don't have access at the moment.

> 
> Master branch does less freelist manipulations since it  tries to
> insert first and if there is collision it doesn't delete victim
> buffer.
> 

-

regards
Yura





Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Yura Sokolov
On Tue, 15/03/2022 at 13:47 +0300, Yura Sokolov wrote:
> В Вт, 15/03/2022 в 16:25 +0900, Kyotaro Horiguchi пишет:
> > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > ...
> > > Strange thing: both master and patched version has higher
> > > peak tps at X5676 at medium connections (17 or 27 clients)
> > > than in first october version [1]. But lower tps at higher
> > > connections number (>= 191 clients).
> > > I'll try to bisect on master this unfortunate change.
> > 
> > The reversing of the preference order between freshly-allocation and
> > borrow-from-another-freelist might affect.
> 
> `master` changed its behaviour as well.
> It is not problem of the patch at all.

Looks like there is no issue: the old commit 2d44dee0281a1abf
behaves similarly to the new one at the moment.

I think something changed in the environment.
I remember there was maintenance downtime in the autumn.
Perhaps the kernel was updated or some sysctl tuning changed.



regards
Yura.





Re: BufferAlloc: don't take two simultaneous locks

2022-03-15 Thread Yura Sokolov
On Tue, 15/03/2022 at 16:25 +0900, Kyotaro Horiguchi wrote:
> Thanks for the new version.
> 
> At Tue, 15 Mar 2022 08:07:39 +0300, Yura Sokolov  
> wrote in 
> > В Пн, 14/03/2022 в 14:57 +0300, Yura Sokolov пишет:
> > > В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > > > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov 
> > > >  wrote in 
> > > > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > > > 128kB shared buffers and I saw that get_hash_entry never takes the
> > > > !element_alloc() path and always allocate a fresh entry, then
> > > > saturates at 30 new elements allocated at the medium of a 100 seconds
> > > > run.
> > > > 
> > > > Then, I tried the same with the patch, and I am surprized to see that
> > > > the rise of the number of newly allocated elements didn't stop and
> > > > went up to 511 elements after the 100 seconds run.  So I found that my
> > > > concern was valid.  The change in dynahash actually
> > > > continuously/repeatedly causes lack of free list entries.  I'm not
> > > > sure how much the impact given on performance if we change
> > > > get_hash_entry to prefer other freelists, though.
> > > 
> > > Well, it is quite strange SharedBufHash is not allocated as
> > > HASH_FIXED_SIZE. Could you check what happens with this flag set?
> > > I'll try as well.
> > > 
> > > Other way to reduce observed case is to remember freelist_idx for
> > > reused entry. I didn't believe it matters much since entries migrated
> > > netherless, but probably due to some hot buffers there are tention to
> > > crowd particular freelist.
> > 
> > Well, I did both. Everything looks ok.
> 
> Hmm. v8 returns stashed element with original patition index when the
> element is *not* reused.  But what I saw in the previous test runs is
> the REUSE->ENTER(reuse)(->REMOVE) case.  So the new version looks like
> behaving the same way (or somehow even worse) with the previous
> version.

In the REMOVE case v8 doesn't differ from either master or the
previous version. It differs in the RETURNED case only.
Or I didn't understand what you mean :(

> get_hash_entry continuously suffer lack of freelist
> entry. (FWIW, attached are the test-output diff for both master and
> patched)
> 
> master finally allocated 31 fresh elements for a 100s run.
> 
> > ALLOCED: 31;; freshly allocated
> 
> v8 finally borrowed 33620 times from another freelist and 0 freshly
> allocated (ah, this version changes that..)
> Finally v8 results in:
> 
> > RETURNED: 50806;; returned stashed elements
> > BORROWED: 33620;; borrowed from another freelist
> > REUSED: 1812664;; stashed
> > ASSIGNED: 1762377  ;; reused
> >(ALLOCED: 0);; freshly allocated
> 
> It contains a huge degradation by frequent elog's so they cannot be
> naively relied on, but it should show what is happening sufficiently.

Is there any measurable performance hit caused by borrowing?
Looks like "borrowed" happened 1.5% of the time. And that is with 128kB
shared buffers, which is extremely small. (Or was it 128MB?)

Well, I think some spare entries could reduce borrowing if there is
a need. I'll test on 128MB with spare entries. If there is profit,
I'll return some, but will keep SharedBufHash fixed.

The master branch does fewer freelist manipulations since it tries to
insert first, and if there is a collision it doesn't delete the victim
buffer.

> > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> ...
> > Strange thing: both master and patched version has higher
> > peak tps at X5676 at medium connections (17 or 27 clients)
> > than in first october version [1]. But lower tps at higher
> > connections number (>= 191 clients).
> > I'll try to bisect on master this unfortunate change.
> 
> The reversing of the preference order between freshly-allocation and
> borrow-from-another-freelist might affect.

`master` changed its behaviour as well.
It is not a problem of the patch at all.

--

regards
Yura.





Re: BufferAlloc: don't take two simultaneous locks

2022-03-14 Thread Yura Sokolov
On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:
> В Пн, 14/03/2022 в 17:12 +0900, Kyotaro Horiguchi пишет:
> > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov  
> > wrote in 
> > > В Пн, 14/03/2022 в 14:31 +0900, Kyotaro Horiguchi пишет:
> > > > I'd like to ask you to remove nalloced from partitions then add a
> > > > global atomic for the same use?
> > > 
> > > I really believe it should be global. I made it per-partition to
> > > not overcomplicate first versions. Glad you tell it.
> > > 
> > > I thought to protect it with freeList[0].mutex, but probably atomic
> > > is better idea here. But which atomic to chose: uint64 or uint32?
> > > Based on sizeof(long)?
> > > Ok, I'll do in next version.
> > 
> > Current nentries is a long (= int64 on CentOS). And uint32 can support
> > roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
> > enough.  So it would be uint64.
> > 
> > > Whole get_hash_entry look strange.
> > > Doesn't it better to cycle through partitions and only then go to
> > > get_hash_entry?
> > > May be there should be bitmap for non-empty free lists? 32bit for
> > > 32 partitions. But wouldn't bitmap became contention point itself?
> > 
> > The code puts significance on avoiding contention caused by visiting
> > freelists of other partitions.  And perhaps thinks that freelist
> > shortage rarely happen.
> > 
> > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > 128kB shared buffers and I saw that get_hash_entry never takes the
> > !element_alloc() path and always allocate a fresh entry, then
> > saturates at 30 new elements allocated at the medium of a 100 seconds
> > run.
> > 
> > Then, I tried the same with the patch, and I am surprized to see that
> > the rise of the number of newly allocated elements didn't stop and
> > went up to 511 elements after the 100 seconds run.  So I found that my
> > concern was valid.  The change in dynahash actually
> > continuously/repeatedly causes lack of free list entries.  I'm not
> > sure how much the impact given on performance if we change
> > get_hash_entry to prefer other freelists, though.
> 
> Well, it is quite strange SharedBufHash is not allocated as
> HASH_FIXED_SIZE. Could you check what happens with this flag set?
> I'll try as well.
> 
> Other way to reduce observed case is to remember freelist_idx for
> reused entry. I didn't believe it matters much since entries migrated
> nevertheless, but probably due to some hot buffers there is a tendency to
> crowd a particular freelist.

Well, I did both. Everything looks ok.

> > By the way, there's the following comment in StrategyInitalize.
> > 
> > >* Initialize the shared buffer lookup hashtable.
> > >*
> > >* Since we can't tolerate running out of lookup table entries, we 
> > > must be
> > >* sure to specify an adequate table size here.  The maximum 
> > > steady-state
> > >* usage is of course NBuffers entries, but BufferAlloc() tries to 
> > > insert
> > >* a new entry before deleting the old.  In principle this could be
> > >* happening in each partition concurrently, so we could need as 
> > > many as
> > >* NBuffers + NUM_BUFFER_PARTITIONS entries.
> > >*/
> > >   InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> > 
> > "but BufferAlloc() tries to insert a new entry before deleting the
> > old." gets false by this patch but still need that additional room for
> > stashed entries.  It seems like needing a fix.

Removed the whole paragraph because a fixed table without extra entries works
just fine.
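
For illustration, a minimal sketch of what a fixed-size SharedBufHash means in
buf_table.c terms (field names follow current PostgreSQL sources; the exact
sizing is the point under test here, so treat it as an assumption):

    HASHCTL  info;

    info.keysize = sizeof(BufferTag);
    info.entrysize = sizeof(BufferLookupEnt);
    info.num_partitions = NUM_BUFFER_PARTITIONS;

    /* size the table exactly: no spare entries, no element_alloc() growth */
    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  NBuffers, NBuffers,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS |
                                  HASH_PARTITION | HASH_FIXED_SIZE);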

I lost access to Xeon 8354H, so returned to old Xeon X5675.

128MB and 1GB shared buffers
pgbench with scale 100
select_only benchmark, unix sockets.

Notebook i7-1165G7:


  conns | master |     v8 | master 1G |  v8 1G
--------+--------+--------+-----------+--------
      1 |  29614 |  29285 |     32413 |  32784
      2 |  58541 |  60052 |     65851 |  65938
      3 |  91126 |  90185 |    101404 | 101956
      5 | 135809 | 133670 |    143783 | 143471
      7 | 155547 | 153568 |    162566 | 162361
     17 | 221794 | 218143 |    250562 | 250136
     27 | 213742 | 211226 |    241806 | 242594
     53 | 216067 | 214792 |    245868 | 246269
     83 | 216610 | 218261 |    246798 | 250515
    107 | 216169 | 216656 |    248424 | 250105
13

Re: BufferAlloc: don't take two simultaneous locks

2022-03-14 Thread Yura Sokolov
On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov  
> wrote in 
> > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > I'd like to ask you to remove nalloced from partitions then add a
> > > global atomic for the same use?
> > 
> > I really believe it should be global. I made it per-partition to
> > not overcomplicate first versions. Glad you tell it.
> > 
> > I thought to protect it with freeList[0].mutex, but probably atomic
> > is better idea here. But which atomic to chose: uint64 or uint32?
> > Based on sizeof(long)?
> > Ok, I'll do in next version.
> 
> Current nentries is a long (= int64 on CentOS). And uint32 can support
> roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
> enough.  So it would be uint64.
> 
> > Whole get_hash_entry look strange.
> > Doesn't it better to cycle through partitions and only then go to
> > get_hash_entry?
> > May be there should be bitmap for non-empty free lists? 32bit for
> > 32 partitions. But wouldn't bitmap became contention point itself?
> 
> The code puts significance on avoiding contention caused by visiting
> freelists of other partitions.  And perhaps thinks that freelist
> shortage rarely happen.
> 
> I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> 128kB shared buffers and I saw that get_hash_entry never takes the
> !element_alloc() path and always allocate a fresh entry, then
> saturates at 30 new elements allocated at the medium of a 100 seconds
> run.
> 
> Then, I tried the same with the patch, and I am surprized to see that
> the rise of the number of newly allocated elements didn't stop and
> went up to 511 elements after the 100 seconds run.  So I found that my
> concern was valid.  The change in dynahash actually
> continuously/repeatedly causes lack of free list entries.  I'm not
> sure how much the impact given on performance if we change
> get_hash_entry to prefer other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set?
I'll try as well.

Another way to reduce the observed case is to remember freelist_idx for
the reused entry. I didn't believe it matters much since entries migrate
nevertheless, but probably due to some hot buffers there is a tendency to
crowd a particular freelist.

> By the way, there's the following comment in StrategyInitalize.
> 
> >* Initialize the shared buffer lookup hashtable.
> >*
> >* Since we can't tolerate running out of lookup table entries, we 
> > must be
> >* sure to specify an adequate table size here.  The maximum 
> > steady-state
> >* usage is of course NBuffers entries, but BufferAlloc() tries to 
> > insert
> >* a new entry before deleting the old.  In principle this could be
> >* happening in each partition concurrently, so we could need as many 
> > as
> >* NBuffers + NUM_BUFFER_PARTITIONS entries.
> >*/
> >   InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> 
> "but BufferAlloc() tries to insert a new entry before deleting the
> old." gets false by this patch but still need that additional room for
> stashed entries.  It seems like needing a fix.
> 
> 
> 
> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center





Re: BufferAlloc: don't take two simultaneous locks

2022-03-14 Thread Yura Sokolov
On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> At Mon, 14 Mar 2022 09:39:48 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > I'll examine the possibility to resolve this...
> 
> The existence of nfree and nalloc made me confused and I found the
> reason.
> 
> In the case where a partition collects many REUSE-ASSIGN-REMOVEed
> elements from other partitions, nfree gets larger than nalloced.  This
> is a strange point of the two counters.  nalloced is only referred to
> as (sum(nalloced[])).  So we don't need nalloced per-partition basis
> and the formula to calculate the number of used elements would be as
> follows.
> 
>  sum(nalloced - nfree)  =  (total nalloced)  -  sum(nfree)
> 
> We rarely create fresh elements in shared hashes so I don't think
> there's additional contention on the nalloced counter even if it were
> a global atomic.
> 
> So, the remaining issue is the possible imbalance among
> partitions.  On second thought, by the current way, if there's a bad
> deviation in partition usage, a heavily hit partition finally collects
> elements via get_hash_entry().  By the patch's way, a similar thing
> happens via the REUSE-ASSIGN-REMOVE sequence. But buffers once used
> for something won't be freed until buffer invalidation. And bulk
> buffer invalidation won't unevenly distribute freed buffers among
> partitions.  So I conclude for now that it is a non-issue.
> 
> So my opinion on the counters is:
> 
> I'd like to ask you to remove nalloced from partitions then add a
> global atomic for the same use?

I really believe it should be global. I made it per-partition to
not overcomplicate the first versions. Glad you mention it.

I thought to protect it with freeList[0].mutex, but probably an atomic
is a better idea here. But which atomic to choose: uint64 or uint32?
Based on sizeof(long)?
Ok, I'll do it in the next version.
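
A rough sketch of the global-counter variant being discussed, assuming
pg_atomic_uint64 from port/atomics.h and the nfree field introduced by this
patch (field placement and names are illustrative, not the final patch):

    /* in HASHHDR: one shared counter instead of nalloced per freelist */
    pg_atomic_uint64 nalloced;      /* total number of entries ever allocated */

    /* element_alloc() bumps it once per allocation batch */
    pg_atomic_fetch_add_u64(&hctl->nalloced, nelem);

    /* hash_get_num_entries() becomes: total allocated minus total free */
    sum = (long) pg_atomic_read_u64(&hctl->nalloced);
    for (i = 0; i < NUM_FREELISTS; i++)
        sum -= hctl->freeList[i].nfree;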

The whole of get_hash_entry looks strange.
Isn't it better to cycle through the partitions and only then go to
element_alloc?
Maybe there should be a bitmap of non-empty free lists? 32 bits for
32 partitions. But wouldn't the bitmap become a contention point itself?
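
And a speculative sketch of that bitmap idea, purely to make the question
concrete (this is not part of any posted patch; the mask would have to live
in the shared HASHHDR, called freelist_nonempty here only for illustration):

    /* one bit per freelist; set when a freelist becomes non-empty */
    pg_atomic_fetch_or_u32(&hctl->freelist_nonempty, (uint32) 1 << idx);

    /* clear the bit when the last free element is taken */
    pg_atomic_fetch_and_u32(&hctl->freelist_nonempty, ~((uint32) 1 << idx));

    /* get_hash_entry() could then pick a donor without probing all 32 */
    mask = pg_atomic_read_u32(&hctl->freelist_nonempty);
    if (mask != 0)
        borrow_from = pg_rightmost_one_pos32(mask);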

> No need to do something for the possible deviation issue.

---

regards
Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Yura Sokolov
On Sun, 13/03/2022 at 07:05 -0700, Zhihong Yu wrote:
> 
> Hi,
> In the description:
> 
> There is no need to hold both lock simultaneously. 
> 
> both lock -> both locks

Thanks.

> +* We also reset the usage_count since any recency of use of the old
> 
> recency of use -> recent use

Thanks.

> +BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> 
> Later on, there is code:
> 
> +   reuse ? HASH_REUSE : HASH_REMOVE,
> 
> Can flag (such as HASH_REUSE) be passed to BufTableDelete() instead of bool ? 
> That way, flag can be used directly in the above place.

No.
The BufTable* functions were created to abstract the buffer table from
dynahash. Passing HASH_REUSE directly would break that abstraction.
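
For reference, a sketch of how the flag maps onto dynahash inside
BufTableDelete, following the hunk quoted above (the surrounding error
handling is the usual buf_table.c pattern, reproduced here from memory):

void
BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
{
    BufferLookupEnt *result;

    result = (BufferLookupEnt *)
        hash_search_with_hash_value(SharedBufHash,
                                    (void *) tagPtr,
                                    hashcode,
                                    reuse ? HASH_REUSE : HASH_REMOVE,
                                    NULL);

    if (!result)                /* shouldn't happen */
        elog(ERROR, "shared buffer hash table corrupted");
}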

> +   long        nalloced;   /* number of entries initially allocated for
> 
> nalloced isn't very long. I think it would be better to name the field
> 'nallocated'.

It is debatable.
Why not num_allocated? allocated_count? number_of_allocations?
The same point applies to nfree.
`nalloced` is recognizable and unambiguous. And there are a lot
of `*alloced` identifiers in the PostgreSQL source, so this one will not
be unusual.

I don't see the need to make it longer.

But if someone supports your point, I will not mind changing
the name.

> +   sum += hashp->hctl->freeList[i].nalloced;
> +   sum -= hashp->hctl->freeList[i].nfree;
> 
> I think it would be better to calculate the difference between nalloced and 
> nfree first, then add the result to sum (to avoid overflow).

It doesn't really matter much, because the calculation must be valid
even if all nfree == 0.

I'd rather debate the use of 'long' in dynahash at all: 'long' is
32-bit on 64-bit Windows. It would be better to use 'Size' here.

But 'nelements' was 'long', so I didn't change things. I think
that is a subject for another patch.

(On the other hand, a dynahash with 2**31 elements takes at least
512GB of RAM... we would hardly trigger the problem before the OOM
killer comes. Does Windows have an OOM killer?)

> Subject: [PATCH 3/3] reduce memory allocation for non-partitioned dynahash
> 
> memory allocation -> memory allocations

For each dynahash instance a single allocation was reduced.
I think 'memory allocation' is correct.

The plural would be
"reduce memory allocations for non-partitioned dynahashes",
i.e. both 'allocations' and 'dynahashes'.
Am I wrong?


--

regards
Yura Sokolov
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both locks simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * tag first, so we need to have exclusive lock on the old mapping
+		 * partitions.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode();
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-/* only one partition, only one lock */
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to 

Re: BufferAlloc: don't take two simultaneous locks

2022-03-13 Thread Yura Sokolov
On Fri, 11/03/2022 at 17:21 +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Mar 2022 15:49:49 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > Thanks!  I looked into dynahash part.
> > > 
> > >  struct HASHHDR
> > >  {
> > > -   /*
> > > -* The freelist can become a point of contention in 
> > > high-concurrency hash
> > > 
> > > Why did you move around the freeList?

This way it is possible to allocate just the first freelist partition, not all 32 partitions.
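
A sketch of the sizing this enables (the helper name is made up for
illustration; the third commit in the series does the equivalent inline):

/* With freeList[] moved to the end of HASHHDR, a non-partitioned table only
 * needs room for freeList[0]; a partitioned one needs all NUM_FREELISTS. */
static Size
hashhdr_size(bool partitioned)
{
    return partitioned ? sizeof(HASHHDR) : offsetof(HASHHDR, freeList[1]);
}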

> 
> Then I looked into bufmgr part.  It looks fine to me but I have some
> comments on code comments.
> 
> >* To change the association of a valid buffer, we'll need to 
> > have
> >* exclusive lock on both the old and new mapping partitions.
> >   if (oldFlags & BM_TAG_VALID)
> 
> We don't take lock on the new mapping partition here.

Thx, fixed.

> +* Clear out the buffer's tag and flags.  We must do this to ensure 
> that
> +* linear scans of the buffer array don't think the buffer is valid. 
> We
> +* also reset the usage_count since any recency of use of the old 
> content
> +* is no longer relevant.
> +*
> +* We are single pinner, we hold buffer header lock and exclusive
> +* partition lock (if tag is valid). Given these statements it is 
> safe to
> +* clear tag since no other process can inspect it to the moment.
> 
> This comment is a merger of the comments from InvalidateBuffer and
> BufferAlloc.  But I think what we need to explain here is why we
> invalidate the buffer here despite of we are going to reuse it soon.
> And I think we need to state that the old buffer is now safe to use
> for the new tag here.  I'm not sure the statement is really correct
> but clearing-out actually looks like safer.

I've tried to reformulate the comment block.

> 
> > Now it is safe to use victim buffer for new tag.  Invalidate the
> > buffer before releasing header lock to ensure that linear scans of
> > the buffer array don't think the buffer is valid.  It is safe
> > because it is guaranteed that we're the single pinner of the buffer.
> > That pin also prevents the buffer from being stolen by others until
> > we reuse it or return it to freelist.
> 
> So I want to revise the following comment.
> 
> -* Now it is safe to use victim buffer for new tag.
> +* Now reuse victim buffer for new tag.
> >* Make sure BM_PERMANENT is set for buffers that must be written at 
> > every
> >* checkpoint.  Unlogged buffers only need to be written at shutdown
> >* checkpoints, except for their "init" forks, which need to be 
> > treated
> >* just like permanent relations.
> >*
> >* The usage_count starts out at 1 so that the buffer can survive one
> >* clock-sweep pass.
> 
> But if you think the current commet is fine, I don't insist on the
> comment chagnes.

Used suggestion.

Fr, 11/03/22 Yura Sokolov wrote:
> On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> > BufTableDelete considers both reuse and !reuse cases but
> > BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> > odd. We should use HASH_ENTER here.  Thus I think it is more
> > reasonable that HASH_ENTRY uses the stashed entry if exists and
> > needed, or returns it to freelist if exists but not needed.
> > 
> > What do you think about this?
> 
> Well... I don't like it but I don't mind either.
> 
> Code in HASH_ENTER and HASH_ASSIGN cases differs much.
> On the other hand, probably it is possible to merge it carefuly.
> I'll try.

I've merged HASH_ASSIGN into HASH_ENTER.

As in the previous letter, the three commits are concatenated into one file
and can be applied with `git am`.

---

regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From fbec0dd7d9f11aeaeb8f141ad3dedab7178aeb2e Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both lock simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++--
 1 file changed, 96

Re: BufferAlloc: don't take two simultaneous locks

2022-03-11 Thread Yura Sokolov
On Fri, 11/03/2022 at 15:49 +0900, Kyotaro Horiguchi wrote:
> At Fri, 11 Mar 2022 15:30:30 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Thanks!  I looked into dynahash part.
> > 
> >  struct HASHHDR
> >  {
> > - /*
> > -  * The freelist can become a point of contention in high-concurrency 
> > hash
> > 
> > Why did you move around the freeList?
> > 
> > 
> > - longnentries;   /* number of entries in 
> > associated buckets */
> > + longnfree;  /* number of free entries in 
> > the list */
> > + longnalloced;   /* number of entries 
> > initially allocated for
> > 
> > Why do we need nfree?  HASH_ASSIGN should do the same thing with
> > HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> > bucket to different free list, but we can just remember the
> > freelist_idx for the detached bucket as we do for hashp.  I think that
> > should largely reduce the footprint of this patch.
> > 
> > -static void hdefault(HTAB *hashp);
> > +static void hdefault(HTAB *hashp, bool partitioned);
> > 
> > That optimization may work even a bit, but it is not irrelevant to
> > this patch?

(I forgot to answer this in the previous letter.)
Yes, the third commit is very optional. But adding `nalloced` to
`FreeListData` increases the allocation a lot even for usual
non-shared, non-partitioned dynahashes. And this allocation is
quite huge right now for no meaningful reason.

> > 
> > + case HASH_REUSE:
> > + if (currBucket != NULL)
> > + {
> > + /* check there is no unfinished 
> > HASH_REUSE+HASH_ASSIGN pair */
> > + Assert(DynaHashReuse.hashp == NULL);
> > + Assert(DynaHashReuse.element == NULL);
> > 
> > I think all cases in the switch(action) other than HASH_ASSIGN needs
> > this assertion and no need for checking both, maybe only for element
> > would be enough.
> 
> While I looked buf_table part, I came up with additional comments.
> 
> BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
> {
> hash_search_with_hash_value(SharedBufHash,
> 
> HASH_ASSIGN,
> ...
> BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
> 
> BufTableDelete considers both reuse and !reuse cases but
> BufTableInsert doesn't and always does HASH_ASSIGN.  That looks
> odd. We should use HASH_ENTER here.  Thus I think it is more
> reasonable that HASH_ENTRY uses the stashed entry if exists and
> needed, or returns it to freelist if exists but not needed.
> 
> What do you think about this?

Well... I don't like it but I don't mind either.

The code in the HASH_ENTER and HASH_ASSIGN cases differs a lot.
On the other hand, it is probably possible to merge them carefully.
I'll try.

-

regards

Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-03-11 Thread Yura Sokolov
On Fri, 11/03/2022 at 15:30 +0900, Kyotaro Horiguchi wrote:
> At Thu, 03 Mar 2022 01:35:57 +0300, Yura Sokolov  
> wrote in 
> > On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> > > Ok, here is v4.
> > 
> > And here is v5.
> > 
> > First, there was compilation error in Assert in dynahash.c .
> > Excuse me for not checking before sending previous version.
> > 
> > Second, I add third commit that reduces HASHHDR allocation
> > size for non-partitioned dynahash:
> > - moved freeList to last position
> > - alloc and memset offset(HASHHDR, freeList[1]) for
> >   non-partitioned hash tables.
> > I didn't benchmarked it, but I will be surprised if it
> > matters much in performance sence.
> > 
> > Third, I put all three commits into single file to not
> > confuse commitfest application.
> 
> Thanks!  I looked into dynahash part.
> 
>  struct HASHHDR
>  {
> -   /*
> -* The freelist can become a point of contention in high-concurrency 
> hash
> 
> Why did you move around the freeList?
> 
> 
> -   longnentries;   /* number of entries in 
> associated buckets */
> +   longnfree;  /* number of free entries in 
> the list */
> +   longnalloced;   /* number of entries 
> initially allocated for
> 
> Why do we need nfree?  HASH_ASSIGN should do the same thing with
> HASH_REMOVE.  Maybe the reason is the code tries to put the detached
> bucket to different free list, but we can just remember the
> freelist_idx for the detached bucket as we do for hashp.  I think that
> should largely reduce the footprint of this patch.

If we keep nentries, then we need to fix nentries in both the old
"freeList" partition and the new one. That is two freeList[partition]->mutex
lock+unlock pairs.

But the count of free elements doesn't change, so if we change nentries
to nfree, then there is no need to fix the freeList[partition]->nfree
counters and no need to lock+unlock.
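
To make that concrete, a sketch of the bookkeeping that the nentries scheme
would force on a cross-partition move, and that the nfree scheme avoids
(illustrative, not patch code):

/* with per-freelist nentries: two spinlock round-trips just for counters */
SpinLockAcquire(&hctl->freeList[old_idx].mutex);
hctl->freeList[old_idx].nentries--;
SpinLockRelease(&hctl->freeList[old_idx].mutex);

SpinLockAcquire(&hctl->freeList[new_idx].mutex);
hctl->freeList[new_idx].nentries++;
SpinLockRelease(&hctl->freeList[new_idx].mutex);

/* with nfree: the HASH_REUSE'd element never visits a freelist, the number
 * of free entries is unchanged, so nothing to update and nothing to lock */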

> 
> -static void hdefault(HTAB *hashp);
> +static void hdefault(HTAB *hashp, bool partitioned);
> 
> That optimization may work even a bit, but it is not irrelevant to
> this patch?
> 
> +   case HASH_REUSE:
> +   if (currBucket != NULL)
> +   {
> +   /* check there is no unfinished 
> HASH_REUSE+HASH_ASSIGN pair */
> +   Assert(DynaHashReuse.hashp == NULL);
> +   Assert(DynaHashReuse.element == NULL);
> 
> I think all cases in the switch(action) other than HASH_ASSIGN needs
> this assertion and no need for checking both, maybe only for element
> would be enough.

Agree.





Re: BufferAlloc: don't take two simultaneous locks

2022-03-02 Thread Yura Sokolov
On Tue, 01/03/2022 at 10:24 +0300, Yura Sokolov wrote:
> Ok, here is v4.

And here is v5.

First, there was a compilation error in an Assert in dynahash.c.
Excuse me for not checking before sending the previous version.

Second, I added a third commit that reduces the HASHHDR allocation
size for non-partitioned dynahash:
- moved freeList to the last position
- alloc and memset offsetof(HASHHDR, freeList[1]) bytes for
  non-partitioned hash tables.
I didn't benchmark it, but I will be surprised if it
matters much in a performance sense.

Third, I put all three commits into a single file so as not to
confuse the commitfest application.

 


regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From c1b8e6d60030d5d02287ae731ab604feeafa7486 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/3] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both lock simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 189 +---
 1 file changed, 89 insertions(+), 100 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..5d2781f4813 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1288,93 +1288,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode();
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-/* only one partition, only one lock */
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-oldPartitionLock != newPartitionLock)
-LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-/*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed.  We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID.  StartBufferIO does it all.
- */
-if (StartBufferIO(buf, true))
-{
-	/*
-	 * If we get here, previous attempts to read the buffer
-	 * must have failed ... but we shall bravely try again.
-	 */
-	*foundPtr = false;
-}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1382,40 +1305,113 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start

Re: BufferAlloc: don't take two simultaneous locks

2022-02-28 Thread Yura Sokolov
On Fri, 25/02/2022 at 09:38, Simon Riggs wrote:
> On Fri, 25 Feb 2022 at 09:24, Yura Sokolov  wrote:
> 
> > > This approach is cleaner than v1, but should also perform better
> > > because there will be a 1:1 relationship between a buffer and its
> > > dynahash entry, most of the time.
> > 
> > Thank you for suggestion. Yes, it is much clearer than my initial proposal.
> > 
> > Should I incorporate it to v4 patch? Perhaps, it could be a separate
> > commit in new version.
> 
> I don't insist that you do that, but since the API changes are a few
> hours work ISTM better to include in one patch for combined perf
> testing. It would be better to put all changes in this area into PG15
> than to split it across multiple releases.
> 
> > Why there is need for this? Which way backend could be forced to abort
> > between BufTableReuse and BufTableAssign in this code path? I don't
> > see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
> > something.
> 
> Sounds reasonable.

Ok, here is v4.
It comes with two commits: one for the BufferAlloc locking change and the
other for avoiding dynahash's freelist.

The buffer locking patch is the same as v2 with some comment changes,
i.e. it uses Lock+UnlockBufHdr.

For dynahash there are HASH_REUSE and HASH_ASSIGN, as suggested.
HASH_REUSE stores the deleted element into a per-process static variable.
HASH_ASSIGN uses this element instead of the freelist. If there's no
such stored element, it falls back to HASH_ENTER.
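
Roughly, the stash looks like this (names follow the DynaHashReuse asserts
quoted elsewhere in the thread; the bucket unlinking and the HASH_ENTER
fallback are elided, so treat this as an approximation, not the patch itself):

static struct
{
    HTAB        *hashp;         /* table the stashed element belongs to */
    HASHELEMENT *element;       /* element detached by HASH_REUSE */
    int          freelist_idx;  /* freelist it would otherwise return to */
} DynaHashReuse = {NULL, NULL, 0};

/* inside hash_search_with_hash_value(), for action == HASH_REUSE: detach the
 * found element and stash it instead of pushing it onto a freelist */
if (currBucket != NULL)
{
    Assert(DynaHashReuse.element == NULL);
    DynaHashReuse.hashp = hashp;
    DynaHashReuse.element = currBucket;
    DynaHashReuse.freelist_idx = freelist_idx;
    return (void *) ELEMENTKEY(currBucket);
}

/* HASH_ASSIGN consumes the stash; with no stash it behaves like HASH_ENTER */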

I've implemented Robert Haas's suggestion to count elements in freelists
instead of nentries:

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition.  So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com

It helps to avoid taking the freelist lock just to update the counters.
I made it by replacing "nentries" with "nfree" and adding
"nalloced" to each freelist. It also makes "hash_update_hash_key" valid
for a key that migrates between partitions.
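
So, per freelist, the bookkeeping described above looks roughly like this
(layout reconstructed from the hunks quoted in this thread; treat it as a
sketch):

typedef struct
{
    slock_t      mutex;     /* spinlock for this freelist */
    long         nfree;     /* number of free entries in the list */
    long         nalloced;  /* number of entries initially allocated */
    HASHELEMENT *freeList;  /* chain of free elements */
} FreeListData;

/* used entries = sum(nalloced) - sum(nfree), with no freelist lock needed
 * just to keep a per-partition "nentries" exact on every insert/remove */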

I believe there is no need for "nalloced" in each freelist; instead a
single such field should be in HASHHDR. Moreover, it seems to me the
`element_alloc` function does not need to acquire the freelist partition
lock, since for a shared hash table it is called only during initialization.
Am I right?

I didn't go down this path in v4 for simplicity, but I can put it into v5
if approved.

To be honest, the "reuse" patch gives little improvement. But it is still
measurable at some connection counts.

I tried to reduce the number of freelist partitions to 8, but it has a mixed
impact. Most of the time performance is the same, but sometimes a bit lower.
I didn't investigate the reasons. Perhaps they are not related to the buffer
manager.

I didn't introduce the new functions BufTableReuse and BufTableAssign
since there is a single call to BufTableInsert and two calls to
BufTableDelete. So I reused these functions and just added a "reuse" flag
to BufTableDelete.

Tests simple_select for Xeon 8354H, 128MB and 1G shared buffers
for scale 100.

1 socket:
  conns | master | patch_v4 | master 1G | patch_v4 1G
--------+--------+----------+-----------+-------------
      1 |  41975 |    41540 |     52898 |       52213
      2 |  77693 |    77908 |     97571 |       98371
      3 | 114713 |   115522 |    142709 |      145226
      5 | 188898 |   187617 |    239322 |      237269
      7 | 261516 |   260006 |    329119 |      329449
     17 | 521821 |   519473 |    672390 |      662106
     27 | 555487 |   555697 |    674630 |      672736
     53 | 868213 |   896539 |   1190734 |     1202505
     83 | 868232 |   866029 |   1164997 |     1158719
    107 | 850477 |   845685 |   1140597 |     1134502
    139 | 816311 |   816808 |   1101471 |     1091258
    163 | 794788 |   796517 |   1078445 |     1071568
    191 | 765934 |   776185 |   1059497 |     1041944
    211 | 738656 |   777365 |   1083356 |     1046422
    239 | 713124 |   841337 |   1104629 |     1116668
    271 | 692138 |   847803 |   1094432 |     1128971
    307 | 682919 |   849239 |   1086306 |     1127051
    353 | 679449 |   842125 |   1071482 |     1117471
    397 | 676217 |   844015 |   1058937 |     1118628

2 sockets:
  conns | master | patch_v4 | master 1G | patch_v4 1G
--------+--------+----------+-----------+-------------
      1 |  44317 |    44034 |     53920 |       53583
      2 |  81193 |    78621 |     99138 |       97968
      3 | 120755 |   115648 |    148102 |      147423
  5

Re: BufferAlloc: don't take two simultaneous locks

2022-02-27 Thread Yura Sokolov
On Fri, 25/02/2022 at 09:01 -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-02-25 12:51:22 +0300, Yura Sokolov wrote:
> > > > +* The usage_count starts out at 1 so that the buffer can 
> > > > survive one
> > > > +* clock-sweep pass.
> > > > +*
> > > > +* We use direct atomic OR instead of Lock+Unlock since no 
> > > > other backend
> > > > +* could be interested in the buffer. But StrategyGetBuffer,
> > > > +* Flush*Buffers, Drop*Buffers are scanning all buffers and 
> > > > locks them to
> > > > +* compare tag, and UnlockBufHdr does raw write to state. So we 
> > > > have to
> > > > +* spin if we found buffer locked.
> > > 
> > > So basically the first half of of the paragraph is wrong, because no, we
> > > can't?
> > 
> > Logically, there are no backends that could be interesting in the buffer.
> > Physically they do LockBufHdr/UnlockBufHdr just to check they are not 
> > interesting.
> 
> Yea, but that's still being interested in the buffer...
> 
> 
> > > > +* Note that we write tag unlocked. It is also safe since there 
> > > > is always
> > > > +* check for BM_VALID when tag is compared.
> > > >  */
> > > > buf->tag = newTag;
> > > > -   buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > > > -  BM_CHECKPOINT_NEEDED | BM_IO_ERROR | 
> > > > BM_PERMANENT |
> > > > -  BUF_USAGECOUNT_MASK);
> > > > if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == 
> > > > INIT_FORKNUM)
> > > > -   buf_state |= BM_TAG_VALID | BM_PERMANENT | 
> > > > BUF_USAGECOUNT_ONE;
> > > > +   new_bits = BM_TAG_VALID | BM_PERMANENT | 
> > > > BUF_USAGECOUNT_ONE;
> > > > else
> > > > -   buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > > -
> > > > -   UnlockBufHdr(buf, buf_state);
> > > > +   new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > > >  
> > > > -   if (oldPartitionLock != NULL)
> > > > +   buf_state = pg_atomic_fetch_or_u32(>state, new_bits);
> > > > +   while (unlikely(buf_state & BM_LOCKED))
> > > 
> > > I don't think it's safe to atomic in arbitrary bits. If somebody else has
> > > locked the buffer header in this moment, it'll lead to completely bogus
> > > results, because unlocking overwrites concurrently written contents (which
> > > there shouldn't be any, but here there are)...
> > 
> > That is why there is safety loop in the case buf->state were locked just
> > after first optimistic atomic_fetch_or. 99.999% times this loop will not
> > have a job. But in case other backend did lock buf->state, loop waits
> > until it releases lock and retry atomic_fetch_or.
> > > And or'ing contents in also doesn't make sense because we it doesn't work 
> > > to
> > > actually unset any contents?
> > 
> > Sorry, I didn't understand sentence :((
> 
> You're OR'ing multiple bits into buf->state. LockBufHdr() only ORs in
> BM_LOCKED. ORing BM_LOCKED is fine:
> Either the buffer is not already locked, in which case it just sets the
> BM_LOCKED bit, acquiring the lock. Or it doesn't change anything, because
> BM_LOCKED already was set.
> 
> But OR'ing in multiple bits is *not* fine, because it'll actually change the
> contents of ->state while the buffer header is locked.

First, both states are valid: before the atomic OR and after it.
Second, there are no checks of buffer->state while the buffer header is locked.
All LockBufHdr users use the result of LockBufHdr. (I just checked that.)

> > > Why don't you just use LockBufHdr/UnlockBufHdr?
> > 
> > This pair makes two atomic writes to memory. Two writes are heavier than
> > one write in this version (if optimistic case succeed).
> 
> UnlockBufHdr doesn't use a locked atomic op. It uses a write barrier and an
> unlocked write.

A write barrier is not free on any platform.

Well, while I don't see a problem with modifying buffer->state, there is a
problem with modifying buffer->tag: I missed that Drop*Buffers doesn't check
the BM_TAG_VALID flag. Therefore I either have to add this check to those
places, or return to the LockBufHdr+UnlockBufHdr pair.

For patch simplicity I'll return to the Lock+UnlockBufHdr pair. But it has a
measurable impact at low connection numbers on many-socket machines.

> 
> Greetings,
> 
> Andres Freund





Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Yura Sokolov
Hello, Andres

On Fri, 25/02/2022 at 00:04 -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> > From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> > From: Yura Sokolov 
> > Date: Mon, 21 Feb 2022 08:49:03 +0300
> > Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> > 
> > Acquiring two partition locks leads to complex dependency chain that hurts
> > at high concurrency level.
> > 
> > There is no need to hold both lock simultaneously. Buffer is pinned so
> > other processes could not select it for eviction. If tag is cleared and
> > buffer removed from old partition other processes will not find it.
> > Therefore it is safe to release old partition lock before acquiring
> > new partition lock.
> 
> Yes, the current design is pretty nonsensical. It leads to really absurd stuff
> like holding the relation extension lock while we write out old buffer
> contents etc.
> 
> 
> 
> > +* We have pinned buffer and we are single pinner at the moment so there
> > +* is no other pinners.
> 
> Seems redundant.
> 
> 
> > We hold buffer header lock and exclusive partition
> > +* lock if tag is valid. Given these statements it is safe to clear tag
> > +* since no other process can inspect it to the moment.
> > +*/
> 
> Could we share code with InvalidateBuffer here? It's not quite the same code,
> but nearly the same.
> 
> 
> > +* The usage_count starts out at 1 so that the buffer can survive one
> > +* clock-sweep pass.
> > +*
> > +* We use direct atomic OR instead of Lock+Unlock since no other backend
> > +* could be interested in the buffer. But StrategyGetBuffer,
> > +* Flush*Buffers, Drop*Buffers are scanning all buffers and locks them 
> > to
> > +* compare tag, and UnlockBufHdr does raw write to state. So we have to
> > +* spin if we found buffer locked.
> 
> So basically the first half of of the paragraph is wrong, because no, we
> can't?

Logically, there are no backends that could be interested in the buffer.
Physically, they do LockBufHdr/UnlockBufHdr just to check that they are not
interested.

> > +* Note that we write tag unlocked. It is also safe since there is 
> > always
> > +* check for BM_VALID when tag is compared.
> 
> 
> >  */
> > buf->tag = newTag;
> > -   buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > -  BM_CHECKPOINT_NEEDED | BM_IO_ERROR | 
> > BM_PERMANENT |
> > -  BUF_USAGECOUNT_MASK);
> > if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == 
> > INIT_FORKNUM)
> > -   buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > +   new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > else
> > -   buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > -
> > -   UnlockBufHdr(buf, buf_state);
> > +   new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> >  
> > -   if (oldPartitionLock != NULL)
> > +   buf_state = pg_atomic_fetch_or_u32(>state, new_bits);
> > +   while (unlikely(buf_state & BM_LOCKED))
> 
> I don't think it's safe to atomic in arbitrary bits. If somebody else has
> locked the buffer header in this moment, it'll lead to completely bogus
> results, because unlocking overwrites concurrently written contents (which
> there shouldn't be any, but here there are)...

That is why there is a safety loop for the case when buf->state was locked
just after the first optimistic atomic_fetch_or. 99.999% of the time this
loop will have no job. But in case another backend did lock buf->state, the
loop waits until it releases the lock and retries the atomic_fetch_or.

> And or'ing contents in also doesn't make sense because we it doesn't work to
> actually unset any contents?

Sorry, I didn't understand that sentence :((

> Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if the optimistic case succeeds).

But I thought to use Lock+UnlockBufHdr instead of the safety loop:

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
if (unlikely(buf_state & BM_LOCKED))
{
    buf_state = LockBufHdr(&buf->state);
    UnlockBufHdr(&buf->state, buf_state | new_bits);
}

I agree the code is cleaner this way. Will do in the next version.

-

regards,
Yura Sokolov





Re: BufferAlloc: don't take two simultaneous locks

2022-02-25 Thread Yura Sokolov
Hello, Simon.

On Fri, 25/02/2022 at 04:35, Simon Riggs wrote:
> On Mon, 21 Feb 2022 at 08:06, Yura Sokolov  wrote:
> > Good day, Kyotaro Horiguchi and hackers.
> > 
> > On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov 
> > >  wrote in
> > > > Hello, all.
> > > > 
> > > > I thought about patch simplification, and tested version
> > > > without BufTable and dynahash api change at all.
> > > > 
> > > > It performs suprisingly well. It is just a bit worse
> > > > than v1 since there is more contention around dynahash's
> > > > freelist, but most of improvement remains.
> > > > 
> > > > I'll finish benchmarking and will attach graphs with
> > > > next message. Patch is attached here.
> > > 
> > > Thanks for the new patch.  The patch as a whole looks fine to me. But
> > > some comments needs to be revised.
> > 
> > Thank you for review and remarks.
> 
> v3 gets the buffer partition locking right, well done, great results!
> 
> In v3, the comment at line 1279 still implies we take both locks
> together, which is not now the case.
> 
> Dynahash actions are still possible. You now have the BufTableDelete
> before the BufTableInsert, which opens up the possibility I discussed
> here:
> http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=rdddqusxmtft6jprdx...@mail.gmail.com
> (Apologies for raising a similar topic, I hadn't noticed this thread
> before; thanks to Horiguchi-san for pointing this out).
> 
> v1 had a horrible API (sorry!) where you returned the entry and then
> explicitly re-used it. I think we *should* make changes to dynahash,
> but not with the API you proposed.
> 
> Proposal for new BufTable API
> BufTableReuse() - similar to BufTableDelete() but does NOT put entry
> back on freelist, we remember it in a private single item cache in
> dynahash
> BufTableAssign() - similar to BufTableInsert() but can only be
> executed directly after BufTableReuse(), fails with ERROR otherwise.
> Takes the entry from single item cache and re-assigns it to new tag
> 
> In dynahash we have two new modes that match the above
> HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
> places entry on the single item cache, avoiding freelist
> HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
> uses the entry from the single item cache, rather than asking freelist
> This last call can fail if someone else already inserted the tag, in
> which case it adds the single item cache entry back onto freelist
> 
> Notice that single item cache is not in shared memory, so on abort we
> should give it back, so we probably need an extra API call for that
> also to avoid leaking an entry.

Why is there a need for this? In which way could a backend be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but maybe I'm missing
something.

> 
> Doing it this way allows us to
> * avoid touching freelists altogether in the common path - we know we
> are about to reassign the entry, so we do remember it - no contention
> from other backends, no borrowing etc..
> * avoid sharing the private details outside of the dynahash module
> * allows us to use the same technique elsewhere that we have
> partitioned hash tables
> 
> This approach is cleaner than v1, but should also perform better
> because there will be a 1:1 relationship between a buffer and its
> dynahash entry, most of the time.

Thank you for the suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it into the v4 patch? Perhaps it could be a separate
commit in the new version.


> 
> With these changes, I think we will be able to *reduce* the number of
> freelists for partitioned dynahash from 32 to maybe 8, as originally
> speculated by Robert in 2016:
>    
> https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
> since the freelists will be much less contended with the above approach
> 
> It would be useful to see performance with a higher number of connections, 
> >400.
> 
> --
> Simon Riggshttp://www.EnterpriseDB.com/

--

regards,
Yura Sokolov





Re: Accommodate startup process in a separate ProcState array slot instead of in MaxBackends slots.

2022-02-21 Thread Yura Sokolov
On Sat, 12/02/2022 at 16:56 +0530, Bharath Rupireddy wrote:
> On Fri, Feb 11, 2022 at 7:56 PM Yura Sokolov  wrote:
> > On Sat, 16/10/2021 at 16:37 +0530, Bharath Rupireddy wrote:
> > > On Thu, Oct 14, 2021 at 10:56 AM Fujii Masao
> > >  wrote:
> > > > On 2021/10/12 15:46, Bharath Rupireddy wrote:
> > > > > On Tue, Oct 12, 2021 at 5:37 AM Fujii Masao 
> > > > >  wrote:
> > > > > > On 2021/10/12 4:07, Bharath Rupireddy wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > While working on [1], it is found that currently the ProcState 
> > > > > > > array
> > > > > > > doesn't have entries for auxiliary processes, it does have 
> > > > > > > entries for
> > > > > > > MaxBackends. But the startup process is eating up one slot from
> > > > > > > MaxBackends. We need to increase the size of the ProcState array 
> > > > > > > by 1
> > > > > > > at least for the startup process. The startup process uses 
> > > > > > > ProcState
> > > > > > > slot via 
> > > > > > > InitRecoveryTransactionEnvironment->SharedInvalBackendInit.
> > > > > > > The procState array size is initialized to MaxBackends in
> > > > > > > SInvalShmemSize.
> > > > > > > 
> > > > > > > The consequence of not fixing this issue is that the database may 
> > > > > > > hit
> > > > > > > the error "sorry, too many clients already" soon in
> > > > > > > SharedInvalBackendInit.
> > > > 
> > > > On second thought, I wonder if this error could not happen in practice. 
> > > > No?
> > > > Because autovacuum doesn't work during recovery and the startup process
> > > > can safely use the ProcState entry for autovacuum worker process.
> > > > Also since the minimal allowed value of autovacuum_max_workers is one,
> > > > the ProcState array guarantees to have at least one entry for 
> > > > autovacuum worker.
> > > > 
> > > > If this understanding is right, we don't need to enlarge the array and
> > > > can just update the comment. I don't strongly oppose to enlarge
> > > > the array in the master, but I'm not sure it's worth doing that
> > > > in back branches if the issue can cause no actual error.
> > > 
> > > Yes, the issue can't happen. The comment in the SInvalShmemSize,
> > > mentioning about the startup process always having an extra slot
> > > because the autovacuum worker is not active during recovery, looks
> > > okay. But, is it safe to assume that always? Do we have a way to
> > > specify that in the form an Assert(when_i_am_startup_proc &&
> > > autovacuum_not_running) (this looks a bit dirty though)? Instead, we
> > > can just enlarge the array in the master and be confident about the
> > > fact that the startup process always has one dedicated slot.
> > 
> > But this slot wont be used for most of cluster life. It will be just
> > waste.
> 
> Correct. In the standby autovacuum launcher and worker are not started
> so, the startup process will always have a slot free for it to use.
> 
> > And `Assert(there_is_startup_proc && autovacuum_not_running)` has
> > value on its own, hasn't it? So why doesn't add it with comment.
> 
> Assertion doesn't make sense to me now. Because the postmaster ensures
> that the autovacuum launcher/workers will not get started in standby
> mode and we can't reliably know in InitRecoveryTransactionEnvironment
> (startup process) whether or not autovacuum launcher process has been
> started.
> 
> FWIW, here's a patch just adding a comment on how the startup process
> can get a free procState array slot even when SInvalShmemSize hasn't
> accounted for it.

I think the comment is a good thing.
Marked as "Ready for committer".

> 
> Regards,
> Bharath Rupireddy.





Re: BufferAlloc: don't take two simultaneous locks

2022-02-21 Thread Yura Sokolov
Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov  
> wrote in 
> > Hello, all.
> > 
> > I thought about patch simplification, and tested version
> > without BufTable and dynahash api change at all.
> > 
> > It performs suprisingly well. It is just a bit worse
> > than v1 since there is more contention around dynahash's
> > freelist, but most of improvement remains.
> > 
> > I'll finish benchmarking and will attach graphs with
> > next message. Patch is attached here.
> 
> Thanks for the new patch.  The patch as a whole looks fine to me. But
> some comments needs to be revised.

Thank you for review and remarks.

> 
> (existing comments)
> > * To change the association of a valid buffer, we'll need to have
> > * exclusive lock on both the old and new mapping partitions.
> ...
> > * Somebody could have pinned or re-dirtied the buffer while we were
> > * doing the I/O and making the new hashtable entry.  If so, we can't
> > * recycle this buffer; we must undo everything we've done and start
> > * over with a new victim buffer.
> 
> We no longer take a lock on the new partition and have no new hash
> entry (if others have not yet done) at this point.

fixed

> +* Clear out the buffer's tag and flags.  We must do this to ensure 
> that
> +* linear scans of the buffer array don't think the buffer is valid. 
> We
> 
> The reason we can clear out the tag is it's safe to use the victim
> buffer at this point. This comment needs to mention that reason.

Tried to describe.

> +*
> +* Since we are single pinner, there should no be PIN_COUNT_WAITER or
> +* IO_IN_PROGRESS (flags that were not cleared in previous code).
> +*/
> +   Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
> 
> It seems like to be a test for potential bugs in other functions.  As
> the comment is saying, we are sure that no other processes are pinning
> the buffer and the existing code doesn't seem to be care about that
> condition.  Is it really needed?

Ok, I agree this check is excessive.
These two flags were not cleared in the previous code, and I didn't get
why. Probably it is just a historical accident.

> 
> +   /*
> +* Try to make a hashtable entry for the buffer under its new tag. 
> This
> +* could fail because while we were writing someone else allocated 
> another
> 
> The most significant point of this patch is the reason that the victim
> buffer is protected from stealing until it is set up for new tag. I
> think we need an explanation about the protection here.

I don't clearly get what you mean :( . I would appreciate your
suggestion for this comment.

> 
> 
> +* buffer for the same block we want to read in. Note that we have 
> not yet
> +* removed the hashtable entry for the old tag.
> 
> Since we have removed the hash table entry for the old tag at this
> point, the comment got wrong.

Thanks, changed.

> +* the first place.  First, give up the buffer we were 
> planning to use
> +* and put it to free lists.
> ..
> +   StrategyFreeBuffer(buf);
> 
> This is one downside of this patch. But it seems to me that the odds
> are low that many buffers are freed in a short time by this logic.  By
> the way it would be better if the sentence starts with "First" has a
> separate comment section.

Split the comment.

> (existing comment)
> |* Okay, it's finally safe to rename the buffer.
> 
> We don't "rename" the buffer here.  And the safety is already
> establishsed at the end of the oldPartitionLock section. So it would
> be just something like "Now allocate the victim buffer for the new
> tag"?

Changed to "Now it is safe to use victim buffer for new tag."


There is also a tiny code change at block-reuse finalization: instead
of LockBufHdr+UnlockBufHdr I use a single atomic_fetch_or protected
with WaitBufHdrUnlocked. I've tried to explain its safety. Please
check it.
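
The pattern, as far as I can sketch it from the hunks in this thread
(WaitBufHdrUnlocked is the existing spin-wait helper in bufmgr.c; the exact
retry shape is a paraphrase, so please check it against the attached patch):

new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;   /* plus BM_PERMANENT if needed */

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
while (unlikely(buf_state & BM_LOCKED))
{
    /* someone held the header lock when we OR'ed; its UnlockBufHdr will
     * overwrite state, so wait for the release and re-apply our bits */
    buf_state = WaitBufHdrUnlocked(buf);
    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
}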


Benchmarks:
- base point is 6ce16088bfed97f9.
- notebook with i7-1165G7 and server with Xeon 8354H (1&2 sockets)
- pgbench select only scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB.
- enabled hugepages
- second best result from five runs

Notebook:
  conns | master | patch_v3 | master 1G | patch_v3 1G
--------+--------+----------+-----------+-------------
      1 |  29508 |    29481 |     31774 |       32305
      2 |  57139 |    56694 |     63393 |       62968
      3 |  89759 |    90861 |    101873 |      102399
  5 |   

Re: BufferAlloc: don't take two simultaneous locks

2022-02-15 Thread Yura Sokolov
Hello, all.

I thought about patch simplification and tested a version
without any BufTable and dynahash API changes at all.

It performs surprisingly well. It is just a bit worse
than v1 since there is more contention around dynahash's
freelist, but most of the improvement remains.

I'll finish benchmarking and will attach graphs with
the next message. The patch is attached here.

--

regards,
Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From 7f430bdaa748456ed6b59f16f32ac0ea55644a66 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Fri, 14 Jan 2022 02:28:36 +0300
Subject: [PATCH v2] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to complex dependency chain that hurts
at high concurrency level.

There is no need to hold both lock simultaneously. Buffer is pinned so
other processes could not select it for eviction. If tag is cleared and
buffer removed from old partition other processes will not find it.
Therefore it is safe to release old partition lock before acquiring
new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 179 +---
 1 file changed, 82 insertions(+), 97 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..abb916938a7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1288,93 +1288,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode();
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-/* only one partition, only one lock */
-LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-oldPartitionLock != newPartitionLock)
-LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-/*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed.  We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID.  StartBufferIO does it all.
- */
-if (StartBufferIO(buf, true))
-{
-	/*
-	 * If we get here, previous attempts to read the buffer
-	 * must have failed ... but we shall bravely try again.
-	 */
-	*foundPtr = false;
-}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1391,31 +1314,100 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *

Re: BufferAlloc: don't take two simultaneous locks

2022-02-15 Thread Yura Sokolov
On Sun, 06/02/2022 at 19:34 +0300, Michail Nikolaev wrote:
> Hello, Yura.
> 
> A one additional moment:
> 
> > 1332: Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
> > 1333: CLEAR_BUFFERTAG(buf->tag);
> > 1334: buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> > 1335: UnlockBufHdr(buf, buf_state);
> 
> I think there is no sense to unlock buffer here because it will be
> locked after a few moments (and no one is able to find it somehow). Of
> course, it should be unlocked in case of collision.

UnlockBufHdr actually writes buf_state. Until it is called, the buffer
is in an intermediate state and is, effectively, still locked.

We have to write the state with BM_TAG_VALID cleared before we
call BufTableDelete and release oldPartitionLock, to maintain
consistency.

Perhaps this could be cheated and there would be no harm in skipping the
state write at this point, but I'm not confident enough to do that.
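
To make the intended ordering explicit, here is a rough sketch (not the
exact patched bufmgr.c code; variable names are only illustrative):

    buf_state = LockBufHdr(buf);
    CLEAR_BUFFERTAG(buf->tag);
    buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
    UnlockBufHdr(buf, buf_state);      /* cleared state becomes visible */

    BufTableDelete(&oldTag, oldHash);  /* nobody can look the buffer up now */
    LWLockRelease(oldPartitionLock);   /* only then give up the old partition */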

> 
> BTW, I still think is better to introduce some kind of
> hash_update_hash_key and use it.
> 
> It may look like this:
> 
> // should be called with oldPartitionLock acquired
> // newPartitionLock hold on return
> // oldPartitionLock and newPartitionLock are not taken at the same time
> // if newKeyPtr is present - existingEntry is removed
> bool hash_update_hash_key_or_remove(
>   HTAB *hashp,
>   void *existingEntry,
>   const void *newKeyPtr,
>   uint32 newHashValue,
>   LWLock *oldPartitionLock,
>   LWLock *newPartitionLock
> );

Interesting suggestion, thanks. I'll think about it.
It has the downside of bringing LWLock knowledge into dynahash.c,
but otherwise it looks smart.
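
For the record, a rough sketch of how such a function might be used from
BufferAlloc (purely hypothetical, following the signature above; the exact
return-value semantics are my assumption):

    /* entered with oldPartitionLock held and the victim buffer pinned */
    if (hash_update_hash_key_or_remove(SharedBufHash,
                                       oldEntry,
                                       &newTag, newHash,
                                       oldPartitionLock,
                                       newPartitionLock))
    {
        /* entry was rekeyed into the new partition;
         * newPartitionLock is held on return */
    }
    else
    {
        /* an entry for newTag already existed: the old entry was removed
         * instead, so treat it as a collision (newPartitionLock held) */
    }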

-

regards,
Yura Sokolov





Re: Error "initial slot snapshot too large" in create replication slot

2022-02-13 Thread Yura Sokolov
On Mon, 07/02/2022 at 13:52 +0530, Dilip Kumar wrote:
> On Mon, Jan 31, 2022 at 11:50 AM Kyotaro Horiguchi
>  wrote:
> > At Mon, 17 Jan 2022 09:27:14 +0530, Dilip Kumar  
> > wrote in
> > 
> > me> Mmm. The size of the array cannot be larger than the numbers the
> > me> *Connt() functions return.  Thus we cannot attach the oversized array
> > me> to ->subxip.  (I don't recall clearly but that would lead to assertion
> > me> failure somewhere..)
> > 
> > Then, I fixed the v3 error and post v4.
> 
> Yeah you are right, SetTransactionSnapshot() has that assertion.
> Anyway after looking again it appears that
> GetMaxSnapshotSubxidCount is the correct size because this is
> PGPROC_MAX_CACHED_SUBXIDS +1, i.e. it considers top transactions as
> well so we don't need to add them separately.
> 
> > SnapBUildInitialSnapshot tries to store XIDS of both top and sub
> > transactions into snapshot->xip array but the array is easily
> > overflowed and CREATE_REPLICATOIN_SLOT command ends with an error.
> > 
> > To fix this, this patch is doing the following things.
> > 
> > - Use subxip array instead of xip array to allow us have larger array
> >   for xids.  So the snapshot is marked as takenDuringRecovery, which
> >   is a kind of abuse but largely reduces the chance of getting
> >   "initial slot snapshot too large" error.
> 
> Right. I think the patch looks fine to me.
> 

Good day.

I've looked at the patch. Personally I'd prefer to resize the xip array
dynamically, but I think there would be an upgrade issue if the replica
source is upgraded before the destination, right?

Concerning the patch, I think more comments should be written about the new
usage of `takenDuringRecovery`. Maybe this field should even be renamed?

And there are checks of `takenDuringRecovery` in `heapgetpage` and
`heapam_scan_sample_next_tuple`. Are these checks affected by the change?
Neither the preceding discussion nor the commit message answers that.
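
For reference, the kind of check I mean (paraphrased from memory, not an
exact quote of heapam.c):

    all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;

i.e. a snapshot marked takenDuringRecovery forces per-tuple visibility
checks even when the page is marked all-visible.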

---

regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com





Re: Accommodate startup process in a separate ProcState array slot instead of in MaxBackends slots.

2022-02-11 Thread Yura Sokolov
On Sat, 16/10/2021 at 16:37 +0530, Bharath Rupireddy wrote:
> On Thu, Oct 14, 2021 at 10:56 AM Fujii Masao
>  wrote:
> > On 2021/10/12 15:46, Bharath Rupireddy wrote:
> > > On Tue, Oct 12, 2021 at 5:37 AM Fujii Masao  
> > > wrote:
> > > > On 2021/10/12 4:07, Bharath Rupireddy wrote:
> > > > > Hi,
> > > > > 
> > > > > While working on [1], it is found that currently the ProcState array
> > > > > doesn't have entries for auxiliary processes, it does have entries for
> > > > > MaxBackends. But the startup process is eating up one slot from
> > > > > MaxBackends. We need to increase the size of the ProcState array by 1
> > > > > at least for the startup process. The startup process uses ProcState
> > > > > slot via InitRecoveryTransactionEnvironment->SharedInvalBackendInit.
> > > > > The procState array size is initialized to MaxBackends in
> > > > > SInvalShmemSize.
> > > > > 
> > > > > The consequence of not fixing this issue is that the database may hit
> > > > > the error "sorry, too many clients already" soon in
> > > > > SharedInvalBackendInit.
> > 
> > On second thought, I wonder if this error could not happen in practice. No?
> > Because autovacuum doesn't work during recovery and the startup process
> > can safely use the ProcState entry for autovacuum worker process.
> > Also since the minimal allowed value of autovacuum_max_workers is one,
> > the ProcState array guarantees to have at least one entry for autovacuum 
> > worker.
> > 
> > If this understanding is right, we don't need to enlarge the array and
> > can just update the comment. I don't strongly oppose to enlarge
> > the array in the master, but I'm not sure it's worth doing that
> > in back branches if the issue can cause no actual error.
> 
> Yes, the issue can't happen. The comment in the SInvalShmemSize,
> mentioning about the startup process always having an extra slot
> because the autovacuum worker is not active during recovery, looks
> okay. But, is it safe to assume that always? Do we have a way to
> specify that in the form an Assert(when_i_am_startup_proc &&
> autovacuum_not_running) (this looks a bit dirty though)? Instead, we
> can just enlarge the array in the master and be confident about the
> fact that the startup process always has one dedicated slot.

But this slot won't be used for most of the cluster's life. It would just
be wasted.

And `Assert(there_is_startup_proc && autovacuum_not_running)` has
value on its own, doesn't it? So why not add it along with the comment?

regards,
Yura Sokolov





Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-25 Thread Yura Sokolov
On Tue, 25/01/2022 at 21:20 +1300, David Rowley wrote:
> On Tue, 25 Jan 2022 at 20:03, David Rowley  wrote:
> > On Tue, 25 Jan 2022 at 17:35, Yura Sokolov  wrote:
> > > And another attempt to fix tests volatility.
> > 
> > FWIW, I had not really seen the point in adding a test for this.   I
> > did however see a point in it with your original patch. It seemed
> > useful there to verify that Gather and GatherMerge did what we
> > expected with 1 worker.
> 
> I ended up pushing just the last patch I sent.
> 
> The reason I didn't think it was worth adding a new test was that no
> tests were added in the original commit.  Existing tests did cover it,

The existing tests didn't catch the issue. It is a pity the fix was merged
without a test case covering it.

> but here we're just restoring the original behaviour for one simple
> case.  The test in your patch just seemed a bit more hassle than it
> was worth. I struggle to imagine how we'll break this again.

Thank you for attention and for fix.

regards,
Yura Sokolov.





Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-24 Thread Yura Sokolov
On Mon, 24/01/2022 at 16:24 +0300, Yura Sokolov wrote:
> On Sun, 23/01/2022 at 14:56 +0300, Yura Sokolov wrote:
> > On Thu, 20/01/2022 at 09:32 +1300, David Rowley wrote:
> > > On Fri, 31 Dec 2021 at 00:14, Yura Sokolov  
> > > wrote:
> > > > Suggested quick (and valid) fix in the patch attached:
> > > > - If Append has single child, then copy its parallel awareness.
> > > 
> > > I've been looking at this and I've gone through changing my mind about
> > > what's the right fix quite a number of times.
> > > 
> > > My current thoughts are that I don't really like the fact that we can
> > > have plans in the following shape:
> > > 
> > >  Finalize Aggregate
> > >->  Gather
> > >  Workers Planned: 1
> > >  ->  Partial Aggregate
> > >->  Parallel Hash Left Join
> > >  Hash Cond: (gather_append_1.fk = gather_append_2.fk)
> > >  ->  Index Scan using gather_append_1_ix on 
> > > gather_append_1
> > >Index Cond: (f = true)
> > >  ->  Parallel Hash
> > >->  Parallel Seq Scan on gather_append_2
> > > 
> > > It's only made safe by the fact that Gather will only use 1 worker.
> > > To me, it just seems too fragile to assume that's always going to be
> > > the case. I feel like this fix just relies on the fact that
> > > create_gather_path() and create_gather_merge_path() do
> > > "pathnode->num_workers = subpath->parallel_workers;". If someone
> > > decided that was to work a different way, then we risk this breaking
> > > again. Additionally, today we have Gather and GatherMerge, but we may
> > > one day end up with more node types that gather results from parallel
> > > workers, or even a completely different way of executing plans.
> > 
> > It seems strange parallel_aware and parallel_safe flags neither affect
> > execution nor are properly checked.
> > 
> > Except parallel_safe is checked in ExecSerializePlan which is called from
> > ExecInitParallelPlan, which is called from ExecGather and ExecGatherMerge.
> > But looks like this check doesn't affect execution as well.
> > 
> > > I think a safer way to fix this is to just not remove the
> > > Append/MergeAppend node if the parallel_aware flag of the only-child
> > > and the Append/MergeAppend don't match. I've done that in the
> > > attached.
> > > 
> > > I believe the code at the end of add_paths_to_append_rel() can remain as 
> > > is.
> > 
> > I found clean_up_removed_plan_level also called from 
> > set_subqueryscan_references.
> > Is there a need to patch there as well?
> > 
> > And there is strange state:
> > - in the loop by subpaths, pathnode->node.parallel_safe is set to AND of
> >   all its subpath's parallel_safe
> >   (therefore there were need to copy it in my patch version),
> > - that means, our AppendPath is parallel_aware but not parallel_safe.
> > It is ridiculous a bit.
> > 
> > And it is strange AppendPath could have more parallel_workers than sum of
> > its children parallel_workers.
> > 
> > So it looks like whole machinery around parallel_aware/parallel_safe has
> > no enough consistency.
> > 
> > Either way, I attach you version of fix with my tests as new patch version.
> 
> Looks like volatile "Memory Usage:" in EXPLAIN brokes 'make check'
> sporadically.
> 
> Applied replacement in style of memoize.sql test.
> 
> Why there is no way to disable "Buckets: %d Buffers: %d Memory Usage: %dkB"
> output in show_hash_info?

And here is another attempt to fix the tests' volatility.
From fb09491a401f0df828faf6088158f431b2a69381 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Sun, 23 Jan 2022 14:53:21 +0300
Subject: [PATCH v4] Fix duplicate result rows after Append path removal.

It can happen that an Append path is created with the "parallel_aware" flag
set, but its single child is not parallel aware. The Append path's parent
(Gather or Gather Merge) thinks its child is parallel_aware, but after the
Append path is removed, Gather's child is no longer parallel_aware. Then,
when Gather/Gather Merge decides to run the child in several workers, or with
worker plus leader participation, it gathers duplicate result rows from the
multiple child path invocations.

To fix this, don't remove the Append/MergeAppend node if its parallel_aware
flag differs from its single child's.

Authors: David Rowley, Sokolov Yura.
---
 src/backend/optimizer/plan/setrefs.c  |  24 +++-

Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-24 Thread Yura Sokolov
On Sun, 23/01/2022 at 14:56 +0300, Yura Sokolov wrote:
> On Thu, 20/01/2022 at 09:32 +1300, David Rowley wrote:
> > On Fri, 31 Dec 2021 at 00:14, Yura Sokolov  wrote:
> > > Suggested quick (and valid) fix in the patch attached:
> > > - If Append has single child, then copy its parallel awareness.
> > 
> > I've been looking at this and I've gone through changing my mind about
> > what's the right fix quite a number of times.
> > 
> > My current thoughts are that I don't really like the fact that we can
> > have plans in the following shape:
> > 
> >  Finalize Aggregate
> >->  Gather
> >  Workers Planned: 1
> >  ->  Partial Aggregate
> >->  Parallel Hash Left Join
> >  Hash Cond: (gather_append_1.fk = gather_append_2.fk)
> >  ->  Index Scan using gather_append_1_ix on 
> > gather_append_1
> >Index Cond: (f = true)
> >  ->  Parallel Hash
> >->  Parallel Seq Scan on gather_append_2
> > 
> > It's only made safe by the fact that Gather will only use 1 worker.
> > To me, it just seems too fragile to assume that's always going to be
> > the case. I feel like this fix just relies on the fact that
> > create_gather_path() and create_gather_merge_path() do
> > "pathnode->num_workers = subpath->parallel_workers;". If someone
> > decided that was to work a different way, then we risk this breaking
> > again. Additionally, today we have Gather and GatherMerge, but we may
> > one day end up with more node types that gather results from parallel
> > workers, or even a completely different way of executing plans.
> 
> It seems strange parallel_aware and parallel_safe flags neither affect
> execution nor are properly checked.
> 
> Except parallel_safe is checked in ExecSerializePlan which is called from
> ExecInitParallelPlan, which is called from ExecGather and ExecGatherMerge.
> But looks like this check doesn't affect execution as well.
> 
> > I think a safer way to fix this is to just not remove the
> > Append/MergeAppend node if the parallel_aware flag of the only-child
> > and the Append/MergeAppend don't match. I've done that in the
> > attached.
> > 
> > I believe the code at the end of add_paths_to_append_rel() can remain as is.
> 
> I found clean_up_removed_plan_level also called from 
> set_subqueryscan_references.
> Is there a need to patch there as well?
> 
> And there is strange state:
> - in the loop by subpaths, pathnode->node.parallel_safe is set to AND of
>   all its subpath's parallel_safe
>   (therefore there were need to copy it in my patch version),
> - that means, our AppendPath is parallel_aware but not parallel_safe.
> It is ridiculous a bit.
> 
> And it is strange AppendPath could have more parallel_workers than sum of
> its children parallel_workers.
> 
> So it looks like whole machinery around parallel_aware/parallel_safe has
> no enough consistency.
> 
> Either way, I attach you version of fix with my tests as new patch version.

It looks like the volatile "Memory Usage:" output in EXPLAIN breaks
'make check' sporadically.

I applied a replacement in the style of the memoize.sql test.

Why is there no way to disable the "Buckets: %d Buffers: %d Memory Usage: %dkB"
output in show_hash_info?

regards,
Yura Sokolov
From 9ed2139495b2026433ff5e7c4092fcfa8f10e4d1 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Sun, 23 Jan 2022 14:53:21 +0300
Subject: [PATCH v3] Fix duplicate result rows after Append path removal.

It can happen that an Append path is created with the "parallel_aware" flag
set, but its single child is not parallel aware. The Append path's parent
(Gather or Gather Merge) thinks its child is parallel_aware, but after the
Append path is removed, Gather's child is no longer parallel_aware. Then,
when Gather/Gather Merge decides to run the child in several workers, or with
worker plus leader participation, it gathers duplicate result rows from the
multiple child path invocations.

To fix this, don't remove the Append/MergeAppend node if its parallel_aware
flag differs from its single child's.

Authors: David Rowley, Sokolov Yura.
---
 src/backend/optimizer/plan/setrefs.c  |  24 ++-
 .../expected/gather_removed_append.out| 154 ++
 src/test/regress/parallel_schedule|   1 +
 .../regress/sql/gather_removed_append.sql | 102 
 4 files changed, 277 insertions(+), 4 deletions(-)
 create mode 100644 src/test/regress/expected/gather_removed_append.out
 create mode 100644 src/test/regress/sql/gather_removed_append.sql

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e44ae971b4b..a

Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-23 Thread Yura Sokolov
On Thu, 20/01/2022 at 09:32 +1300, David Rowley wrote:
> On Fri, 31 Dec 2021 at 00:14, Yura Sokolov  wrote:
> > Suggested quick (and valid) fix in the patch attached:
> > - If Append has single child, then copy its parallel awareness.
> 
> I've been looking at this and I've gone through changing my mind about
> what's the right fix quite a number of times.
> 
> My current thoughts are that I don't really like the fact that we can
> have plans in the following shape:
> 
>  Finalize Aggregate
>->  Gather
>  Workers Planned: 1
>  ->  Partial Aggregate
>->  Parallel Hash Left Join
>  Hash Cond: (gather_append_1.fk = gather_append_2.fk)
>  ->  Index Scan using gather_append_1_ix on 
> gather_append_1
>Index Cond: (f = true)
>  ->  Parallel Hash
>->  Parallel Seq Scan on gather_append_2
> 
> It's only made safe by the fact that Gather will only use 1 worker.
> To me, it just seems too fragile to assume that's always going to be
> the case. I feel like this fix just relies on the fact that
> create_gather_path() and create_gather_merge_path() do
> "pathnode->num_workers = subpath->parallel_workers;". If someone
> decided that was to work a different way, then we risk this breaking
> again. Additionally, today we have Gather and GatherMerge, but we may
> one day end up with more node types that gather results from parallel
> workers, or even a completely different way of executing plans.

It seems strange that the parallel_aware and parallel_safe flags neither
affect execution nor are properly checked.

The exception is that parallel_safe is checked in ExecSerializePlan, which is
called from ExecInitParallelPlan, which in turn is called from ExecGather and
ExecGatherMerge. But it looks like that check doesn't affect execution either.

> 
> I think a safer way to fix this is to just not remove the
> Append/MergeAppend node if the parallel_aware flag of the only-child
> and the Append/MergeAppend don't match. I've done that in the
> attached.
> 
> I believe the code at the end of add_paths_to_append_rel() can remain as is.

I found that clean_up_removed_plan_level is also called from
set_subqueryscan_references.
Is there a need to patch that place as well?

And there is a strange state:
- in the loop over subpaths, pathnode->node.parallel_safe is set to the AND of
  all its subpaths' parallel_safe
  (which is why my patch version needed to copy it),
- which means our AppendPath can be parallel_aware but not parallel_safe.
That is a bit odd.

It is also strange that an AppendPath can have more parallel_workers than the
sum of its children's parallel_workers.

So it looks like the whole machinery around parallel_aware/parallel_safe is
not consistent enough.

Either way, I attach your version of the fix with my tests as a new patch
version.

regards,
Yura Sokolov
From 359df37ae76170a4621cafd3ad8b318473c94a46 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Sun, 23 Jan 2022 14:53:21 +0300
Subject: [PATCH v2] Fix duplicate result rows after Append path removal.

It can happen that an Append path is created with the "parallel_aware" flag
set, but its single child is not parallel aware. The Append path's parent
(Gather or Gather Merge) thinks its child is parallel_aware, but after the
Append path is removed, Gather's child is no longer parallel_aware. Then,
when Gather/Gather Merge decides to run the child in several workers, or with
worker plus leader participation, it gathers duplicate result rows from the
multiple child path invocations.

To fix this, don't remove the Append/MergeAppend node if its parallel_aware
flag differs from its single child's.

Authors: David Rowley, Sokolov Yura.
---
 src/backend/optimizer/plan/setrefs.c  |  24 +++-
 .../expected/gather_removed_append.out| 135 ++
 src/test/regress/parallel_schedule|   1 +
 .../regress/sql/gather_removed_append.sql |  82 +++
 4 files changed, 238 insertions(+), 4 deletions(-)
 create mode 100644 src/test/regress/expected/gather_removed_append.out
 create mode 100644 src/test/regress/sql/gather_removed_append.sql

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e44ae971b4b..a7b11b7f03a 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1512,8 +1512,16 @@ set_append_references(PlannerInfo *root,
 		lfirst(l) = set_plan_refs(root, (Plan *) lfirst(l), rtoffset);
 	}
 
-	/* Now, if there's just one, forget the Append and return that child */
-	if (list_length(aplan->appendplans) == 1)
+	/*
+	 * See if it's safe to get rid of the Append entirely.  For this to be
+	 * safe, there must be only one child plan and that child plan's parallel
+	 * awareness must match that of the Append's.  The reason for the latter
+	 * is that the if the Append is par

Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-16 Thread Yura Sokolov
On Sat, 01/01/2022 at 15:19 +1300, David Rowley wrote:
> On Fri, 31 Dec 2021 at 00:14, Yura Sokolov  wrote:
> > Problem:
> > - Append path is created with explicitely parallel_aware = true
> > - It has two child, one is trivial, other is parallel_aware = false .
> >   Trivial child is dropped.
> > - Gather/GatherMerge path takes Append path as a child and thinks
> >   its child is parallel_aware = true.
> > - But Append path is removed at the last since it has only one child.
> > - Now Gather/GatherMerge thinks its child is parallel_aware, but it
> >   is not.
> >   Gather/GatherMerge runs its child twice: in a worker and in a leader,
> >   and gathers same rows twice.
> 
> Thanks for the report. I can confirm that I can recreate the problem
> with your script.
> 
> I will look into this further later next week.
> 

Good day, David.

Sorry to disturb you.
Is there any update on this?
Is there any chance it could be fixed in the next minor release?
Could this simple fix be merged before further improvements?

Yura.





Fix BUG #17335: Duplicate result rows in Gather node

2021-12-30 Thread Yura Sokolov
Good day, hackers.

Problem:
- An Append path is created with parallel_aware = true set explicitly.
- It has two children: one is trivial, the other has parallel_aware = false.
  The trivial child is dropped.
- A Gather/GatherMerge path takes the Append path as a child and thinks
  its child has parallel_aware = true.
- But the Append path is removed at the end, since it has only one child.
- Now Gather/GatherMerge thinks its child is parallel_aware, but it
  is not.
  Gather/GatherMerge runs its child twice, in a worker and in the leader,
  and gathers the same rows twice.

Reproduction code is attached (repro.sql; it is included as a test as well).

A suggested quick (and valid) fix is in the attached patch:
- If the Append has a single child, then copy its parallel awareness.

The bug was introduced with commit 8edd0e79460b414b1d971895312e549e95e12e4f
"Suppress Append and MergeAppend plan nodes that have a single child."

During the discussion it was suggested [1] that those fields should be copied:

> I haven't looked into whether this does the right things for parallel
> planning --- possibly create_[merge]append_path need to propagate up
> parallel-related path fields from the single child?

But it was not so obvious [2].

A better fix could also remove the Gather/GatherMerge node if
its child is not parallel aware.

The bug is reported at
https://postgr.es/m/flat/17335-4dc92e1aea3a78af%40postgresql.org
Since there is no way to add a thread from pgsql-bugs to the commitfest, I am
writing here.

[1] https://postgr.es/m/17500.1551669976%40sss.pgh.pa.us
[2] 
https://postgr.es/m/CAKJS1f_Wt_tL3S32R3wpU86zQjuHfbnZbFt0eqm%3DqcRFcdbLvw%40mail.gmail.com

 
regards
Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From 47c6e161de4fc9d2d6eff45f427ebf49b4c9d11c Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Mon, 20 Dec 2021 11:48:10 +0300
Subject: [PATCH v1] Quick fix to duplicate result rows after Append path
 removal.

It can happen that an Append path is created with the "parallel_aware" flag
set, but its single child is not parallel aware. The Append path's parent
(Gather or Gather Merge) thinks its child is parallel_aware, but after the
Append path is removed, Gather's child is no longer parallel_aware. Then,
when Gather/Gather Merge decides to run the child in several workers, or with
worker plus leader participation, it gathers duplicate result rows from the
multiple child path invocations.

With this fix the Append path copies its single child's parallel_aware /
cost / workers values.

A copied `num_workers == 0` triggers the assert `num_workers > 0` in the
cost_gather_merge function, but changing this assert to `num_workers >= 0`
doesn't lead to any runtime and/or logical error.

Fixes bug 17335
https://postgr.es/m/flat/17335-4dc92e1aea3a78af%40postgresql.org
---
 src/backend/optimizer/path/costsize.c |   2 +-
 src/backend/optimizer/util/pathnode.c |   3 +
 .../expected/gather_removed_append.out| 131 ++
 src/test/regress/parallel_schedule|   1 +
 .../regress/sql/gather_removed_append.sql |  82 +++
 5 files changed, 218 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/gather_removed_append.out
 create mode 100644 src/test/regress/sql/gather_removed_append.sql

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1e4d404f024..9949c3ab555 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -440,7 +440,7 @@ cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
 	 * be overgenerous since the leader will do less work than other workers
 	 * in typical cases, but we'll go with it for now.
 	 */
-	Assert(path->num_workers > 0);
+	Assert(path->num_workers >= 0);
 	N = (double) path->num_workers + 1;
 	logN = LOG2(N);
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index af5e8df26b4..2ff4678937a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1340,6 +1340,9 @@ create_append_path(PlannerInfo *root,
 		pathnode->path.startup_cost = child->startup_cost;
 		pathnode->path.total_cost = child->total_cost;
 		pathnode->path.pathkeys = child->pathkeys;
+		pathnode->path.parallel_aware = child->parallel_aware;
+		pathnode->path.parallel_safe = child->parallel_safe;
+		pathnode->path.parallel_workers = child->parallel_workers;
 	}
 	else
 		cost_append(pathnode);
diff --git a/src/test/regress/expected/gather_removed_append.out b/src/test/regress/expected/gather_removed_append.out
new file mode 100644
index 000..f6e861ce59d
--- /dev/null
+++ b/src/test/regress/expected/gather_removed_append.out
@@ -0,0 +1,131 @@
+-- Test correctness of parallel query execution after removal
+-- of Append path due to single non-trivial child.
+DROP TABLE IF EXISTS gather_append_1, gather_append_2;
+NOTICE:  table "gather_append_1" does not exist, skipping
+NOTICE:  table "gather_ap

Re: BufferAlloc: don't take two simultaneous locks

2021-12-20 Thread Yura Sokolov
On Sat, 02/10/2021 at 01:25 +0300, Yura Sokolov wrote:
> Good day.
> 
> I found some opportunity in Buffer Manager code in BufferAlloc
> function:
> - When valid buffer is evicted, BufferAlloc acquires two partition
> lwlocks: for partition for evicted block is in and partition for new
> block placement.
> 
> It doesn't matter if there is small number of concurrent replacements.
> But if there are a lot of concurrent backends replacing buffers,
> complex dependency net quickly arose.
> 
> It could be easily seen with select-only pgbench with scale 100 and
> shared buffers 128MB: scale 100 produces 1.5GB tables, and it certainly
> doesn't fit shared buffers. This way performance starts to degrade at
> ~100 connections. Even with shared buffers 1GB it slowly degrades after
> 150 connections. 
> 
> But strictly speaking, there is no need to hold both lock
> simultaneously. Buffer is pinned so other processes could not select it
> for eviction. If tag is cleared and buffer removed from old partition
> then other processes will not find it. Therefore it is safe to release
> old partition lock before acquiring new partition lock.
> 
> If other process concurrently inserts same new buffer, then old buffer
> is placed to bufmanager's freelist.
> 
> Additional optimisation: in case of old buffer is reused, there is no
> need to put its BufferLookupEnt into dynahash's freelist. That reduces
> lock contention a bit more. To acomplish this FreeListData.nentries is
> changed to pg_atomic_u32/pg_atomic_u64 and atomic increment/decrement
> is used.
> 
> Remark: there were bug in the `hash_update_hash_key`: nentries were not
> kept in sync if freelist partitions differ. This bug were never
> triggered because single use of `hash_update_hash_key` doesn't move
> entry between partitions.
> 
> There is some tests results.
> 
> - pgbench with scale 100 were tested with --select-only (since we want
> to test buffer manager alone). It produces 1.5GB table.
> - two shared_buffers values were tested: 128MB and 1GB.
> - second best result were taken among five runs
> 
> Test were made in three system configurations:
> - notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
> - Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads).
> - same Xeon X5675 but restricted to single socket
>   (with numactl -m 0 -N 0)
> 
> Results for i7-1165G7:
> 
>   conns | master |patched |  master 1G | patched 1G 
> ++++
>   1 |  29667 |  29079 |  29425 |  29411 
>   2 |  55577 |  3 |  57974 |  57223 
>   3 |  87393 |  87924 |  87246 |  89210 
>   5 | 136222 | 136879 | 133775 | 133949 
>   7 | 179865 | 176734 | 178297 | 175559 
>  17 | 215953 | 214708 | 222908 | 223651 
>  27 | 211162 | 213014 | 220506 | 219752 
>  53 | 211620 | 218702 | 220906 | 225218 
>  83 | 213488 | 221799 | 219075 | 228096 
> 107 | 212018 | 222110 | 222502 | 227825 
> 139 | 207068 | 220812 | 218191 | 226712 
> 163 | 203716 | 220793 | 213498 | 226493 
> 191 | 199248 | 217486 | 210994 | 221026 
> 211 | 195887 | 217356 | 209601 | 219397 
> 239 | 193133 | 215695 | 209023 | 218773 
> 271 | 190686 | 213668 | 207181 | 219137 
> 307 | 188066 | 214120 | 205392 | 218782 
> 353 | 185449 | 213570 | 202120 | 217786 
> 397 | 182173 | 212168 | 201285 | 216489 
> 
> Results for 1 socket X5675
> 
>   conns | master |patched |  master 1G | patched 1G 
> ++++
>   1 |  16864 |  16584 |  17419 |  17630 
>   2 |  32764 |  32735 |  34593 |  34000 
>   3 |  47258 |  46022 |  49570 |  47432 
>   5 |  64487 |  64929 |  68369 |  68885 
>   7 |  81932 |  82034 |  87543 |  87538 
>  17 | 114502 | 114218 | 127347 | 127448 
>  27 | 116030 | 115758 | 130003 | 128890 
>  53 | 116814 | 117197 | 131142 | 131080 
>  83 | 114438 | 116704 | 130198 | 130985 
> 107 | 113255 | 116910 | 129932 | 131468 
> 139 | 111577 | 116929 | 129012 | 131782 
> 163 | 110477 | 116818 | 128628 | 131697 
> 191 | 109237 | 116672 | 127833 | 131586 
> 211 | 108248 | 116396 | 127

Re: XTS cipher mode for cluster file encryption

2021-10-26 Thread Yura Sokolov
On Tue, 26/10/2021 at 11:08 +0800, Sasasu wrote:
> On 2021/10/26 04:32, Yura Sokolov wrote:
> > And among others Adiantum looks best: it is fast even without hardware
> > acceleration,
> 
> No, AES is fast on modern high-end hardware.
> 
> on X86 AMD 3700X
> type  1024 bytes  8192 bytes   16384 bytes
> aes-128-ctr   8963982.50k 11124613.88k 11509149.42k
> aes-128-gcm   3978860.44k 4669417.10k  4732070.64k
> aes-128-xts   7776628.39k 9073664.63k  9264617.74k
> chacha20-poly1305 2043729.73k 2131296.36k  2141002.10k
> 
> on ARM RK3399, A53 middle-end with AES-NI
> type  1024 bytes   8192 bytes   16384 bytes
> aes-128-ctr   1663857.66k  1860930.22k  1872991.57k
> aes-128-xts   685086.38k   712906.07k   716073.64k
> aes-128-gcm   985578.84k   1054818.30k  1056768.00k
> chacha20-poly1305 309012.82k   318889.98k   319711.91k
> 
> I think the baseline is the speed when using read(2) syscall on 
> /dev/zero (which is 3.6GiB/s, on ARM is 980MiB/s)
> chacha is fast on the low-end arm, but I haven't seen any HTTPS sites 
> using chacha, including Cloudflare and Google.

1. ChaCha20-Poly1305 includes an authentication code (Poly1305),
   and AES-GCM includes one as well (GCM),
   but AES-128-CTR/XTS do not.
   Therefore plain ChaCha should be compared with CTR/XTS, not ChaCha-Poly1305.
2. ChaCha20 has a security margin of about 2.8x: only 7 of its 20 rounds are
   broken.
   AES-128 has a security margin of about 1.4x: 7 of its 10 rounds are broken.
   That is why Adiantum uses ChaCha12: it is still "more secure" than AES-128.

Yes, AES with AES-NI is the fastest, but not by that much.

Also, AES-CTR could easily be used instead of ChaCha12 in Adiantum.
Adiantum uses ChaCha12 as a stream cipher, and any other stream cipher would
work as well with minor modifications to the algorithm.

> 
> On 2021/10/26 04:32, Yura Sokolov wrote:
>  >> That sounds like a great thing to think about adding ... after we get
>  >> something in that's based on XTS.
>  > Why? I see no points to do it after. Why not XTS after Adiantum?
>  >
>  > Ok, I see one: XTS is standartized.
> :>
> PostgreSQL even not discuss single-table key rotation or remote KMS.
> I think it's too hard to use an encryption algorithm which openssl 
> doesn't implement.

That is a fair argument. But, again, OpenSSL could be used for the
primitives: AES + AES-CTR + Poly1305/GCM. An Adiantum-like construction could
be composed from them quite easily.





Re: XTS cipher mode for cluster file encryption

2021-10-25 Thread Yura Sokolov
On Mon, 25/10/2021 at 12:12 -0400, Stephen Frost wrote:
> Greetings,
> 
> * Yura Sokolov (y.soko...@postgrespro.ru) wrote:
> > On Thu, 21/10/2021 at 13:28 -0400, Stephen Frost wrote:
> > > I really don't think this is necessary.  Similar to PageSetChecksumCopy
> > > and PageSetChecksumInplace, I'm sure we would have functions which are
> > > called in the appropriate spots to do encryption (such as 'encrypt_page'
> > > and 'encrypt_block' in the Cybertec patch) and folks could review those
> > > in relative isolation to the rest.  Dealing with blocks in PG is already
> > > pretty well handled, the infrastructure that needs to be added is around
> > > handling temporary files and is being actively worked on ... if we could
> > > move past this debate around if we should be adding support for XTS or
> > > if only GCM-SIV would be accepted.
> > > 
> > > .
> > > 
> > > No, the CTR approach isn't great because, as has been discussed quite a
> > > bit already, using the LSN as the IV means that different data might be
> > > re-encrypted with the same LSN and that's not an acceptable thing to
> > > have happen with CTR.
> > > 
> > > .
> > > 
> > > We've discussed at length how using CTR for heap isn't a good idea even
> > > if we're using the LSN for the IV, while if we use XTS then we don't
> > > have the issues that CTR has with IV re-use and using the LSN (plus
> > > block number and perhaps other things).  Nothing in what has been
> > > discussed here has really changed anything there that I can see and so
> > > it's unclear to me why we continue to go round and round with it.
> > > 
> > 
> > Instead of debatting XTS vs GCM-SIV I'd suggest Google's Adiantum [1][2]
> > [3][4].
> 
> That sounds like a great thing to think about adding ... after we get
> something in that's based on XTS.

Why? I see no point in doing it afterwards. Why not XTS after Adiantum?

OK, I see one reason: XTS is standardized.

> > It is explicitely created to solve large block encryption issue - disk
> > encryption. It is used to encrypt 4kb at whole, but in fact has no
> > (practical) limit on block size: it is near-zero modified to encrypt 1kb
> > or 8kb or 32kb.
> > 
> > It has benefits of both XTS and GCM-SIV:
> > - like GCM-SIV every bit of cipher text depends on every bit of plain text
> > - therefore like GCM-SIV it is resistant to IV reuse: it is safe to reuse
> >   LSN+reloid+blocknumber tuple as IV even for hint-bit changes since every
> >   block's bit will change.
> 
> The advantage of GCM-SIV is that it provides integrity as well as
> confidentiality.

Integrity could be based on a simple non-cryptographic checksum, checked
after decryption. It would be impossible to intentionally change an encrypted
page in a way that still passes the checksum after decryption.

Currently we have a 16-bit checksum, and it is very small. But having a larger
checksum is orthogonal to (i.e. not bound to) having encryption.

In fact, Adiantum is easily made close to an SIV construction:
- just leave the last 8/16 bytes zero. If they are zero after decryption,
  then the integrity check passes.
That is because SIV and Adiantum are very similar in structure:
- SIV:
-- hash
-- then stream cipher
- Adiantum:
-- hash (except the last 16 bytes)
-- then encrypt the last 16 bytes with the hash,
-- then stream cipher
-- then hash.
If the last N (N > 16) bytes are nonce + zero bytes, then "hash, then encrypt
the last 16 bytes with the hash" becomes equivalent to just "hash", and
Adiantum becomes a logical equivalent of SIV.
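
To make the zero-padding idea concrete, here is a toy sketch in C
(encrypt_page/decrypt_page are placeholders for whatever wide-block cipher,
e.g. Adiantum over the whole page, would be used; they are assumptions, not
an existing API):

    #define INTEGRITY_PAD 16

    static void
    encrypt_with_pad(uint8 *page, size_t len)
    {
        /* reserve the tail as an all-zero integrity pad, then encrypt */
        memset(page + len - INTEGRITY_PAD, 0, INTEGRITY_PAD);
        encrypt_page(page, len);
    }

    static bool
    decrypt_and_check(uint8 *page, size_t len)
    {
        static const uint8 zeroes[INTEGRITY_PAD] = {0};

        /* decrypt, then verify the pad survived: any tampering with the
         * ciphertext scrambles the whole block, including the pad */
        decrypt_page(page, len);
        return memcmp(page + len - INTEGRITY_PAD, zeroes, INTEGRITY_PAD) == 0;
    }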

> > - like XTS it doesn't need to change plain text format and doesn't need in
> >   additional Nonce/Auth Code.
> 
> Sure, in which case it's something that could potentially be added later
> as another option in the future.  I don't think we'll always have just
> one encryption method and it's good to generally think about what it
> might look like to have others but I don't think it makes sense to try
> and get everything in all at once.

And among the alternatives Adiantum looks best: it is fast even without
hardware acceleration, it provides whole-block encryption (i.e. every output
bit depends on every input bit), and it is not bound to the plain-text format.

> Thanks,
> 
> Stephen

regards,

Yura

PS. The construction behind SIV and Adiantum could be used with XTS as well,
i.e. instead of AES-GCM-SIV it could be AES-XTS-SIV.
In the same way, AES-XTS could be used instead of ChaCha12 in Adiantum.





Re: XTS cipher mode for cluster file encryption

2021-10-25 Thread Yura Sokolov
On Thu, 21/10/2021 at 13:28 -0400, Stephen Frost wrote:
> Greetings,
> 
> I really don't think this is necessary.  Similar to PageSetChecksumCopy
> and PageSetChecksumInplace, I'm sure we would have functions which are
> called in the appropriate spots to do encryption (such as 'encrypt_page'
> and 'encrypt_block' in the Cybertec patch) and folks could review those
> in relative isolation to the rest.  Dealing with blocks in PG is already
> pretty well handled, the infrastructure that needs to be added is around
> handling temporary files and is being actively worked on ... if we could
> move past this debate around if we should be adding support for XTS or
> if only GCM-SIV would be accepted.
> 
> .
> 
> No, the CTR approach isn't great because, as has been discussed quite a
> bit already, using the LSN as the IV means that different data might be
> re-encrypted with the same LSN and that's not an acceptable thing to
> have happen with CTR.
> 
> .
> 
> We've discussed at length how using CTR for heap isn't a good idea even
> if we're using the LSN for the IV, while if we use XTS then we don't
> have the issues that CTR has with IV re-use and using the LSN (plus
> block number and perhaps other things).  Nothing in what has been
> discussed here has really changed anything there that I can see and so
> it's unclear to me why we continue to go round and round with it.
> 

Instead of debating XTS vs GCM-SIV, I'd suggest Google's Adiantum [1][2]
[3][4].

It was explicitly created to solve the large-block encryption problem, i.e.
disk encryption. It is used to encrypt 4kB blocks as a whole, but in fact it
has no (practical) limit on block size: it needs near-zero modification to
encrypt 1kB, 8kB or 32kB.

It has the benefits of both XTS and GCM-SIV:
- like GCM-SIV, every bit of the ciphertext depends on every bit of the
  plaintext;
- therefore, like GCM-SIV, it is resistant to IV reuse: it is safe to reuse
  the LSN+reloid+blocknumber tuple as the IV even for hint-bit changes, since
  every bit of the block will change;
- like XTS, it doesn't need to change the plain-text format and doesn't need
  an additional nonce/auth code.

Adiantum stands on "giants' shoulders": AES, ChaCha and Poly1305.
It has been included in the Linux kernel since 5.0.

The Adiantum/HPolyC approach (hash + cipher + stream cipher + hash) could be
used with other primitives as well. For example, ChaCha12 could be replaced
with AES-GCM, or with AES-XTS using an IV derived from the hash + cipher.

[1] 
https://security.googleblog.com/2019/02/introducing-adiantum-encryption-for.html
[2] https://en.wikipedia.org/wiki/Adiantum_(cipher)
[3] https://tosc.iacr.org/index.php/ToSC/article/view/7360
[4] https://github.com/google/adiantum
[5] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=059c2a4d8e164dccc3078e49e7f286023b019a98

---

regards
Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com





Re: Double partition lock in bufmgr

2021-10-11 Thread Yura Sokolov
On Fri, 18/12/2020 at 15:20 +0300, Konstantin Knizhnik wrote:
> Hi hackers,
> 
> I am investigating incident with one of out customers: performance of 
> the system isdropped dramatically.
> Stack traces of all backends can be found here: 
> http://www.garret.ru/diag_20201217_102056.stacks_59644
> (this file is 6Mb so I have not attached it to this mail).
> 
> What I have see in this stack traces is that 642 backends and blocked
> in 
> LWLockAcquire,
> mostly in obtaining shared buffer lock:
> 
> #0  0x7f0e7fe7a087 in semop () from /lib64/libc.so.6
> #1  0x00682fb1 in PGSemaphoreLock 
> (sema=sema@entry=0x7f0e1c1f63a0) at pg_sema.c:387
> #2  0x006ed60b in LWLockAcquire (lock=lock@entry=0x7e8b6176d80
> 0, 
> mode=mode@entry=LW_SHARED) at lwlock.c:1338
> #3  0x006c88a7 in BufferAlloc (foundPtr=0x7ffcc3c8de9b
> "\001", 
> strategy=0x0, blockNum=997, forkNum=MAIN_FORKNUM, relpersistence=112 
> 'p', smgr=0x2fb2df8) at bufmgr.c:1177
> #4  ReadBuffer_common (smgr=0x2fb2df8, relpersistence= out>, 
> relkind=, forkNum=forkNum@entry=MAIN_FORKNUM, 
> blockNum=blockNum@entry=997, mode=RBM_NORMAL, strategy=0x0, 
> hit=hit@entry=0x7ffcc3c8df97 "") at bufmgr.c:894
> #5  0x006c928b in ReadBufferExtended (reln=0x32c7ed0, 
> forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=997, 
> mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at
> bufmgr.c:753
> #6  0x006c93ab in ReadBuffer (blockNum=, 
> reln=) at bufmgr.c:685
> ...
> 
> Only 11 locks from this 642 are unique.
> Moreover: 358 backends are waiting for one lock and 183 - for another.
> 
> There are two backends (pids 291121 and 285927) which are trying to 
> obtain exclusive lock while already holding another exclusive lock.
> And them block all other backends.
> 
> This is single place in bufmgr (and in postgres) where process tries
> to 
> lock two buffers:
> 
>  /*
>   * To change the association of a valid buffer, we'll need to
> have
>   * exclusive lock on both the old and new mapping partitions.
>   */
>  if (oldFlags & BM_TAG_VALID)
>  {
>  ...
>  /*
>   * Must lock the lower-numbered partition first to avoid
>   * deadlocks.
>   */
>  if (oldPartitionLock < newPartitionLock)
>  {
>  LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
>  LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
>  }
>  else if (oldPartitionLock > newPartitionLock)
>  {
>  LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
>  LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
>  }
> 
> This two backends are blocked in the second lock request.
> I read all connects in bufmgr.c and README file but didn't find 
> explanation why do we need to lock both partitions.
> Why it is not possible first free old buffer (as it is done in 
> InvalidateBuffer) and then repeat attempt to allocate the buffer?
> 
> Yes, it may require more efforts than just "gabbing" the buffer.
> But in this case there is no need to keep two locks.
> 
> I wonder if somebody in the past  faced with the similar symptoms and 
> was this problem with holding locks of two partitions in bufmgr
> already 
> discussed?

It looks like there is no real need for this double lock. And the change to
consecutive lock acquisition really does provide a scalability gain:
https://bit.ly/3AytNoN

regards
Sokolov Yura
y.soko...@postgrespro.ru
funny.fal...@gmail.com





Re: Speed up transaction completion faster after many relations are accessed in a transaction

2021-10-06 Thread Yura Sokolov
I've made some remarks in the related thread:
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05

The new status of this patch is: Waiting on Author


Re: Use simplehash.h instead of dynahash in SMgr

2021-10-06 Thread Yura Sokolov
Good day, David and all.

On Tue, 05/10/2021 at 11:07 +1300, David Rowley wrote:
> On Mon, 4 Oct 2021 at 20:37, Jaime Casanova
>  wrote:
> > Based on your comments I will mark this patch as withdrawn at midday
> > of
> > my monday unless someone objects to that.
> 
> I really think we need a hash table implementation that's faster than
> dynahash and supports stable pointers to elements (simplehash does not
> have stable pointers). I think withdrawing this won't help us move
> towards getting that.

I agree with you. I believe densehash could replace both dynahash and
simplehash. The shared-memory uses of dynahash should be reworked onto some
other, less dynamic hash structure. So there would be densehash for local
hashes and a statichash for static shared memory.

densehash's slight slowness compared to simplehash in some operations isn't
worth keeping simplehash alongside densehash.

> Thomas voiced his concerns here about having an extra hash table
> implementation and then also concerns that I've coded the hash table
> code to be fast to iterate over the hashed items.  To be honest, I
> think both Andres and Thomas must be misunderstanding the bitmap part.
> I get the impression that they both think the bitmap is solely there
> to make interations faster, but in reality it's primarily there as a
> compact freelist and can also be used to make iterations over sparsely
> populated tables fast. For the freelist we look for 0-bits, and we
> look for 1-bits during iteration.

I think this part is overengineered. More below.

> I think I'd much rather talk about the concerns here than just
> withdraw this. Even if what I have today just serves as something to
> aid discussion.
> 
> It would also be good to get the points Andres raised with me off-list
> on this thread.  I think his primary concern was that bitmaps are
> slow, but I don't really think maintaining full pointers into freed
> items is going to improve the performance of this.
> 
> David

First on "quirks" in the patch I was able to see:

DH_NEXT_ZEROBIT:

   DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
   DH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */

really should be

   DH_BITMAP_WORD mask = (~(DH_BITMAP_WORD) 0) << DH_BITNUM(prevbit);
   DH_BITMAP_WORD word = (~words[wordnum]) & mask; /* flip bits */

But it does no harm, because DH_NEXT_ZEROBIT is always called with
`prevbit = -1`, which is incremented to `0`. Therefore `mask` is always
`0x...ff`.

DH_INDEX_TO_ELEMENT

   /* ensure this segment is marked as used */
should be
   /* ensure this item is marked as used in the segment */

DH_GET_NEXT_UNUSED_ENTRY

   /* find the first segment with an unused item */
   while (seg != NULL && seg->nitems == DH_ITEMS_PER_SEGMENT)
   seg = tb->segments[++segidx];

There is no protection for `++segidx <= tb->nsegments`. I understand it
cannot happen, because `grow_threshold` is always less than
`nsegment * DH_ITEMS_PER_SEGMENT`, but at least a comment should be left
explaining why it is legal to omit the check.

Now some architecture notes:

I don't believe there is a need for a configurable DH_ITEMS_PER_SEGMENT. I
don't even believe it should be anything other than 16 (or 8). Then a segment
needs only one `used_items` word, which simplifies the code a lot.
There is not much difference in overhead between 1/16 and 1/256.

And then I believe a segment doesn't need both `nitems` and `used_items`.
The condition "segment is full" becomes simply `used_items == 0xffff`.

Next, I think it is better to maintain a real free list instead of looping to
search for a free segment, i.e. add a `uint32 DH_SEGMENT->next` field and
maintain a list starting from `first_free_segment` (see the sketch below).
If the concern is "allocate from lower-numbered segments first", then a
min-heap could be built instead. It is possible to create a very efficient
non-balanced "binary heap" with just two fields (`uint32 left, right`).
An algorithmic PoC in Ruby is attached.
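
A rough sketch of the free-list idea (names follow the patch's DH_ prefix
convention but are assumptions, not the actual patch code):

    /* in DH_SEGMENT: uint32 next; -- next segment that still has room */

    uint32      segidx = tb->first_free_segment;
    DH_SEGMENT *seg = tb->segments[segidx];

    /* ... take one item from seg, set its bit in seg->used_items ... */

    if (seg->used_items == DH_ALL_ITEMS_MASK)   /* segment became full */
        tb->first_free_segment = seg->next;     /* unlink it, O(1)     */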

There is also an allocation concern: AllocSet tends to allocate in
power-of-two sizes. Using power-of-two segments with a header
(nitems/used_items) will certainly waste 2x the space on every segment if the
element size is also a power of two, and a bit less for other element sizes.
There are two possible workarounds:
- make the segment hold slightly fewer items (15 elements instead of 16), or
- move the header from the segment itself into the `DH_TYPE->segments` array
  (sketched below).
I think the second option is preferable:
- `DH_TYPE->segments[x]` is inevitably accessed on every operation, so why
  not store some metadata there?
- if nitems/used_items live in `DH_TYPE->segments[x]`, then hash table
  iteration doesn't need a bitmap at all - there is no need for the
  `DH_TYPE->used_segments` bitmap. The absence of this bitmap reduces
  overhead on ordinary operations (insert/delete) as well.
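
A rough structural sketch of that second option (again, names are
assumptions following the DH_ prefix convention, not the patch's API):

    typedef struct DH_SEGMENT_SLOT
    {
        DH_SEGMENT *seg;        /* items only, no per-segment header     */
        uint16      used_items; /* bitmap of used items, 16 per segment  */
        uint16      next_free;  /* free-list link, if kept here as well  */
    } DH_SEGMENT_SLOT;

    /* DH_TYPE->segments becomes an array of DH_SEGMENT_SLOT rather than
     * bare pointers, so the metadata is read together with the pointer
     * that every operation dereferences anyway, and iteration can skip
     * empty segments without a separate used_segments bitmap. */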

I hope this was useful.

regards

Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com


heap.rb
Description: application/ruby


Re: BufferAlloc: don't take two simultaneous locks

2021-10-03 Thread Yura Sokolov
On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:
> 
> 
> On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov 
> wrote:
> > Good day.
> > 
> > I found some opportunity in Buffer Manager code in BufferAlloc
> > function:
> > - When valid buffer is evicted, BufferAlloc acquires two partition
> > lwlocks: for partition for evicted block is in and partition for new
> > block placement.
> > 
> > It doesn't matter if there is small number of concurrent
> > replacements.
> > But if there are a lot of concurrent backends replacing buffers,
> > complex dependency net quickly arose.
> > 
> > It could be easily seen with select-only pgbench with scale 100 and
> > shared buffers 128MB: scale 100 produces 1.5GB tables, and it
> > certainly
> > doesn't fit shared buffers. This way performance starts to degrade
> > at
> > ~100 connections. Even with shared buffers 1GB it slowly degrades
> > after
> > 150 connections. 
> > 
> > But strictly speaking, there is no need to hold both lock
> > simultaneously. Buffer is pinned so other processes could not select
> > it
> > for eviction. If tag is cleared and buffer removed from old
> > partition
> > then other processes will not find it. Therefore it is safe to
> > release
> > old partition lock before acquiring new partition lock.
> > 
> > If other process concurrently inserts same new buffer, then old
> > buffer
> > is placed to bufmanager's freelist.
> > 
> > Additional optimisation: in case of old buffer is reused, there is
> > no
> > need to put its BufferLookupEnt into dynahash's freelist. That
> > reduces
> > lock contention a bit more. To acomplish this FreeListData.nentries
> > is
> > changed to pg_atomic_u32/pg_atomic_u64 and atomic
> > increment/decrement
> > is used.
> > 
> > Remark: there were bug in the `hash_update_hash_key`: nentries were
> > not
> > kept in sync if freelist partitions differ. This bug were never
> > triggered because single use of `hash_update_hash_key` doesn't move
> > entry between partitions.
> > 
> > There is some tests results.
> > 
> > - pgbench with scale 100 were tested with --select-only (since we
> > want
> > to test buffer manager alone). It produces 1.5GB table.
> > - two shared_buffers values were tested: 128MB and 1GB.
> > - second best result were taken among five runs
> > 
> > Test were made in three system configurations:
> > - notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
> > - Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads).
> > - same Xeon X5675 but restricted to single socket
> >   (with numactl -m 0 -N 0)
> > 
> > Results for i7-1165G7:
> > 
> >   conns | master |patched |  master 1G | patched 1G 
> > ++++
> >   1 |  29667 |  29079 |  29425 |  29411 
> >   2 |  55577 |  3 |  57974 |  57223 
> >   3 |  87393 |  87924 |  87246 |  89210 
> >   5 | 136222 | 136879 | 133775 | 133949 
> >   7 | 179865 | 176734 | 178297 | 175559 
> >  17 | 215953 | 214708 | 222908 | 223651 
> >  27 | 211162 | 213014 | 220506 | 219752 
> >  53 | 211620 | 218702 | 220906 | 225218 
> >  83 | 213488 | 221799 | 219075 | 228096 
> > 107 | 212018 | 222110 | 222502 | 227825 
> > 139 | 207068 | 220812 | 218191 | 226712 
> > 163 | 203716 | 220793 | 213498 | 226493 
> > 191 | 199248 | 217486 | 210994 | 221026 
> > 211 | 195887 | 217356 | 209601 | 219397 
> > 239 | 193133 | 215695 | 209023 | 218773 
> > 271 | 190686 | 213668 | 207181 | 219137 
> > 307 | 188066 | 214120 | 205392 | 218782 
> > 353 | 185449 | 213570 | 202120 | 217786 
> > 397 | 182173 | 212168 | 201285 | 216489 
> > 
> > Results for 1 socket X5675
> > 
> >   conns | master |patched |  master 1G | patched 1G 
> > ++++
> >   1 |  16864 |  16584 |  17419 |  17630 
> >   2 |  32764 |  32735 |  34593 |  34000 
> >   3 |  47258 |  46022 |  49570 |  47432 
> >   5 |  64487 |  64929 |  68369 |  68885 
> >  

BufferAlloc: don't take two simultaneous locks

2021-10-01 Thread Yura Sokolov
 |  17598 
  2 |  30510 |  31431 |  33763 |  31690 
  3 |  45051 |  45834 |  48896 |  47991 
  5 |  71800 |  73208 |  78077 |  74714 
  7 |  89792 |  89980 |  95986 |  96662 
 17 | 178319 | 177979 | 195566 | 196143 
 27 | 210475 | 205209 | 226966 | 235249 
 53 | 222857 | 220256 | 252673 | 251041 
 83 | 219652 | 219938 | 250309 | 250464 
107 | 218468 | 219849 | 251312 | 251425 
139 | 210486 | 217003 | 250029 | 250695 
163 | 204068 | 218424 | 248234 | 252940 
191 | 200014 | 218224 | 246622 | 253331 
211 | 197608 | 218033 | 245331 | 253055 
239 | 195036 | 218398 | 243306 | 253394 
271 | 192780 | 217747 | 241406 | 253148 
307 | 189490 | 217607 | 239246 | 253373 
353 | 186104 | 216697 | 236952 | 253034 
397 | 183507 | 216324 | 234764 | 252872 

As can be seen, the patched version degrades much more slowly than master
(or even doesn't degrade at all with 1GB of shared buffers on the older
processor).

PS.

There is room for further improvements:
- the buffer manager's freelist could be partitioned
- dynahash's freelist could be sized/aligned to the CPU cache line
- in fact, there is no need for dynahash at all; it would be better to build
  a custom hash table using BufferDescs as the entries, since BufferDesc has
  spare space for a next link and a hash value.

regards,
Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From a1606eaa124fc497763ed5e28e22cbc8f6443b33 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Wed, 22 Sep 2021 13:10:37 +0300
Subject: [PATCH v0] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

This change requires manually returning the BufferDesc to the free list.

Insertion into and deletion from dynahash are also optimized by avoiding
unnecessary free-list manipulation in the common case (when the buffer is
reused).

A small, never-triggered bug in hash_update_hash_key is fixed as well.
---
 src/backend/storage/buffer/buf_table.c |  54 +++--
 src/backend/storage/buffer/bufmgr.c| 183 
 src/backend/utils/hash/dynahash.c  | 289 +++--
 src/include/storage/buf_internals.h|   6 +-
 src/include/utils/hsearch.h|  17 ++
 5 files changed, 404 insertions(+), 145 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..05e1dc9dd29 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -107,36 +107,29 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 
 /*
  * BufTableInsert
- *		Insert a hashtable entry for given tag and buffer ID,
- *		unless an entry already exists for that tag
- *
- * Returns -1 on successful insertion.  If a conflicting entry exists
- * already, returns the buffer ID in that entry.
+ *		Insert a hashtable entry for given tag and buffer ID.
+ *		Caller should be sure there is no conflicting entry.
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ * and call BufTableLookup to check for conflicting entry.
+ *
+ * If oldelem is passed it is reused.
  */
-int
-BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
+void
+BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id, void *oldelem)
 {
 	BufferLookupEnt *result;
-	bool		found;
 
 	Assert(buf_id >= 0);		/* -1 is reserved for not-in-table */
 	Assert(tagPtr->blockNum != P_NEW);	/* invalid tag */
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
-	(void *) tagPtr,
-	hashcode,
-	HASH_ENTER,
-	);
-
-	if (found)	/* found something already in the table */
-		return result->id;
+		hash_insert_with_hash_nocheck(SharedBufHash,
+	  (void *) tagPtr,
+	  hashcode,
+	  oldelem);
 
 	result->id = buf_id;
-
-	return -1;
 }
 
 /*
@@ -144,19 +137,32 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  *		Delete the hashtable entry for given tag (which must exist)
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ *
+ * Returns pointer to internal hashtable entry that should be passed either
+ * to BufTableInsert or BufTableFreeDeleted.
  */
-void
+void *
 BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 {
 	BufferLookupEnt *result;
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(S

Avoid dynahash's freelist in BufferAlloc.

2021-09-22 Thread Yura Sokolov
Good day.

I found that BufferAlloc unnecessarily goes through dynahash's freelist when
it reuses a valid buffer.

If that is avoided and the dynahash entry is moved directly, 1-2% is gained
in select-only pgbench (scale factor 100, 50 connections/50 threads on a
4-core/8-thread notebook CPU: 185 krps => 190 krps).

I've changed the speculative call to BufTableInsert into BufTableLookup to
avoid inserting too early (this also saves a BufTableDelete call if a
conflicting entry is already there). Then, if the buffer is valid and there
is no conflicting entry in dynahash, the old dynahash entry is moved directly
and without a check (since we already did the check).

If the old buffer was invalid, a new entry is unavoidably fetched from the
freelist and inserted (also without a check). But in steady state (if no
tables/indexes/databases are dropped or truncated) that is a rare case.

Regards,
Sokolov Yura @ Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From 2afbc5a59c3ccd3fec14105aff40252eeaacf40c Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Wed, 22 Sep 2021 13:10:37 +0300
Subject: [PATCH] More gentle dynahash usage in buffer allocation.

When BufferAlloc reuses a valid buffer, there is no need
to go through dynahash's free list.

---
 src/backend/storage/buffer/buf_table.c |  38 ++---
 src/backend/storage/buffer/bufmgr.c|   9 +-
 src/backend/utils/hash/dynahash.c  | 190 +
 src/include/storage/buf_internals.h|   5 +-
 src/include/utils/hsearch.h|   8 ++
 5 files changed, 230 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..382c84f5149 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -107,36 +107,42 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 
 /*
  * BufTableInsert
- *		Insert a hashtable entry for given tag and buffer ID,
- *		unless an entry already exists for that tag
- *
- * Returns -1 on successful insertion.  If a conflicting entry exists
- * already, returns the buffer ID in that entry.
+ *		Insert a hashtable entry for given tag and buffer ID.
+ *		Caller should be sure there is no conflicting entry.
  *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
+ * and call BufTableLookup to check for conflicting entry.
  */
-int
+void
 BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
 {
 	BufferLookupEnt *result;
-	bool		found;
 
 	Assert(buf_id >= 0);		/* -1 is reserved for not-in-table */
 	Assert(tagPtr->blockNum != P_NEW);	/* invalid tag */
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
-	(void *) tagPtr,
-	hashcode,
-	HASH_ENTER,
-	);
-
-	if (found)	/* found something already in the table */
-		return result->id;
+		hash_insert_with_hash_nocheck(SharedBufHash,
+	  (void *) tagPtr,
+	  hashcode);
 
 	result->id = buf_id;
+}
+
+void
+BufTableMove(BufferTag *oldTagPtr, uint32 oldHash,
+			 BufferTag *newTagPtr, uint32 newHash,
+			 int buf_id)
+{
+	BufferLookupEnt *result PG_USED_FOR_ASSERTS_ONLY;
 
-	return -1;
+	result = (BufferLookupEnt *)
+		hash_update_hash_key_with_hash_nocheck(SharedBufHash,
+			   (void *) oldTagPtr,
+			   oldHash,
+			   (void *) newTagPtr,
+			   newHash);
+	Assert(result->id == buf_id);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b0..8fbc676811c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1325,7 +1325,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Note that we have not yet removed the hashtable entry for the old
 		 * tag.
 		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+		buf_id = BufTableLookup(&newTag, newHash);
 
 		if (buf_id >= 0)
 		{
@@ -1391,7 +1391,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
 		if (oldPartitionLock != NULL &&
 			oldPartitionLock != newPartitionLock)
 			LWLockRelease(oldPartitionLock);
@@ -1425,10 +1424,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (oldPartitionLock != NULL)
 	{
-		BufTableDelete(&oldTag, oldHash);
+		BufTableMove(&oldTag, oldHash, &newTag, newHash, buf->buf_id);
 		if (oldPartitionLock != newPartitionLock)
 			LWLockRelease(oldPartitionLock);
 	}
+	else
+	{
+		BufTableInsert(&newTag, newHash, buf->buf_id);
+	}
 
 	LWLockRelease(newPartitionLock);
 
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c79..c0625927131 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -1288,6 +1288,196 @@ hash_update_hash_key(HTAB *hashp,
 	return true;
 }
 
+/*
+ * hash_update_hash_key_nocheck -- change the hash key of an existing table

jff: checksum algorithm is not as intended

2021-08-29 Thread Yura Sokolov

Good day.

The current checksum is not calculated the way it was intended and
has a flaw.

The single-round function is written as:

#define CHECKSUM_COMP(checksum, value) do {\
uint32 __tmp = (checksum) ^ (value);\
(checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17);\
} while (0)

It looks like it was intended to be
(checksum) = (__tmp * FNV_PRIME) ^ (__tmp >> 17);

At least the original suggestion from Florian Pflug was correctly written
this way (but with shift 1):
https://www.postgresql.org/message-id/99343716-5F5A-45C8-B2F6-74B9BA357396%40phlo.org

But due to C operator precedence it is actually calculated as:
(checksum) = __tmp * (FNV_PRIME ^ (__tmp >> 17));

It has longer collision chains and, worse, it has 256 pathological
result slots of the shape 0xXX00, each with about 517 collisions on average.
In total, 132352 __tmp values collide into these 256 slots.

That happens because if the top 16 bits happen to be
0x0326 or 0x0327, then `FNV_PRIME ^ (__tmp >> 17) == 0x100`,
and then `__tmp * 0x100` keeps only the lower 8 bits. That means the
9 bits masked by 0x0001ff00 are completely lost.

mix(0x03260001) == mix(0x03260101) == mix(0x0327aa01) == mix(0x0327ff01)
(where mix is the `__tmp` to `checksum` transformation)

regards,
Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com

PS. A test program in the Crystal language is attached, along with its output
for the current CHECKSUM_COMP implementation and for the "correct" (intended)
one. Excuse the Crystal; it is nicer for writing small compiled scripts.

- 17 items at 38 slots
- 16 items at 114 slots
- 15 items at 456 slots
- 14 items at 1690 slots
- 13 items at 3706 slots
- 12 items at 11128 slots
- 11 items at 30994 slots
- 10 items at 84940 slots
- 9 items at 259160 slots
- 8 items at 778102 slots
- 7 items at 2413624 slots
- 6 items at 8670652 slots
- 5 items at 27772912 slots
- 4 items at 93426142 slots
- 3 items at 288544882 slots
- 2 items at 752648630 slots
- 1 items at 1332584708 slots
- 0 items at 1787735418 slots
Pathological 513 collisions at slot 0000
Pathological 515 collisions at slot 0100
Pathological 517 collisions at slot 0200
Pathological 518 collisions at slot 0300
Pathological 516 collisions at slot 0400
Pathological 514 collisions at slot 0500
Pathological 516 collisions at slot 0600
Pathological 516 collisions at slot 0700
Pathological 515 collisions at slot 0800
Pathological 518 collisions at slot 0900
Pathological 516 collisions at slot 0a00
Pathological 516 collisions at slot 0b00
Pathological 517 collisions at slot 0c00
Pathological 515 collisions at slot 0d00
Pathological 516 collisions at slot 0e00
Pathological 514 collisions at slot 0f00
Pathological 515 collisions at slot 1000
Pathological 516 collisions at slot 1100
Pathological 515 collisions at slot 1200
Pathological 518 collisions at slot 1300
Pathological 514 collisions at slot 1400
Pathological 516 collisions at slot 1500
Pathological 515 collisions at slot 1600
Pathological 516 collisions at slot 1700
Pathological 515 collisions at slot 1800
Pathological 516 collisions at slot 1900
Pathological 515 collisions at slot 1a00
Pathological 519 collisions at slot 1b00
Pathological 516 collisions at slot 1c00
Pathological 517 collisions at slot 1d00
Pathological 515 collisions at slot 1e00
Pathological 517 collisions at slot 1f00
Pathological 514 collisions at slot 2000
Pathological 517 collisions at slot 2100
Pathological 517 collisions at slot 2200
Pathological 518 collisions at slot 2300
Pathological 515 collisions at slot 2400
Pathological 517 collisions at slot 2500
Pathological 519 collisions at slot 2600
Pathological 516 collisions at slot 2700
Pathological 514 collisions at slot 2800
Pathological 518 collisions at slot 2900
Pathological 515 collisions at slot 2a00
Pathological 516 collisions at slot 2b00
Pathological 517 collisions at slot 2c00
Pathological 514 collisions at slot 2d00
Pathological 516 collisions at slot 2e00
Pathological 517 collisions at slot 2f00
Pathological 515 collisions at slot 3000
Pathological 517 collisions at slot 3100
Pathological 518 collisions at slot 3200
Pathological 519 collisions at slot 3300
Pathological 515 collisions at slot 3400
Pathological 519 collisions at slot 3500
Pathological 517 collisions at slot 3600
Pathological 517 collisions at slot 3700
Pathological 516 collisions at slot 3800
Pathological 516 collisions at slot 3900
Pathological 519 collisions at slot 3a00
Pathological 517 collisions at slot 3b00
Pathological 517 collisions at slot 3c00
Pathological 516 collisions at slot 3d00
Pathological 517 collisions at slot 3e00
Pathological 518 collisions at slot 3f00
Pathological 514 collisions at slot 4000
Pathological 518 collisions at slot 4100
P

Re: Bug in huge simplehash

2021-08-13 Thread Yura Sokolov

Ranier Vilela wrote 2021-08-13 14:12:

Em sex., 13 de ago. de 2021 às 07:15, Andres Freund
 escreveu:


Hi,

On 2021-08-13 12:44:17 +0300, Yura Sokolov wrote:

Andres Freund wrote 2021-08-13 12:21:

Any chance you'd write a test for simplehash with such huge

amount of

values? It'd require a small bit of trickery to be practical. On

systems

with MAP_NORESERVE it should be feasible.


Which way C structures should be tested in postgres?
dynahash/simplhash - do they have any tests currently?
I'll do if hint is given.


We don't have a great way right now :(. I think the best is to have
a
SQL callable function that uses some API. See e.g. test_atomic_ops()
et
al in src/test/regress/regress.c


>  static inline void
> -SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
> +SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
>  {
>   uint64  size;
>
> @@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb,

uint32

> newsize)
>
>   /* now set size */
>   tb->size = size;
> -
> - if (tb->size == SH_MAX_SIZE)
> - tb->sizemask = 0;
> - else
> - tb->sizemask = tb->size - 1;
> + tb->sizemask = (uint32)(size - 1);

ISTM using ~0 would be nicer here?


I don't think so.
To be rigid it should be `~(uint32)0`.
But I believe it doesn't differ from `tb->sizemask = (uint32)(size

- 1)`

that is landed with patch, therefore why `if` is needed?


Personally I find it more obvious to understand the intended
behaviour
with ~0 (i.e. all bits set) than with a width truncation.


https://godbolt.org/z/57WcjKqMj
The generated code is identical.


I believe you mean https://godbolt.org/z/qWzE1ePTE


regards,
Ranier Vilela





Re: Bug in huge simplehash

2021-08-13 Thread Yura Sokolov

Andres Freund wrote 2021-08-13 12:21:

Hi,

On 2021-08-10 11:52:59 +0300, Yura Sokolov wrote:
- sizemask is set only in SH_COMPUTE_PARAMETERS . And it is set in 
this way:


/* now set size */
tb->size = size;

if (tb->size == SH_MAX_SIZE)
tb->sizemask = 0;
else
tb->sizemask = tb->size - 1;

  that means, when we are resizing to SH_MAX_SIZE, sizemask becomes 
zero.


I think that was intended to be ~0.


I believe so.


Ahh... ok, patch is updated to fix this as well.


Any chance you'd write a test for simplehash with such huge amount of
values? It'd require a small bit of trickery to be practical. On 
systems

with MAP_NORESERVE it should be feasible.


How should C structures be tested in postgres?
dynahash/simplehash - do they have any tests currently?
I'll do it if given a hint.


 static inline void
-SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
+SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
 {
uint64  size;

@@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 
newsize)


/* now set size */
tb->size = size;
-
-   if (tb->size == SH_MAX_SIZE)
-   tb->sizemask = 0;
-   else
-   tb->sizemask = tb->size - 1;
+   tb->sizemask = (uint32)(size - 1);


ISTM using ~0 would be nicer here?


I don't think so.
To be rigorous it should be `~(uint32)0`.
But I believe it doesn't differ from `tb->sizemask = (uint32)(size - 1)`,
which is what lands with the patch, so why is the `if` needed?



Greetings,

Andres Freund





Re: Bug in huge simplehash

2021-08-10 Thread Yura Sokolov

Ranier Vilela wrote 2021-08-10 14:21:

Em ter., 10 de ago. de 2021 às 05:53, Yura Sokolov
 escreveu:



I went to check SH_GROW and It is `SH_GROW(SH_TYPE *tb, uint32
newsize)`
:-(((
Therefore when `tb->size == SH_MAX_SIZE/2` and we call `SH_GROW(tb,
tb->size * 2)`,
then SH_GROW(tb, 0) is called due to truncation.
And SH_COMPUTE_PARAMETERS is also accepts `uint32 newsize`.

Ahh... ok, patch is updated to fix this as well.


It seems that we need to fix the function prototype too.

/* void _grow(_hash *tb) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize); +SH_SCOPE void
SH_GROW(SH_TYPE * tb, uint64 newsize);


Ahh... Thank you, Ranier.
Attached v2.

regards,

-

Yura Sokolov
From 82f449896d62be8440934d955d4e368f057005a6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Tue, 10 Aug 2021 11:51:16 +0300
Subject: [PATCH] Fix new size and sizemask computation in simplehash.h

Fix a couple of 32/64-bit related errors in simplehash.h:
- size of SH_GROW and SH_COMPUTE_PARAMETERS arguments
- computation of tb->sizemask.
---
 src/include/lib/simplehash.h | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98e..adda5598338 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -198,8 +198,8 @@ SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
 /* void _reset(_hash *tb) */
 SH_SCOPE void SH_RESET(SH_TYPE * tb);
 
-/* void _grow(_hash *tb) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
+/* void _grow(_hash *tb, uint64 newsize) */
+SH_SCOPE void SH_GROW(SH_TYPE * tb, uint64 newsize);
 
 /*  *_insert(_hash *tb,  key, bool *found) */
 SH_SCOPE	SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found);
@@ -302,7 +302,7 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
  * the hashtable.
  */
 static inline void
-SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
+SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
 {
 	uint64		size;
 
@@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
 
 	/* now set size */
 	tb->size = size;
-
-	if (tb->size == SH_MAX_SIZE)
-		tb->sizemask = 0;
-	else
-		tb->sizemask = tb->size - 1;
+	tb->sizemask = (uint32)(size - 1);
 
 	/*
 	 * Compute the next threshold at which we need to grow the hash table
@@ -476,7 +472,7 @@ SH_RESET(SH_TYPE * tb)
  * performance-wise, when known at some point.
  */
 SH_SCOPE void
-SH_GROW(SH_TYPE * tb, uint32 newsize)
+SH_GROW(SH_TYPE * tb, uint64 newsize)
 {
 	uint64		oldsize = tb->size;
 	SH_ELEMENT_TYPE *olddata = tb->data;
-- 
2.32.0



Bug in huge simplehash

2021-08-10 Thread Yura Sokolov

Good day, hackers.

Our client caught a process stuck in tuplehash_grow. There was a query like
`select ts, count(*) from really_huge_partitioned_table group by ts`, and
the planner decided to use hash aggregation.

The backtrace shows that oldsize was 2147483648 at the moment. newsize was
optimized out, but it looks like it was SH_MAX_SIZE.

#0  0x00603d0c in tuplehash_grow (tb=0x7f18c3c284c8, 
newsize=) at ../../../src/include/lib/simplehash.h:457

hash = 2147483654
startelem = 1
curelem = 1
oldentry = 0x7f00c299e0d8
oldsize = 2147483648
olddata = 0x7f00c299e048
newdata = 0x32e0448
i = 6
copyelem = 6

EXPLAIN shows that there are 2604186278 rows in all partitions, but the
planner thinks there will be only 200 unique rows after GROUP BY. Looks like
we were mistaken.

 Finalize GroupAggregate  (cost=154211885.42..154211936.09 rows=200 
width=16)

Group Key: really_huge_partitioned_table.ts
->  Gather Merge  (cost=154211885.42..154211932.09 rows=400 
width=16)

  Workers Planned: 2
  ->  Sort  (cost=154210885.39..154210885.89 rows=200 width=16)
Sort Key: really_huge_partitioned_table.ts
->  Partial HashAggregate  
(cost=154210875.75..154210877.75 rows=200 width=16)

  Group Key: really_huge_partitioned_table.ts
  ->  Append  (cost=0.43..141189944.36 
rows=2604186278 width=8)
->  Parallel Index Only Scan using 
really_huge_partitioned_table_001_idx2 on 
really_huge_partitioned_table_001  (cost=0.43..108117.92 rows=2236977 
width=8)
->  Parallel Index Only Scan using 
really_huge_partitioned_table_002_idx2 on 
really_huge_partitioned_table_002  (cost=0.43..114928.19 rows=2377989 
width=8)

 and more than 400 partitions more

After some investigation I found a bug that has been present in simplehash
from the beginning:

- sizemask is set only in SH_COMPUTE_PARAMETERS, and it is set this way:


/* now set size */
tb->size = size;

if (tb->size == SH_MAX_SIZE)
tb->sizemask = 0;
else
tb->sizemask = tb->size - 1;

  that means that when we are resizing to SH_MAX_SIZE, sizemask becomes
  zero.


- then sizemask is used in SH_INITIAL_BUCKET and SH_NEXT to compute the
  initial and next positions:

  SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
return hash & tb->sizemask;
  SH_NEXT(SH_TYPE * tb, uint32 curelem, uint32 startelem)
curelem = (curelem + 1) & tb->sizemask;

- and then SH_GROW stuck in element placing loop:

  startelem = SH_INITIAL_BUCKET(tb, hash);
  curelem = startelem;
  while (true)
curelem = SH_NEXT(tb, curelem, startelem);

There is an Assert(curelem != startelem) in SH_NEXT, but since no one tested
it with 2 billion elements, it was never triggered. And Assert is not compiled
into production code.
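To illustrate the effect (a standalone toy program, not PostgreSQL code):
with sizemask == 0 every probe collapses to bucket 0, so the placement loop
can never advance.

#include <stdint.h>
#include <stdio.h>

/* Mimics SH_INITIAL_BUCKET/SH_NEXT with the sizemask produced at SH_MAX_SIZE. */
static const uint32_t sizemask = 0;

static uint32_t initial_bucket(uint32_t hash) { return hash & sizemask; }
static uint32_t next_bucket(uint32_t cur) { return (cur + 1) & sizemask; }

int
main(void)
{
    uint32_t    cur = initial_bucket(0xdeadbeef);

    for (int i = 0; i < 5; i++)
    {
        cur = next_bucket(cur);
        printf("probe %d -> bucket %u\n", i, cur);  /* always bucket 0: stuck */
    }
    return 0;
}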

The attached patch fixes it by removing the condition and casting the type:

/* now set size */
tb->size = size;
tb->sizemask = (uint32)(size - 1);


OOPS

While writing this letter, I looked at newdata in the frame of
tuplehash_grow:

newdata = 0x32e0448

It is below the 4GB boundary. The allocator does not allocate multi-gigabyte
chunks (and we certainly need 96GB in this case) in the sub-4GB address space,
because mmap doesn't do that.

I went to check SH_GROW and it is `SH_GROW(SH_TYPE *tb, uint32 newsize)`
:-(((
Therefore when `tb->size == SH_MAX_SIZE/2` and we call
`SH_GROW(tb, tb->size * 2)`, SH_GROW(tb, 0) is called due to truncation.
And SH_COMPUTE_PARAMETERS also accepts `uint32 newsize`.
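The truncation itself is easy to reproduce standalone (again a toy sketch,
not the PostgreSQL code; SH_MAX_SIZE is 2^32 as in simplehash.h):

#include <stdint.h>
#include <stdio.h>

#define SH_MAX_SIZE ((uint64_t) UINT32_MAX + 1)     /* 2^32 */

/* Stands in for SH_GROW's uint32 newsize parameter. */
static void
grow_u32(uint32_t newsize)
{
    printf("SH_GROW sees newsize = %u\n", newsize);
}

int
main(void)
{
    uint64_t    size = SH_MAX_SIZE / 2; /* 2^31, the size before the last doubling */

    grow_u32(size * 2);                 /* 2^32 is silently truncated to 0 */
    return 0;
}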

Ahh... ok, patch is updated to fix this as well.


regards.

-

Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From a8283d3a17c630a65e1b42f8617e07873a30fbc7 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Tue, 10 Aug 2021 11:51:16 +0300
Subject: [PATCH] Fix new size and sizemask computation in simplehash.h

Fix a couple of 32/64-bit related errors in simplehash.h:
- size of SH_GROW and SH_COMPUTE_PARAMETERS arguments
- computation of tb->sizemask.
---
 src/include/lib/simplehash.h | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98e..2287601cfa1 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -302,7 +302,7 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
  * the hashtable.
  */
 static inline void
-SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
+SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
 {
 	uint64		size;
 
@@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
 
 	/* now set size */
 	tb->size = size;
-
-	if (tb->size == SH_MAX_SIZE)
-		tb->sizemask = 0;
-	else
-		tb->sizemask = tb->size - 1;
+	tb->sizemask = (ui

Re: RFC: Improve CPU cache locality of syscache searches

2021-08-05 Thread Yura Sokolov

Andres Freund wrote 2021-08-06 06:49:

Hi,

On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:
Why don't use simplehash or something like that? Open-addressing 
schemes

show superior cache locality.


I thought about that as well - but it doesn't really resolve the 
question of
what we want to store in-line in the hashtable and what not. We can't 
store
the tuples themselves in the hashtable for a myriad of reasons (need 
pointer
stability, they're variably sized, way too large to move around 
frequently).



Well, simplehash entry will be 24 bytes this way. If simplehash 
template
supports external key/element storage, then it could be shrunk to 16 
bytes,

and syscache entries will not need dlist_node. (But it doesn't at the
moment).


I think storing keys outside of the hashtable entry defeats the purpose 
of the
open addressing, given that they are always checked and that our 
conflict

ratio should be fairly low.


It's the opposite: if the conflict ratio were high, then a key outside the
hashtable would be expensive, since a lookup on a non-matching key would cost
an extra memory access. But with a low conflict ratio we will usually hit the
matching entry on the first probe. And since we will use the entry soon, it
doesn't matter whether it goes into the CPU L1 cache during the lookup or
during actual use.

regards,
Yura Sokolov




Re: RFC: Improve CPU cache locality of syscache searches

2021-08-05 Thread Yura Sokolov

Andres Freund wrote 2021-08-05 23:12:

Hi,

On 2021-08-05 12:27:49 -0400, John Naylor wrote:
On Wed, Aug 4, 2021 at 3:44 PM Andres Freund  
wrote:

> On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > typedef struct cc_bucket
> > {
> >   uint32 hashes[4];
> >   catctup *ct[4];
> >   dlist_head;
> > };
>
> I'm not convinced that the above the right idea though. Even if the hash
> matches, you're still going to need to fetch at least catctup->keys[0]
from
> a separate cacheline to be able to return the cache entry.

I see your point. It doesn't make sense to inline only part of the
information needed.


At least not for the unconditionally needed information.


Although I'm guessing inlining just two values in the 4-key case 
wouldn't

buy much.


Not so sure about that. I'd guess that two key comparisons take more 
cycles
than a cacheline fetch the further keys (perhaps not if we had inlined 
key
comparisons). I.e. I'd expect out-of-order + speculative execution to 
hide the

latency for fetching the second cacheline for later key values.



> If we stuffed four values into one bucket we could potentially SIMD the
hash
> and Datum comparisons ;)

;-) That's an interesting future direction to consider when we support
building with x86-64-v2. It'd be pretty easy to compare a vector of 
hashes
and quickly get the array index for the key comparisons (ignoring for 
the

moment how to handle the rare case of multiple identical hashes).
However, we currently don't memcmp() the Datums and instead call an
"eqfast" function, so I don't see how that part  would work in a 
vector

setting.


It definitely couldn't work unconditionally - we have to deal with 
text,

oidvector, comparisons after all. But we could use it for the other
types. However, looking at the syscaches, I think it'd not very often 
be

applicable for caches with enough columns.


I have wondered before whether we should have syscache definitions 
generate
code specific to each syscache definition. I do think that'd give a 
good bit
of performance boost. But I don't see a trivial way to get there 
without

notational overhead.

We could define syscaches in a header using a macro that's defined 
differently

in syscache.c than everywhere else. The header would declare a set of
functions for each syscache, syscache.c would define them to call an
always_inline function with the relevant constants.

Or perhaps we should move syscache definitions into the pg_*.h headers, 
and
generate the relevant code as part of their processing. That seems like 
it
could be nice from a modularity POV alone. And it could do better than 
the
current approach, because we could hardcode the types for columns in 
the

syscache definition without increasing notational overhead.


Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.

It could be combined: use hashValue as the key in simplehash and a dlist for
hashValue collision handling. 99.99% of the time there will be no collisions
on the hashValue itself, therefore it will almost always be two or three
memory lookups: one or two for the dlist_head in simplehash by hashValue, and
then fetching the first element of the dlist.

And the code will remain almost the same; just the "bucket" search will change
a bit.

(And I'd recommend a lower fill factor for this simplehash: at most 0.85.)


Well, a simplehash entry will be 24 bytes this way. If the simplehash template
supported external key/element storage, it could be shrunk to 16 bytes,
and syscache entries would not need a dlist_node (but it doesn't support
that at the moment).

And a custom open-addressing table could get down to 8 bytes per element
(see the sketch below):
- an element is a pointer to the entry,
- a missing element is encoded as NULL,
- with a fill factor as low as 0.66 there will be a small number of collisions,
  therefore a non-empty slot will be the matching entry most of the time, and
  the memory lookup for the key comparison is amortized free.
Note that an 8-byte entry at fill factor 0.66 consumes an amortized 12.12 bytes,
while a 16-byte entry at fill factor 0.99 consumes an amortized 16.16 bytes.
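A purely hypothetical sketch of that 8-bytes-per-slot layout (the names are
illustrative, not the real catcache structures):

#include <stdint.h>
#include <stddef.h>

typedef struct CatCTupStub
{
    uint32_t    hash_value;     /* full hash kept in the entry itself */
    /* ... keys and the cached tuple would live here ... */
} CatCTupStub;

typedef struct CCOpenTable
{
    uint32_t      sizemask;     /* nslots - 1, nslots is a power of two */
    CatCTupStub **slots;        /* one pointer (8 bytes) per slot; NULL = empty */
} CCOpenTable;

static CatCTupStub *
cc_lookup(CCOpenTable *tb, uint32_t hash)
{
    uint32_t    i = hash & tb->sizemask;

    /*
     * With a fill factor around 0.66 most probes hit either NULL or the
     * matching entry immediately, so the dereference for the key comparison
     * is usually the access we would need anyway.
     */
    for (;;)
    {
        CatCTupStub *e = tb->slots[i];

        if (e == NULL)
            return NULL;                /* not present */
        if (e->hash_value == hash)
            return e;                   /* caller still compares the keys */
        i = (i + 1) & tb->sizemask;     /* linear probing */
    }
}

The amortized figures above are just entry_size / fill_factor: 8 / 0.66 is
about 12.12 bytes, and 16 / 0.99 is about 16.16 bytes.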

regards,
Yura Sokolov

y.soko...@postgrespro.ru
funny.fal...@gmail.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-29 Thread Yura Sokolov

Robert Haas wrote 2021-07-29 20:15:
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada  
wrote:

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.


What I'm about to say might be a really stupid idea, especially since
I haven't looked at any of the code already posted, but what I'm
wondering about is whether we need a full radix tree or maybe just a
radix-like lookup aid. For example, suppose that for a relation <= 8MB
in size, we create an array of 1024 elements indexed by block number.
Each element of the array stores an offset into the dead TID array.
When you need to probe for a TID, you look up blkno and blkno + 1 in
the array and then bsearch only between those two offsets. For bigger
relations, a two or three level structure could be built, or it could
always be 3 levels. This could even be done on demand, so you
initialize all of the elements to some special value that means "not
computed yet" and then fill them the first time they're needed,
perhaps with another special value that means "no TIDs in that block".


An 8MB relation is not a problem, imo. There is no need to do anything special
to handle an 8MB relation.

The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G dead
tuples.

Then the per-page offset array will be 2GB and the tuple offset array will be
6GB (a 2-byte offset number per tuple): 8GB in total.
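To spell out the arithmetic behind those numbers (assuming 8 KB pages and an
8-byte per-page offset into the dead-TID array; both are my assumptions):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t    pages = (2ULL << 40) / (8 << 10);       /* 2 TB / 8 KB = 256M pages */
    uint64_t    offset_array = pages * 8;               /* one 8-byte offset per page: ~2 GB */
    uint64_t    tuple_array = 3ULL * 1000000000 * 2;    /* 3G dead tuples, 2 bytes each: ~6 GB */

    printf("%.1f GB in total\n", (double) (offset_array + tuple_array) / 1e9);
    return 0;
}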

We could make the offset array cover only the upper 3 bytes of the block
number. We would then have a 1M-element offset array weighing 8MB, and an
array of 3-byte tuple pointers (1 remaining byte from the block number and
2 bytes from the tuple offset) weighing 9GB.

But using per-batch compression schemes, the cost could be an amortized
4 bytes per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory.
Yes, it is not as guaranteed as with the array approach, but 95% of the time
it is that low or even lower. And better: the more tuples are dead, the better
the compression works. A page with all tuples dead can be encoded in as little
as 5 bytes. Therefore, overall memory consumption is more stable and
predictable.

Lower memory consumption of tuple storage means there is less chance
indexes should be scanned twice or more times. It gives more
predictability in user experience.


I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.


The whole thing could be encoded in one single array of bytes. Just give
"pointer-to-array" + "array-size" to the constructor, and use a "bump
allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness", at least if it is suitably designed.

In fact, my code uses a bump allocator internally to avoid the "per-allocation
overhead" of "aset", "slab" or "generational". And the IntegerSet2 version
even uses it for all allocations, since it has no reallocatable parts.

Well, if a data structure has reallocatable parts, it could be less friendly
to DSM.

regards,

---
Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-29 Thread Yura Sokolov

Yura Sokolov wrote 2021-07-29 18:29:


I've attached IntegerSet2 patch for pgtools repo and benchmark results.
Branch https://github.com/funny-falcon/pgtools/tree/integerset2


Strange web-mail client... I can never be sure what it will attach...

Reattaching the benchmark results.



regards,

Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com
* Test 1
select prepare(
100, -- max block
10, -- # of dead tuples per page
20, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 57.23 MB  |  0.029 |   96.650 | 572.21 MB  | _ | ___
 intset | 46.88 MB  |  0.095 |   82.797 | 468.67 MB  | _ | ___
 radix  | 40.26 MB  |  0.089 |   12.082 | 401.27 MB  | 0.999 | 196.490
 rtbm   | 64.02 MB  |  0.164 |   14.300 | 512.02 MB  | 1.991 | 189.343
 svtm   | 17.78 MB  |  0.098 |   12.411 | 178.59 MB  | 0.973 | 210.901
 tbm| 96.01 MB  |  0.239 |8.486 | 768.01 MB  | 2.607 | 119.593
 intset2| 17.78 MB  |  0.220 |   19.203 | 175.84 MB  | 2.030 | 320.488

* Test 2
select prepare(
100, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 57.23 MB  |  0.030 |4.737 | 572.21 MB  | _ | ___
 intset | 46.88 MB  |  0.096 |4.077 | 468.67 MB  | _ | ___
 radix  | 9.95 MB   |  0.051 |0.752 | 98.38 MB   | 0.531 |  18.642
 rtbm   | 34.02 MB  |  0.166 |0.424 | 288.02 MB  | 2.036 |   6.294
 svtm   | 5.78 MB   |  0.037 |0.203 | 54.60 MB   | 0.400 |   5.682
 tbm| 96.01 MB  |  0.245 |0.423 | 768.01 MB  | 2.624 |   5.848
 intset2| 6.27 MB   |  0.097 |0.414 | 61.29 MB   | 1.002 |  12.823

* Test 3
select prepare(
100, -- max block
2, -- # of dead tuples per page
100, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 11.45 MB  |  0.006 |   53.623 | 114.45 MB  | _ | ___
 intset | 15.63 MB  |  0.024 |   56.291 | 156.23 MB  | _ | ___
 radix  | 40.26 MB  |  0.055 |   10.928 | 401.27 MB  | 0.622 | 173.953
 rtbm   | 36.02 MB  |  0.116 |8.452 | 320.02 MB  | 1.370 | 125.438
 svtm   |  9.28 MB  |  0.069 |6.780 | 102.10 MB  | 0.690 | 167.220
 tbm| 96.01 MB  |  0.188 |8.569 | 768.01 MB  | 1.935 | 118.841
 intset2|  4.28 MB  |  0.022 |5.624 |  42.04 MB  | 0.229 | 146.096

* Test 4
select prepare(
100, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 572.21 MB |  0.304 |   76.625 | 5722.05 MB | _ | ___
 intset | 93.74 MB  |  0.795 |   47.977 | 937.34 MB  | _ | ___
 radix  | 40.26 MB  |  0.264 |6.322 | 401.27 MB  | 2.674 | 101.887
 rtbm   | 36.02 MB  |  0.360 |4.195 | 320.02 MB  | 4.009 |  62.441
 svtm   | 7.28 MB   |  0.329 |2.729 | 73.60 MB   | 3.271 |  74.725
 tbm| 96.01 MB  |  0.564 |4.220 | 768.01 MB  | 5.711 |  59.047
 svtm   | 7.28 MB   |  0.908 |5.826 | 70.55 MB   | 9.089 | 137.902


* Test 5
select prepare(
100, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
2 -- page interval
);

There are 1 consecutive pages that have 150 dead tuples at every
2 pages.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 429.16 MB |  0.225 |   65.404 | 4291.54 MB | _ | ___
 intset | 46.88 MB  |  0.510 |   39.460 | 468.67 MB  | _ | ___
 radix  | 20.26 MB  |  0.192 |6.440 | 201.48 MB  | 1.954 | 111.539
 rtbm   | 18.02 MB  |  0.225 |7.077 | 160.02 MB  | 2.515 | 120.816
 svtm   | 3.66 MB   |  0.233 |3.493 | 37.10 MB   | 2.348 |  73.092
 tbm| 48.01 MB  |  0.341 |9.146 | 384.01 MB  | 3.544 | 161.435
 intset2| 3.78 MB   |  0.671 |5.081 | 35.29 MB   | 6.791 | 117.617

* Test 6
select prepare(
100, -- max block
10

Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-29 Thread Yura Sokolov

Masahiko Sawada wrote 2021-07-29 17:29:
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov  
wrote:


Masahiko Sawada wrote 2021-07-29 12:11:
> On Thu, Jul 29, 2021 at 3:53 AM Andres Freund 
> wrote:
>>
>> Hi,
>>
>> On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
>> > Apart from performance and memory usage points of view, we also need
>> > to consider the reusability of the code. When I started this thread, I
>> > thought the best data structure would be the one optimized for
>> > vacuum's dead tuple storage. However, if we can use a data structure
>> > that can also be used in general, we can use it also for other
>> > purposes. Moreover, if it's too optimized for the current TID system
>> > (32 bits block number, 16 bits offset number, maximum block/offset
>> > number, etc.) it may become a blocker for future changes.
>>
>> Indeed.
>>
>>
>> > In that sense, radix tree also seems good since it can also be used in
>> > gist vacuum as a replacement for intset, or a replacement for hash
>> > table for shared buffer as discussed before. Are there any other use
>> > cases?
>>
>> Yes, I think there are. Whenever there is some spatial locality it has
>> a
>> decent chance of winning over a hash table, and it will most of the
>> time
>> win over ordered datastructures like rbtrees (which perform very
>> poorly
>> due to the number of branches and pointer dispatches). There's plenty
>> hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
>> degree of locality, so I'd expect a few potential uses. When adding
>> "tree compression" (i.e. skip inner nodes that have a single incoming
>> &
>> outgoing node) radix trees even can deal quite performantly with
>> variable width keys.
>
> Good point.
>
>>
>> > On the other hand, I’m concerned that radix tree would be an
>> > over-engineering in terms of vacuum's dead tuples storage since the
>> > dead tuple storage is static data and requires only lookup operation,
>> > so if we want to use radix tree as dead tuple storage, I'd like to see
>> > further use cases.
>>
>> I don't think we should rely on the read-only-ness. It seems pretty
>> clear that we'd want parallel dead-tuple scans at a point not too far
>> into the future?
>
> Indeed. Given that the radix tree itself has other use cases, I have
> no concern about using radix tree for vacuum's dead tuples storage. It
> will be better to have one that can be generally used and has some
> optimizations that are helpful also for vacuum's use case, rather than
> having one that is very optimized only for vacuum's use case.

Main portion of svtm that leads to memory saving is compression of 
many
pages at once (CHUNK). It could be combined with radix as a storage 
for

pointers to CHUNKs.

For a moment I'm benchmarking IntegerSet replacement based on Trie 
(HATM

like)
and CHUNK compression, therefore datastructure could be used for gist
vacuum as well.

Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.


I've attached the IntegerSet2 patch for the pgtools repo and benchmark results.
Branch: https://github.com/funny-falcon/pgtools/tree/integerset2

SVTM is measured with a couple of changes from commit 5055ef72d23482dd3e11ce
in that branch: 1) the bitmap is compressed more often, but more slowly,
2) a couple of popcount tricks.

IntegerSet2 consists of a trie index over CHUNKS. A CHUNK is a compressed
bitmap of 2^15 (6+9) bits (almost like in SVTM, but for a fixed bit width).

Well, IntegerSet2 is always faster than IntegerSet and always uses
significantly less memory (radix uses more memory than IntegerSet in a
couple of tests and comparable amounts in the others).

IntegerSet2 is not always faster than radix; it behaves more like radix.
That is because both are generic prefix trees with a comparable number of
memory accesses. SVTM did the trick by not being a multilevel prefix tree,
but just a one-level bitmap index over chunks.

I believe the trie part of IntegerSet2 could be replaced with radix,
i.e. use radix as the storage for pointers to CHUNKS.


BTW, how does svtm work when we add two sets of dead tuple TIDs to one
svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
of the different offsets on the same block. The case I imagine is the
idea discussed on this thread[1]. With this idea, we store the
collected dead tuple TIDs somewhere and skip index vacuuming for some
reason (index skipping optimization, failsafe mode, or interruptions
etc.). Then, in the next lazy vacuum timing, we load the dead tuple
TIDs and start to scan the heap. During the heap scan in the second
lazy vacuum, it's possible that new dead tu

Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-29 Thread Yura Sokolov

Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund  
wrote:


Hi,

On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
> Apart from performance and memory usage points of view, we also need
> to consider the reusability of the code. When I started this thread, I
> thought the best data structure would be the one optimized for
> vacuum's dead tuple storage. However, if we can use a data structure
> that can also be used in general, we can use it also for other
> purposes. Moreover, if it's too optimized for the current TID system
> (32 bits block number, 16 bits offset number, maximum block/offset
> number, etc.) it may become a blocker for future changes.

Indeed.


> In that sense, radix tree also seems good since it can also be used in
> gist vacuum as a replacement for intset, or a replacement for hash
> table for shared buffer as discussed before. Are there any other use
> cases?

Yes, I think there are. Whenever there is some spatial locality it has 
a
decent chance of winning over a hash table, and it will most of the 
time
win over ordered datastructures like rbtrees (which perform very 
poorly

due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming 
&

outgoing node) radix trees even can deal quite performantly with
variable width keys.


Good point.



> On the other hand, I’m concerned that radix tree would be an
> over-engineering in terms of vacuum's dead tuples storage since the
> dead tuple storage is static data and requires only lookup operation,
> so if we want to use radix tree as dead tuple storage, I'd like to see
> further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?


Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.


The main portion of svtm that leads to memory savings is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.

At the moment I'm benchmarking an IntegerSet replacement based on a trie
(HAMT-like) and CHUNK compression, so the data structure could be used for
gist vacuum as well.

Since it is generic (it can index the whole 64-bit space) it lacks the trick
used to speed up svtm. Still, on the 10x test it is faster than radix.

I'll send results later today after all benchmarks complete.

And then I'll try to make a mix of radix and CHUNK compression.


During the performance benchmark, I found some bugs in the radix tree
implementation.


There is a bug in radix_to_key_off as well:

tid_i |= ItemPointerGetBlockNumber(tid) << shift;

ItemPointerGetBlockNumber returns uint32, therefore the result after the
shift is uint32 as well.

It leads to lower memory consumption (and therefore better times) in the
10x test once the page number exceeds 2^23 (8M). It still produces a
"correct" result for the test since every page is filled in the same way.
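A one-line cast avoids the truncation (a sketch using the names from the
snippet above):

/* Promote the block number to 64 bit before shifting, so the upper bits survive. */
tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;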

Could you push your fixes for radix, please?

regards,
Yura Sokolov

y.soko...@postgrespro.ru
funny.fal...@gmail.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-27 Thread Yura Sokolov
 MB  |  0.344 |9.763 | 384.01 MB  | 3.327 | 
151.824


* Test 6
select prepare(
100, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
2 -- page interval
);

There are 1 consecutive pages that have 10 dead tuples at every 
2 pages.


  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 28.62 MB  |  0.022 |    2.791 | 286.11 MB  |      0.170 |       46.920
 intset | 23.45 MB  |  0.061 |    2.156 | 234.34 MB  |      0.501 |       32.577
 radix  | 5.04 MB   |  0.026 |    0.433 | 48.57 MB   |      0.191 |       11.060
 rtbm   | 17.02 MB  |  0.074 |    0.533 | 144.02 MB  |      0.954 |       11.502
 svtm   | 3.16 MB   |  0.023 |    0.206 | 27.60 MB   |      0.175 |        4.886
 tbm    | 48.01 MB  |  0.132 |    0.656 | 384.01 MB  |      1.284 |       10.231


* Test 7
select prepare(
100, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1000, -- # of consecutive pages having dead tuples
999000 -- page interval
);

There are pages that have 150 dead tuples at first 1000 blocks and
last 1000 blocks.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 1.72 MB   |  0.002 |    7.507 | 17.17 MB   |      0.011 |       76.510
 intset | 0.20 MB   |  0.003 |    6.742 | 1.89 MB    |      0.022 |       52.122
 radix  | 0.20 MB   |  0.001 |    1.023 | 1.07 MB    |      0.007 |       12.023
 rtbm   | 0.15 MB   |  0.001 |    2.637 | 0.65 MB    |      0.009 |       34.528
 svtm   | 0.52 MB   |  0.002 |    0.721 | 0.61 MB    |      0.010 |        6.434
 tbm    | 0.20 MB   |  0.002 |    2.733 | 1.51 MB    |      0.015 |       38.538


* Test 8
select prepare(
100, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
50, -- # of consecutive pages having dead tuples
100 -- page interval
);

There are 50 consecutive pages that have 100 dead tuples at every 100 
pages.


  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 286.11 MB |  0.184 |   67.233 | 2861.03 MB |      1.743 |      979.070
 intset | 46.88 MB  |  0.389 |   35.176 | 468.67 MB  |      3.698 |      505.322
 radix  | 21.82 MB  |  0.116 |    6.160 | 186.86 MB  |      0.891 |      117.730
 rtbm   | 18.02 MB  |  0.182 |    5.909 | 160.02 MB  |      1.870 |      112.550
 svtm   | 4.28 MB   |  0.152 |    3.213 | 37.60 MB   |      1.383 |       79.073
 tbm    | 48.01 MB  |  0.265 |    6.673 | 384.01 MB  |      2.586 |      101.327


Overall, 'svtm' is faster and consumes less memory. 'radix' tree also
has good performance and memory usage.

From these results, svtm is the best data structure among proposed
ideas for dead tuple storage used during lazy vacuum in terms of
performance and memory usage. I think it can support iteration by
extracting the offset of dead tuples for each block while iterating
chunks.

Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases? On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.


I can certainly evolve svtm into a transparent intset replacement. Using the
same trick as radix_to_key, it will store TIDs efficiently:

  shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
  tid_i = ItemPointerGetOffsetNumber(tid);
  tid_i |= ItemPointerGetBlockNumber(tid) << shift;

Will do this evening.

regards
Yura Sokolov aka funny_falcon




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-25 Thread Yura Sokolov
63ms42MB6.05s
svtm   360ms   8MB   3.49s

Therefore the Specialized Vacuum Tid Map always consumes the least amount of
memory and is usually faster.


(I've applied Andres's patch for slab allocator before testing)

The attached patch is against commit 6753911a444e12e4b55 of your pgtools,
with Andres's patches for the radix method applied.

I've also pushed it to github:
https://github.com/funny-falcon/pgtools/tree/svtm/bdbench

regards,
Yura Sokolov
From 3a6c96cc705b1af412cf9300be6f676f6c5e4aa6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Sun, 25 Jul 2021 03:06:48 +0300
Subject: [PATCH] svtm - specialized vacuum tid map

---
 bdbench/Makefile  |   2 +-
 bdbench/bdbench.c |  91 ++-
 bdbench/bench.sql |   2 +
 bdbench/svtm.c| 635 ++
 bdbench/svtm.h|  19 ++
 5 files changed, 746 insertions(+), 3 deletions(-)
 create mode 100644 bdbench/svtm.c
 create mode 100644 bdbench/svtm.h

diff --git a/bdbench/Makefile b/bdbench/Makefile
index 723132a..a6f758f 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
 
 MODULE_big = bdbench
 DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o radix.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o
 
 EXTENSION = bdbench
 REGRESS= bdbench
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 85d8eaa..a8bc49a 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -7,6 +7,7 @@
 
 #include "postgres.h"
 
+#include 
 #include "catalog/index.h"
 #include "fmgr.h"
 #include "funcapi.h"
@@ -20,6 +21,7 @@
 #include "vtbm.h"
 #include "rtbm.h"
 #include "radix.h"
+#include "svtm.h"
 
 //#define DEBUG_DUMP_MATCHED 1
 
@@ -148,6 +150,15 @@ static bool radix_reaped(LVTestType *lvtt, ItemPointer itemptr);
 static Size radix_mem_usage(LVTestType *lvtt);
 static void radix_load(void *tbm, ItemPointerData *itemptrs, int nitems);
 
+/* svtm */
+static void svtm_init(LVTestType *lvtt, uint64 nitems);
+static void svtm_fini(LVTestType *lvtt);
+static void svtm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+		 BlockNumber maxblk, OffsetNumber maxoff);
+static bool svtm_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size svtm_mem_usage(LVTestType *lvtt);
+static void svtm_load(SVTm *tbm, ItemPointerData *itemptrs, int nitems);
+
 
 /* Misc functions */
 static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
@@ -174,7 +185,7 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
 		.mem_usage_fn = n##_mem_usage, \
 			}
 
-#define TEST_SUBJECT_TYPES 6
+#define TEST_SUBJECT_TYPES 7
 static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
 {
 	DECLARE_SUBJECT(array),
@@ -182,7 +193,8 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
 	DECLARE_SUBJECT(intset),
 	DECLARE_SUBJECT(vtbm),
 	DECLARE_SUBJECT(rtbm),
-	DECLARE_SUBJECT(radix)
+	DECLARE_SUBJECT(radix),
+	DECLARE_SUBJECT(svtm)
 };
 
 static bool
@@ -756,6 +768,81 @@ radix_load(void *tbm, ItemPointerData *itemptrs, int nitems)
 	}
 }
 
+/*  svtm --- */
+static void
+svtm_init(LVTestType *lvtt, uint64 nitems)
+{
+	MemoryContext old_ctx;
+
+	lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+	   "svtm bench",
+	   ALLOCSET_DEFAULT_SIZES);
+	old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+	lvtt->private = svtm_create();
+	MemoryContextSwitchTo(old_ctx);
+}
+
+static void
+svtm_fini(LVTestType *lvtt)
+{
+	if (lvtt->private != NULL)
+		svtm_free(lvtt->private);
+}
+
+static void
+svtm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+			   BlockNumber maxblk, OffsetNumber maxoff)
+{
+	MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+	svtm_load(lvtt->private,
+			   DeadTuples_orig->itemptrs,
+			   DeadTuples_orig->dtinfo.nitems);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+static bool
+svtm_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+	return svtm_lookup(lvtt->private, itemptr);
+}
+
+static uint64
+svtm_mem_usage(LVTestType *lvtt)
+{
+	svtm_stats((SVTm *) lvtt->private);
+	return MemoryContextMemAllocated(lvtt->mcxt, true);
+}
+
+static void
+svtm_load(SVTm *svtm, ItemPointerData *itemptrs, int nitems)
+{
+	BlockNumber curblkno = InvalidBlockNumber;
+	OffsetNumber offs[1024];
+	int noffs = 0;
+
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemPointer tid = &(itemptrs[i]);
+		BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+		if (curblkno != InvalidBlockNumber &&
+			curblkno != blkno)
+		{
+			svtm_add_page(svtm, curblkno, offs, noffs);
+			curblkno = blkno;
+			noffs = 0;
+		}
+
+		curblkno = blkno;
+		offs[noffs++] = ItemPointerGetOffsetNumber(tid);
+	}
+
+	svtm_add_page(svtm, curblkno, offs, noffs);
+	svtm_finalize_addition(svtm);
+}
+
 
 static void
 attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index 94cfde0..b30

Re: rand48 replacement

2021-07-06 Thread Yura Sokolov

Fabien COELHO wrote 2021-07-06 23:49:

Hello Yura,


However, I'm not enthousiastic at combining two methods depending on
the range, the function looks complex enough without that, so I would
suggest not to take this option. Also, the decision process adds to
the average cost, which is undesirable.


Given 99.99% cases will be in the likely case, branch predictor should
eliminate decision cost.


Hmmm. ISTM that a branch predictor should predict that unknown < small
should probably be false, so a hint should be given that it is really
true.


Why? The branch predictor is history based: if the branch was usually true
here, then it will be predicted true this time as well.
`unknown < small` is usually true, therefore the branch predictor will assume
it is true.

I put `likely` there for the compiler: the compiler then places the `likely`
path closer.



And as Dean Rasheed correctly mentioned, mask method will have really 
bad pattern for branch predictor if range is not just below or equal 
to power of two.


On average the bitmask is the better unbiased method, if the online
figures are to be trusted. Also, as already said, I do not really want
to add code complexity, especially to get lower average performance,
and especially with code like "threshold = - range % range", where
both variables are unsigned, I have a headache just looking at it:-)


If you mean https://www.pcg-random.org/posts/bounded-rands.html, then:
1. The first graphs were made with code that is not exactly Lemire's.
   The latest code from https://lemire.me/blog/2016/06/30/fast-random-shuffling/
   (which I derived mine from) performs the modulo operation only if
   `(leftover < range)`. Even with `rand_range(100)` that happens just once
   in four thousand runs. (A sketch of that variant is below.)
2. You can see the "Small-Constant Benchmark" on that page: Debiased Int is
   1.5 times faster. And even in the "Small-Shuffle" benchmark their
   unoptimized version is on par with the mask method.
3. If you go to the "Making Improvements/Faster Threshold-Based Discarding"
   section, you'll see the code my version matches. It is twice as fast as
   the masked method in the Small-Shuffle benchmark, and just a bit slower
   in Large-Shuffle.

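For reference, a sketch of that variant for a 64-bit output, following the
structure of the code behind the second link in point 1; next_u64() stands
for the underlying generator and the 128-bit multiply assumes a compiler with
__uint128_t (both are assumptions here):

#include <stdint.h>

extern uint64_t next_u64(void);         /* the underlying PRNG; assumed here */

/*
 * Unbiased multiply ("Lemire") method: the expensive modulo is computed only
 * when the low half of the 128-bit product falls below range, which is rare
 * for small ranges.  range must be greater than zero.
 */
static uint64_t
bounded_u64(uint64_t range)
{
    uint64_t    x = next_u64();
    __uint128_t m = (__uint128_t) x * range;
    uint64_t    l = (uint64_t) m;

    if (l < range)
    {
        uint64_t    t = -range % range; /* == (2^64 - range) % range */

        while (l < t)
        {
            x = next_u64();
            m = (__uint128_t) x * range;
            l = (uint64_t) m;
        }
    }
    return (uint64_t) (m >> 64);
}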


And __builtin_clzl is not free lunch either, it has latency 3-4 cycles
on modern processor.


Well, % is not cheap either.

Well, probably it could run in parallel with some part of xoroshiro, 
but it depends on how compiler will optimize this function.


I would certainly select the unbias multiply method if we want a u32 
range variant.


There could be two functions.


Yep, but do we need them? Who is likely to want 32 bits pseudo random
ints in a range? pgbench needs 64 bits.

So I'm still inclined to just keep the faster-on-average bitmask
method, despite that it may be slower for some ranges. The average
cost for the worst case in PRNG calls is, if I'm not mistaken:

  1 * 0.5 + 2 * 0.25 + 3 * 0.125 + ... ~ 2

which does not look too bad.


You don't count the cost of branch misprediction:
https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array
https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/
Therefore the calculation should be at least:

   1 * 0.5 + 0.5 * (3 + 0.5 * (3 + ...)) = 6

By the way, we have a 64-bit random value. If we use 44 bits of it for a
range <= (1<<20), then the bias will be less than 1/(2**24). Could we just
ignore it (given it is not a cryptographically strong random)?

uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
  uint64 val = xoroshiro128ss(state);
  uint64 m;
  if ((range & (range - 1)) == 0) /* handle all power-of-two cases */
    return range != 0 ? val & (range - 1) : 0;
  if (likely(range < (1 << 20)))
    /*
     * While the multiply method is biased, the bias will be smaller than
     * 1/(1<<24) for such small ranges. Let's ignore it.
     */
    return ((val >> 20) * range) >> 44;
  /* Apple's mask method */
  m = mask_u64(range - 1);
  val &= m;
  while (val >= range)
    val = xoroshiro128ss(state) & m;
  return val;
}

Anyway, excuse me for heating up this discussion over such a non-essential
issue. I'll try to restrain myself and not push it further.

regards,
Sokolov Yura.




Re: rand48 replacement

2021-07-06 Thread Yura Sokolov

Fabien COELHO wrote 2021-07-06 09:13:

Hello Yura,


I believe most "range" values are small, much smaller than UINT32_MAX.
In this case, according to [1] the fastest method is Lemire's (I'd take
the original version from [2]) [...]


Yep.

I share your point that the range is more often 32 bits.

However, I'm not enthusiastic about combining two methods depending on
the range; the function looks complex enough without that, so I would
suggest not to take this option. Also, the decision process adds to
the average cost, which is undesirable.


Given that 99.99% of cases will fall into the likely branch, the branch
predictor should eliminate the decision cost.

And as Dean Rasheed correctly mentioned, the mask method will have a really
bad pattern for the branch predictor if the range is not just below or equal
to a power of two.
For example, rand_range(10000) will have a 60% probability to pass through
`while (val >= range)` and a 40% probability to go to the next loop
iteration. rand_range(100000) will have 76%/24% probabilities. The branch
predictor doesn't like that. Even rand_range(1000000), which is quite close
to 2^20, will have 95%/5%, which is still not enough to please the BP.

But with the unbiased multiply method it will be 0.0002%/99.9998% for 10000,
0.002%/99.998% for 100000 and 0.02%/99.98% for 1000000 - much, much better.

The branch predictor will make it almost free (I hope).
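
For reference, the mask-method figures above are just range / (mask + 1); a
throwaway check (standalone, hypothetical helper name, not from any patch):

#include <stdio.h>
#include <stdint.h>

/* smallest all-ones mask covering v; v must be non-zero */
static uint64_t mask_of(uint64_t v) { return UINT64_MAX >> __builtin_clzll(v); }

int main(void)
{
	uint64_t ranges[] = {10000, 100000, 1000000};

	for (int i = 0; i < 3; i++)
	{
		uint64_t r = ranges[i];
		uint64_t m = mask_of(r - 1);

		/* chance that the first masked draw is already < r */
		printf("range=%7llu  accept=%.0f%%  retry=%.0f%%\n",
			   (unsigned long long) r,
			   100.0 * (double) r / (double) (m + 1),
			   100.0 * (1.0 - (double) r / (double) (m + 1)));
	}
	return 0;
}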

And __builtin_clzl is no free lunch either; it has a latency of 3-4 cycles
on modern processors. Well, it could probably run in parallel with some part
of xoroshiro, but that depends on how the compiler optimizes this function.

I would certainly select the unbiased multiply method if we want a u32
range variant.


There could be two functions.

regards,
Sokolov Yura.




Re: rand48 replacement

2021-07-05 Thread Yura Sokolov

Fabien COELHO wrote 2021-07-04 23:29:
The important property of determinism is that if I set a seed, and 
then make an identical set of calls to the random API, the results 
will be identical every time, so that it's possible to write tests 
with predictable/repeatable results.


Hmmm… I like my stronger determinism definition more than this one:-)

I would work around that by deriving another 128 bit generator 
instead of splitmix 64 bit, but that is overkill.


Not really relevant now, but I'm pretty sure that's impossible to do.
You might try it as an interesting academic exercise, but I believe
it's a logical impossibility.


Hmmm… I was simply thinking of seeding a new pg_prng_state from the
main pg_prng_state with some transformation, and then iterate over the
second PRNG, pretty much like I did with splitmix, but with 128 bits
so that your #states argument does not apply, i.e. something like:

 /* select in a range with bitmask rejection */
 uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
 {
/* always advance state once */
uint64 next = xoroshiro128ss(state);
uint64 val;

if (range >= 2)
{
uint64 mask = mask_u64(range-1);

val = next & mask;

if (val >= range)
{
/* copy and update current prng state */
pg_prng_state iterator = *state;

iterator.s0 ^= next;
iterator.s1 += UINT64CONST(0x9E3779B97f4A7C15);

/* iterate till val in [0, range) */
while ((val = xoroshiro128ss(&iterator) & mask) >= range)
;
}
}
else
val = 0;

return val;
 }

The initial pseudo-random sequence is left to proceed, and a new PRNG
is basically forked for iterating on the mask, if needed.


I believe most "range" values are small, much smaller than UINT32_MAX.
In this case, according to [1] the fastest method is Lemire's (I'd take
the original version from [2]).

Therefore a combined pg_prng_u64_range could branch on the range value:

uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
  uint64 val = xoroshiro128ss(state);
  uint64 m;

  if ((range & (range - 1)) == 0)  /* handle all power of 2 cases */
    return range != 0 ? val & (range - 1) : 0;
  if (likely(range < PG_UINT32_MAX / 32))
  {
    /*
     * Daniel Lemire's unbiased ranged-random algorithm based on rejection
     * sampling: https://lemire.me/blog/2016/06/30/fast-random-shuffling/
     */
    m = (uint32) val * range;
    if ((uint32) m < range)
    {
      /* threshold is 2^32 mod range, computed in 32-bit arithmetic */
      uint32 t = -(uint32) range % (uint32) range;

      while ((uint32) m < t)
        m = (uint32) xoroshiro128ss(state) * range;
    }
    return m >> 32;
  }
  /* Apple's mask method */
  m = mask_u64(range - 1);
  val &= m;
  while (val >= range)
    val = xoroshiro128ss(state) & m;
  return val;
}

The mask method could also be faster when the range is close to the mask.
For example, a fast check for "range is within 1/1024 of the mask" is
  range < (range/512 + (range & (range*2)))

And then the method choice could look like:
  if (likely(range < UINT32_MAX/32 && range > (range/512 + (range & (range*2)))))

But I don't know whether the additional condition is worth the difference.

[1] https://www.pcg-random.org/posts/bounded-rands.html
[2] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

regards,
Sokolov Yura




Re: rand48 replacement

2021-07-03 Thread Yura Sokolov

Fabien COELHO wrote 2021-07-03 11:45:

And a v5 where an unused test file does also compile if we insist.


About the patch:
1. PostgreSQL source uses `uint64` and `uint32`, not `uint64_t`/`uint32_t`.
2. I don't see why pg_prng_state could not be
   `typedef uint64 pg_prng_state[2];`
3. Then SamplerRandomState and pgbench RandomState could stay.
   The patch would be a lot shorter.
   I don't like mixing semantic refactoring and syntactic refactoring in
   the same patch. While I could agree with replacing
   `SamplerRandomState => pg_prng_state`, I'd rather see it in a separate
   commit. And that separate commit could contain the transition
   `typedef uint64 pg_prng_state[2];` =>
   `typedef struct { uint64 s0, s1 } pg_prng_state;`
4. There is no need for ReservoirStateData->randstate_initialized. There
   could be a macro/function:
   `bool pg_prng_initiated(state) { return (state[0]|state[1]) != 0; }`
5. Is there a need for a 128-bit PRNG at all? At least it is 2*64 bit.
   There is a 2*32-bit xoroshiro64:
   https://prng.di.unimi.it/xoroshiro64starstar.c
   And there is a 4*32-bit xoshiro128:
   https://prng.di.unimi.it/xoshiro128plusplus.c
   32-bit operations are faster on 32-bit platforms,
   but 32-bit platforms are quite rare in production these days.
   Therefore I don't have a strong opinion on this.

regards,
Sokolov Yura.




Re: Extensible storage manager API - smgr hooks

2021-06-29 Thread Yura Sokolov

Anastasia Lubennikova wrote 2021-06-30 00:49:

Hi, hackers!

Many recently discussed features can make use of an extensible storage
manager API. Namely, storage level compression and encryption [1],
[2], [3], disk quota feature [4], SLRU storage changes [5], and any
other features that may want to substitute PostgreSQL storage layer
with their implementation (i.e. lazy_restore [6]).

Attached is a proposal to change smgr API to make it extensible.  The
idea is to add a hook for plugins to get control in smgr and define
custom storage managers. The patch replaces smgrsw[] array and smgr_sw
selector with smgr() function that loads f_smgr implementation.

As before it has only one implementation - smgr_md, which is wrapped
into smgr_standard().

To create custom implementation, a developer needs to implement smgr
API functions
static const struct f_smgr smgr_custom =
{
.smgr_init = custominit,
...
}

create a hook function

   const f_smgr * smgr_custom(BackendId backend, RelFileNode rnode)
  {
  // Here we can also add some logic and choose which smgr to use
  // based on rnode and backend
  return &smgr_custom;
  }

and finally set the hook:
smgr_hook = smgr_custom;

[1]
https://www.postgresql.org/message-id/flat/11996861554042...@iva4-dd95b404a60b.qloud-c.yandex.net
[2]
https://www.postgresql.org/message-id/flat/272dd2d9.e52a.17235f2c050.Coremail.chjischj%40163.com
[3] https://postgrespro.com/docs/enterprise/9.6/cfs
[4]
https://www.postgresql.org/message-id/flat/CAB0yre%3DRP_ho6Bq4cV23ELKxRcfhV2Yqrb1zHp0RfUPEWCnBRw%40mail.gmail.com
[5]
https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com
[6]
https://wiki.postgresql.org/wiki/PGCon_2021_Fun_With_WAL#Lazy_Restore

--

Best regards,
Lubennikova Anastasia


Good day, Anastasia.

I also think smgr should be extended with different implementations
aside from md. But how will a concrete implementation be chosen for a
particular relation? I believe it should be an (immutable!) property of
the tablespace, and it should be passed to smgropen. The patch in its
current state doesn't show a clear way to distinguish different
implementations per relation.

I don't think the patch should be that invasive. smgrsw could be a
pointer to an array instead of the static array it is now, and then
reln->smgr_which would keep its meaning. Yes, it then needs a way to
select a specific implementation, but something like a
`char smgr_name[NAMEDATALEN]` field with a linear search over the
(I believe) small smgrsw array should be enough; see the sketch below.
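
A minimal sketch of that idea, with hypothetical names (registered_smgr,
smgr_lookup) layered on the existing f_smgr struct, just to show the shape:

typedef struct registered_smgr
{
	char		smgr_name[NAMEDATALEN];	/* e.g. "md" */
	const f_smgr *funcs;
} registered_smgr;

static registered_smgr *smgrsw;		/* a pointer now, not a static array */
static int	smgrsw_len;

/* linear search is fine while the registered set stays small */
static int
smgr_lookup(const char *name)
{
	for (int i = 0; i < smgrsw_len; i++)
		if (strcmp(smgrsw[i].smgr_name, name) == 0)
			return i;			/* becomes reln->smgr_which */
	elog(ERROR, "unknown storage manager \"%s\"", name);
	return -1;				/* keep compiler quiet */
}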

Maybe I'm missing something?

regards,
Sokolov Yura.
From 90085398f5ecc90d6b7caa318bd3d5f2867ef95c Mon Sep 17 00:00:00 2001
From: anastasia 
Date: Tue, 29 Jun 2021 22:16:26 +0300
Subject: [PATCH] smgr_api.patch

Make smgr API pluggable. Add smgr_hook that can be used to define custom storage managers.
Remove smgrsw[] array and smgr_sw selector. Instead, smgropen() uses smgr() function to load
f_smgr implementation using smgr_hook.

Also add smgr_init_hook and smgr_shutdown_hook.
And a lot of mechanical changes in smgr.c functions.
---
 src/backend/storage/smgr/smgr.c | 136 ++--
 src/include/storage/smgr.h  |  56 -
 2 files changed, 116 insertions(+), 76 deletions(-)

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..5f1981a353 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -26,47 +26,8 @@
 #include "utils/hsearch.h"
 #include "utils/inval.h"
 
-
-/*
- * This struct of function pointers defines the API between smgr.c and
- * any individual storage manager module.  Note that smgr subfunctions are
- * generally expected to report problems via elog(ERROR).  An exception is
- * that smgr_unlink should use elog(WARNING), rather than erroring out,
- * because we normally unlink relations during post-commit/abort cleanup,
- * and so it's too late to raise an error.  Also, various conditions that
- * would normally be errors should be allowed during bootstrap and/or WAL
- * recovery --- see comments in md.c for details.
- */
-typedef struct f_smgr
-{
-	void		(*smgr_init) (void);	/* may be NULL */
-	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_open) (SMgrRelation reln);
-	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
-bool isRedo);
-	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_unlink) (RelFileNodeBackend rnode, ForkNumber forknum,
-bool isRedo);
-	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-BlockNumber blocknum, char *buffer, bool skipFsync);
-	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-			  BlockNumber blocknum, char *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber 

Re: Add PortalDrop in exec_execute_message

2021-06-10 Thread Yura Sokolov

Alvaro Herrera wrote 2021-06-08 00:07:

On 2021-May-27, Yura Sokolov wrote:


Alvaro Herrera wrote 2021-05-26 23:59:



> I don't think they should do that.  The portal remains open, and the
> libpq interface does that.  The portal gets closed at end of transaction
> without the need for any client message.  I think if the client wanted
> to close the portal ahead of time, it would need a new libpq entry point
> to send the message to do that.

- PQsendQuery issues a Query message, and exec_simple_query closes its
  portal.
- People don't expect PQsendQueryParams to be different from
  PQsendQuery aside from parameter sending. The fact that the portal
  remains open is a significant, unexpected and undesired difference.
- PQsendQueryGuts is used in PQsendQueryParams and PQsendQueryPrepared.
  It always sends an empty portal name and always a "send me all rows"
  limit (zero). Both usages clearly intend to just run the query, and
  certainly no one expects the portal to remain open.


Thinking about it some more, Yura's argument about PQsendQuery does make
sense -- since what we're doing is replacing the use of a 'Q' message
just because we can't use it when in pipeline mode, then it is
reasonable to think that the replacement ought to have the same
behavior.  Upon receipt of a 'Q' message, the portal is closed
automatically, and ISTM that that behavior should be preserved.

That change would not solve the problem he complains about, because IIUC
his framework is using PQsendQueryPrepared, which I'm not proposing to
change.  It just removes the other discrepancy that was discussed in the
thread.

The attached patch does it.  Any opinions?


I propose to change PQsendQueryParams and PQsendQueryPrepared
(through a change of PQsendQueryGuts) since they both have the semantics
"execute the unnamed portal till the end and send me all rows".

The extended protocol was introduced by Tom Lane on 2003-05-05
in 16503e6fa4a13051debe09698b6db9ce0d509af8.
That commit already has the Close ('C') message.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=16503e6fa4a13051debe09698b6db9ce0d509af8

The libpq adoption of the extended protocol was made by Tom a month later,
on 2003-06-23, in efc3a25bb02ada63158fe7006673518b005261ba,
and there is already no Close message in PQsendQueryParams.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=efc3a25bb02ada63158fe7006673518b005261ba

I didn't find any relevant discussion in pgsql-hackers in May
and June 2003.

This makes me think the Close message was intended to be used
but was simply forgotten when the libpq patch was made.

Tom, could that be right?

regards,
Yura.




Re: Clear empty space in a page.

2021-06-01 Thread Yura Sokolov

Hi,

Andres Freund wrote 2021-05-31 00:07:

Hi,

On 2021-05-30 03:10:26 +0300, Yura Sokolov wrote:
While this result does not apply directly to stock PostgreSQL, I believe
page compression is important for full_page_writes with wal_compression
enabled, and probably when PostgreSQL is used on a filesystem with
compression enabled (ZFS?).


I don't think the former is relevant, because the hole is skipped in WAL
page compression (at some cost).


Ah, I forgot about that. Yep, you are right.


Therefore I propose clearing a page's empty space with zeros in
PageRepairFragmentation, PageIndexMultiDelete, PageIndexTupleDelete and
PageIndexTupleDeleteNoCompact.

Sorry, I didn't measure the impact on raw performance yet.


I'm worried that this might cause O(n^2) behaviour in some cases, by
repeatedly memset'ing the same mostly already zeroed space to 0. Why do we
ever need to do memset_hole() instead of accurately just zeroing out the
space that was just vacated?


It is done exactly this way: memset_hole accepts "old_pd_upper" and clears
the space between the old and the new one.

regards,
Yura




Clear empty space in a page.

2021-05-29 Thread Yura Sokolov

Good day.

A long time ago I played with a proprietary "compressed storage" patch on a
heavily updated table, and found that empty pages (i.e. cleaned by vacuum)
are not compressed well enough.

When a table is stress-updated, pages for new row versions are allocated in
a round-robin fashion, therefore some 1GB segments contain almost no live
tuples. Vacuum removes dead tuples, but the segments remain large after
compression (>400MB) as if they were still full.

After some investigation I found it is because PageRepairFragmentation and
PageIndex*Delete* don't clear the space that has just become empty, so it
still contains garbage data. Clearing it with memset greatly increases the
compression ratio: some compressed relation segments become 30-60MB right
after vacuum removes the tuples in them.

While this result does not apply directly to stock PostgreSQL, I believe
page compression is important for full_page_writes with wal_compression
enabled, and probably when PostgreSQL is used on a filesystem with
compression enabled (ZFS?).

Therefore I propose clearing a page's empty space with zeros in
PageRepairFragmentation, PageIndexMultiDelete, PageIndexTupleDelete and
PageIndexTupleDeleteNoCompact.

Sorry, I didn't measure the impact on raw performance yet.

regards,
Yura Sokolov aka funny_falcon
commit 6abfcaeb87fcb396c5e2dccd434ce2511314ff76
Author: Yura Sokolov 
Date:   Sun May 30 02:39:17 2021 +0300

Clear empty space in a page

Write zeroes to just cleared space in PageRepairFragmentation,
PageIndexTupleDelete, PageIndexMultiDelete and PageIndexDeleteNoCompact.

It helps increase the compression ratio on compression-enabled filesystems
and with full_page_writes and wal_compression enabled.

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 82ca91f5977..7deb6cc71a4 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -681,6 +681,17 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
 	phdr->pd_upper = upper;
 }
 
+/*
+ * Clean up space between pd_lower and pd_upper for better page compression.
+ */
+static void
+memset_hole(Page page, LocationIndex old_pd_upper)
+{
+	PageHeader	phdr = (PageHeader) page;
+	if (phdr->pd_upper > old_pd_upper)
+		MemSet((char *)page + old_pd_upper, 0, phdr->pd_upper - old_pd_upper);
+}
+
 /*
  * PageRepairFragmentation
  *
@@ -797,6 +808,7 @@ PageRepairFragmentation(Page page)
 
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
+	memset_hole(page, pd_upper);
 
 	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
@@ -1114,6 +1126,7 @@ PageIndexTupleDelete(Page page, OffsetNumber offnum)
 
 	if (offset > phdr->pd_upper)
 		memmove(addr + size, addr, offset - phdr->pd_upper);
+	MemSet(addr, 0, size);
 
 	/* adjust free space boundary pointers */
 	phdr->pd_upper += size;
@@ -1271,6 +1284,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 		compactify_tuples(itemidbase, nused, page, presorted);
 	else
 		phdr->pd_upper = pd_special;
+	memset_hole(page, pd_upper);
 }
 
 
@@ -1351,6 +1365,7 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
 
 	if (offset > phdr->pd_upper)
 		memmove(addr + size, addr, offset - phdr->pd_upper);
+	MemSet(addr, 0, size);
 
 	/* adjust free space boundary pointer */
 	phdr->pd_upper += size;


Re: Add PortalDrop in exec_execute_message

2021-05-27 Thread Yura Sokolov

Tom Lane wrote 2021-05-27 00:19:

Alvaro Herrera  writes:
(I didn't add a Close Portal message to PQsendQueryInternal in pipeline
mode precisely because there is no such message in PQsendQueryGuts.
I think it would be wrong to unconditionally add a Close Portal message
to any of those places.)


Yeah, I'm not very comfortable with having libpq take it on itself
to do that, either.


But...

Tom Lane wrote 2021-05-21 21:23:

I'm inclined to think that your complaint would be better handled
by having the client send a portal-close command, if it's not
going to do something else immediately.


And given that PQsendQueryParams should not be different from
PQsendQuery (aside from parameter sending), why shouldn't it close
its portal immediately, as happens in exec_simple_query?

I really doubt a user of PQsendQueryPrepared is aware of the portal
either, since it is also unnamed and also exhausted (because
PQsendQueryGuts always sends the "send me all rows" limit).

And why should PQsendQueryInternal behave differently in pipelined
and non-pipelined mode? It closes the portal in non-pipelined mode,
but will not close the portal of the last query in pipelined mode
(inside a transaction).


Looking back at the original complaint, it seems like it'd be fair to
wonder why we're still holding a page pin in a supposedly completed
executor run.  Maybe the right fix is somewhere in the executor
scan logic.


Perhaps because the query is simple and the portal is created as seekable?



regards, tom lane


regards
Yura Sokolov




Re: Add PortalDrop in exec_execute_message

2021-05-27 Thread Yura Sokolov

Alvaro Herrera wrote 2021-05-26 23:59:

On 2021-May-25, Yura Sokolov wrote:


Tom Lane wrote 2021-05-21 21:23:
> Yura Sokolov  writes:
> > I propose to add PortalDrop at the 'if (completed)' branch of
> > exec_execute_message.
>
> This violates our wire protocol specification, which
> specifically says
>
> If successfully created, a named portal object lasts till the end of
> the current transaction, unless explicitly destroyed. An unnamed
> portal is destroyed at the end of the transaction, or as soon as the
> next Bind statement specifying the unnamed portal as destination is
> issued. (Note that a simple Query message also destroys the unnamed
> portal.)
>
> I'm inclined to think that your complaint would be better handled
> by having the client send a portal-close command, if it's not
> going to do something else immediately.

I thought about it as well. Then, if I understand correctly,
PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send
"close portal" (CP) message after "execute" message, right?


I don't think they should do that.  The portal remains open, and the
libpq interface does that.  The portal gets closed at end of transaction
without the need for any client message.  I think if the client wanted
to close the portal ahead of time, it would need a new libpq entry point
to send the message to do that.


- PQsendQuery issues a Query message, and exec_simple_query closes its
  portal.
- People don't expect PQsendQueryParams to be different from
  PQsendQuery aside from parameter sending. The fact that the portal
  remains open is a significant, unexpected and undesired difference.
- PQsendQueryGuts is used in PQsendQueryParams and PQsendQueryPrepared.
  It always sends an empty portal name and always a "send me all rows"
  limit (zero). Both usages clearly intend to just run the query, and
  certainly no one expects the portal to remain open.


(I didn't add a Close Portal message to PQsendQueryInternal in pipeline
mode precisely because there is no such message in PQsendQueryGuts.


But PQsendQueryInternal should replicate the behavior of PQsendQuery,
not PQsendQueryParams. Even though it has to use the new protocol, the
result should be indistinguishable to the user; therefore the portal
should be closed.


I think it would be wrong to unconditionally add a Close Portal message
to any of those places.)


Why? If you foresee problems, please share your thoughts.

regards,
Sokolov Yura aka funny_falcon




Re: Add PortalDrop in exec_execute_message

2021-05-24 Thread Yura Sokolov

Tom Lane wrote 2021-05-21 21:23:

Yura Sokolov  writes:

I propose to add PortalDrop at the 'if (completed)' branch of
exec_execute_message.


This violates our wire protocol specification, which
specifically says

If successfully created, a named portal object lasts till the end of
the current transaction, unless explicitly destroyed. An unnamed
portal is destroyed at the end of the transaction, or as soon as the
next Bind statement specifying the unnamed portal as destination is
issued. (Note that a simple Query message also destroys the unnamed
portal.)

I'm inclined to think that your complaint would be better handled
by having the client send a portal-close command, if it's not
going to do something else immediately.


I thought about it as well. Then, if I understand correctly,
PQsendQueryGuts and PQsendQueryInternal in pipeline mode should send
"close portal" (CP) message after "execute" message, right?

regards,
Sokolov Yura




Add PortalDrop in exec_execute_message

2021-05-19 Thread Yura Sokolov

Hi, hackers.

I've been playing with the "autoprepared" patch, and got the isolation
"freeze-the-dead" test stuck on the first VACUUM FREEZE statement.
After some research I found the issue is reproduced with the unmodified
master branch if the extended protocol is used. I've prepared a ruby script
for demonstration (since ruby-pg has a simple interface to PQsendQueryParams).

Further investigation showed it happens because the portal is not dropped
inside exec_execute_message, and it is kept in the third session till
COMMIT is called. Therefore a heap page remains pinned, and VACUUM FREEZE
becomes locked inside LockBufferForCleanup.

It seems that this is usually invisible to common users since either:
- the command is called standalone and then the transaction is closed
  immediately, or
- the next PQsendQueryParams will initiate another unnamed portal using
  CreatePortal("", true, true) and this action will drop the previous
  one.

But "freeze-the-dead" remains locked since the third session cannot
send COMMIT until VACUUM FULL finishes.

I propose to add PortalDrop at the 'if (completed)' branch of
exec_execute_message.

--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2209,6 +2209,8 @@ exec_execute_message(const char *portal_name, long max_rows)


if (completed)
{
+   PortalDrop(portal, false);
+
if (is_xact_command)
{

With this change 'make check-world' runs without flaws (at least
on an empty configure with enable-cassert and enable-tap-tests).

There is a small chance that applications exist which abuse seekable
portals with the 'execute' protocol message, so not every completed
portal can be safely dropped. But I believe there are sane conditions
that cover the common case. For example, isn't an empty-name check
enough? Can a client reset or seek a portal with an empty name?

regards,
Sokolov Yura aka funny_falcon
require 'pg'

c1 = PG.connect(host: "localhost", dbname: "postgres")
c2 = PG.connect(host: "localhost", dbname: "postgres")
c3 = PG.connect(host: "localhost", dbname: "postgres")

class PG::Connection
  def simple(sql)
puts sql
exec(sql)
  end
  def extended(sql)
puts "#{sql}"
exec_params(sql, [])
  end
end

c1.simple "DROP TABLE IF EXISTS tab_freeze" 
c1.simple "
CREATE TABLE tab_freeze (
  id int PRIMARY KEY,
  name char(3),
  x int);
INSERT INTO tab_freeze VALUES (1, '111', 0);
INSERT INTO tab_freeze VALUES (3, '333', 0);
"

c1.simple "BEGIN"
c2.simple "BEGIN"
c3.simple "BEGIN"
c1.simple "UPDATE tab_freeze SET x = x + 1 WHERE id = 3"
c2.extended "SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE"
c3.extended "SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE"
c1.simple "COMMIT"
c2.simple "COMMIT"
c2.simple "VACUUM FREEZE tab_freeze"
c1.simple "
BEGIN;
SET LOCAL enable_seqscan = false;
SET LOCAL enable_bitmapscan = false;
SELECT * FROM tab_freeze WHERE id = 3;
COMMIT;
"
c3.simple "COMMIT"
c2.simple "VACUUM FREEZE tab_freeze"
c1.simple "SELECT * FROM tab_freeze ORDER BY name, id"

