Re: Polyphase merge is obsolete

2021-10-05 Thread John Naylor
On Wed, Sep 15, 2021 at 5:35 PM Heikki Linnakangas  wrote:
> Thanks, here's another rebase.
>
> - Heikki

I've had a chance to review and test out the v5 patches.

0001 is a useful simplification. Nothing in 0002 stood out as needing
comment.

I've done some performance testing of master versus both patches applied.
The full results and test script are attached, but I'll give a summary
here. A variety of value distributions were tested, with work_mem from 1MB
to 16MB, plus 2GB which will not use external sort at all. I settled on 2
million records for the sort, to have something large enough to work with
but also keep the test time reasonable. That works out to about 130MB on
disk. We have recent improvements to datum sort, so I used both single
values and all values in the SELECT list.

The system was on a Westmere-era Xeon with gcc 4.8. pg_prewarm was run on
the input tables. The raw measurements were reduced to the minimum of five
runs.

I can confirm that sort performance is improved with small values of
work_mem. That was not the motivating reason for the patch, but it's a nice
bonus. Even as high as 16MB work_mem, it's possible some of the 4-6%
differences represent real improvement and not just noise or binary layout
effects, but the gains are much more convincing at 4MB and below, reaching
25-30% for non-datum integer sorts at 1MB work_mem. The nominal regressions
seem to be within the noise level, with one exception that showed up in only
one set of measurements (-10.89% in the spreadsheet). I'm not sure what to
make of that, since it happens in only one combination of factors and
nowhere else.

On Sat, Sep 11, 2021 at 4:28 AM Zhihong Yu  wrote:
> + * Before PostgreSQL 14, we used the polyphase merge algorithm (Knuth's
> + * Algorithm 5.4.2D),
>
> I think the above 'Before PostgreSQL 14' should be 'Before PostgreSQL 15'
> now that PostgreSQL 14 has been released.
>
> +static int64
> +merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,
> +  int maxOutputTapes)
>
> For memory to allocate, I think uint64 can be used (instead of int64).

int64 is used elsewhere in this file, and I see no reason to do otherwise.

Aside from s/14/15/ for the version as noted above, I've marked it Ready
for Committer.

--
John Naylor
EDB: http://www.enterprisedb.com
set -e

ROWS=$1


function log {
	echo `date +%s` [`date +'%Y-%m-%d %H:%M:%S'`] $1
}

function create_tables {

	./inst/bin/psql > /dev/null  < /dev/null  < /dev/null  < /dev/null  < 16384 AND relkind = 'r';
EOF

}

# load test tables

function load_random {

	truncate_tables

	log "loading random"

	./inst/bin/psql > /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < /dev/null  < 0.5 
	then ''
	else ''
	end
FROM data_int;
EOF

	prewarm

	vacuum_analyze

}

# run

function run_query {

	times=""
	group=$1
	wmem=$2
	workers=$3
	query=$4

	echo "--" `date` [`date +%s`] >> explains.log
	echo "-- $group rows=$ROWS work_mem=$wmem workers=$workers" >> explains.log

	./inst/bin/psql >> explains.log 2>&1 < /dev/null <> results.csv
}


function run_queries {

	for wm in '1MB' '2MB' '4MB' '8MB' '16MB' '2GB'; do

		log "running queries work_mem=$wm noparallel"

		run_query $1 $wm 0 'SELECT a FROM int_test ORDER BY 1 OFFSET '$ROWS''
		run_query $1 $wm 0 'SELECT * FROM int_test ORDER BY 1 OFFSET '$ROWS''

		run_query $1 $wm 0 'SELECT a FROM txt_test ORDER BY 1 OFFSET '$ROWS''
		run_query $1 $wm 0 'SELECT * FROM txt_test ORDER BY 1 OFFSET '$ROWS''

	done

}

create_tables

log "sort benchmark $ROWS"

load_random
run_queries "random"

load_sorted
run_queries "sorted"

load_almost_sorted
run_queries "almost_sorted"

load_reversed
run_queries "reversed"

load_organ_pipe
run_queries "organ_pipe"

load_0_1
run_queries "0_1"


kway-merge-test-1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: [RFC] building postgres with meson

2021-10-22 Thread John Naylor
On Thu, Oct 21, 2021 at 5:48 PM Andres Freund  wrote:

> However, update-unicode is a bit harder.  Partially not directly because of
> meson, but because update-unicode as-is afaict doesn't support VPATH builds,
> and meson enforces those.

> make update-unicode
> ...
> make -C src/common/unicode update-unicode
> '/usr/bin/perl' generate-unicode_norm_table.pl
> Can't open perl script "generate-unicode_norm_table.pl": No such file or
> directory
>
> It's not too hard to fix. See attached for the minimal stuff that I
> immediately found to be needed.

Thanks for doing that, it works well enough for demonstration. With your
patch, and using an autoconf VPATH build, the unicode tables work fine, but
it complains of a permission error in generate_unaccent_rules.py. That
seems to be because the script is invoked directly rather than as an
argument to the python interpreter.

> The slightly bigger issue making update-unicode work with meson is that
> meson doesn't provide support for invoking build targets in specific
> directories (because it doesn't map nicely to e.g. msbuild). But scripts
> like src/common/unicode/generate-unicode_norm_table.pl rely on CWD. It's
> not hard to work around that, but IMO it's better for such scripts to not
> rely on CWD.

Yeah. I encountered a further issue: With autoconf on HEAD, with a source
tree build executed in contrib/unaccent:

$ touch generate_unaccent_rules.py
$ make update-unicode
generate_unaccent_rules.py --unicode-data-file
../../src/common/unicode/UnicodeData.txt --latin-ascii-file Latin-ASCII.xml
>unaccent.rules
/bin/sh: generate_unaccent_rules.py: command not found
make: *** [unaccent.rules] Error 127
make: *** Deleting file `unaccent.rules'

...so in this case it seems not to know to use CWD here.

Anyway, this can be put off until the very end, since it's not run often.
You've demonstrated how these targets would work, and that's good enough
for now.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-12-17 Thread John Naylor
I plan to push v25 early next week, unless there are further comments.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up text_position() for utf-8

2021-12-17 Thread John Naylor
Attached is a short patch series to develop some ideas of inlining
pg_utf_mblen().

0001 puts the main implementation of pg_utf_mblen() into an inline
function and uses this in pg_mblen(). This is somewhat faster in the
strpos tests, so that gives some measure of the speedup expected for
other callers. Text search seems to call this a lot, so this might
have noticeable benefit.
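
To make the idea concrete, an inline helper along these lines follows the
standard UTF-8 lead-byte ranges; this is only an illustrative sketch with an
invented name, not the code from 0001:

```c
/* Sketch of an inline UTF-8 sequence-length helper. Like pg_utf_mblen(),
 * it decides the length from the lead byte alone; an invalid lead byte
 * is treated as a single-byte sequence. */
static inline int
utf8_seq_len(const unsigned char *s)
{
	if ((*s & 0x80) == 0)
		return 1;				/* 0xxxxxxx: ASCII */
	else if ((*s & 0xE0) == 0xC0)
		return 2;				/* 110xxxxx */
	else if ((*s & 0xF0) == 0xE0)
		return 3;				/* 1110xxxx */
	else if ((*s & 0xF8) == 0xF0)
		return 4;				/* 11110xxx */
	return 1;					/* invalid lead byte */
}
```

Since the function body is visible at every call site, the compiler can fold
it into callers such as a pg_mblen()-style dispatcher for the UTF-8 case.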

0002 refactors text_position_get_match_pos() to use
pg_mbstrlen_with_len(). This itself is significantly faster when
combined with 0001, likely because the latter can inline the call to
pg_mblen(). The intention is to speed up more than just text_position.

0003 explicitly specializes for the inline version of pg_utf_mblen()
into pg_mbstrlen_with_len(), but turns out to be almost as slow as
master for ascii. It doesn't help if I undo the previous change in
pg_mblen(), and I haven't investigated why yet.

0002 looks good now, but the experience with 0003 makes me hesitant to
propose this seriously until I can figure out what's going on there.

The test is as earlier, a worst-case substring search, times in milliseconds.

 patch  | no match | ascii | multibyte
--------+----------+-------+-----------
 PG11   | 1220 |  1220 |  1150
 master |  385 |  2420 |  1980
 0001   |  390 |  2180 |  1670
 0002   |  389 |  1330 |  1100
 0003   |  391 |  2100 |  1360


-- 
John Naylor
EDB: http://www.enterprisedb.com


v2-0002-Refactor-text_position_get_match_pos-to-use-pg_mb.patch
Description: Binary data


v2-0003-Specialize-pg_mbstrlen_with_len-for-UTF-8.patch
Description: Binary data


v2-0001-Move-the-implementation-of-pg_utf_mblen-to-an-inl.patch
Description: Binary data


Re: do only critical work during single-user vacuum?

2021-12-22 Thread John Naylor
On Tue, Dec 21, 2021 at 10:39 PM Masahiko Sawada  wrote:
>
> On Wed, Dec 22, 2021 at 6:56 AM Peter Geoghegan  wrote:
> >
> > This new command/facility should probably not be a new flag to the
> > VACUUM command, as such. Rather, I think that it should either be an
> > SQL-callable function, or a dedicated top-level command (that doesn't
> > accept any tables). The only reason to have this is for scenarios
> > where the user is already in a tough spot with wraparound failure,
> > like that client of yours. Nobody wants to force the failsafe for one
> > specific table. It's not general purpose, at all, and shouldn't claim
> > to be.
>
> Even not in the situation where the database has to run as the
> single-user mode to freeze tuples, I think there would be some use
> cases where users want to run vacuum (in failsafe mode) on tables with
> relfrozenxid/relminmxid greater than their failsafe thresholds before
> falling into that situation. I think it’s common that users are
> somehow monitoring relfrozenxid/relminmxid and want to manually run
> vacuum on them rather than relying on autovacuums. --min-xid-age
> option and --min-mxid-age option of vacuumdb command would be good
> examples. So I think this new command/facility might not necessarily
> need to be specific to single-user mode.

If we want to leave open the possibility to specify these parameters,
a SQL-callable function seems like the way to go. And even if we
don't, a function is fine.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2021-12-21 Thread John Naylor
On Tue, Dec 21, 2021 at 5:56 PM Peter Geoghegan  wrote:
>
> On Tue, Dec 21, 2021 at 1:31 PM John Naylor
>  wrote:
> > On second thought, we don't really need another number here. We could
> > simply go by the existing failsafe parameter, and if the admin wants a
> > different value, it's already possible to specify
> > vacuum_failsafe_age/vacuum_multixact_failsafe_age in a session,
> > including in single-user mode. Perhaps a new boolean called
> > FAILSAFE_ONLY. If no table is specified, then when generating the list
> > of tables, include only those with relfrozenxid/relminmxid greater
> > than their failsafe thresholds.
>
> That's equivalent to the quick and dirty patch I wrote (assuming that
> the user actually uses this new FAILSAFE_ONLY thing).

Equivalent but optional.

> But if we're going to add a new option to the VACUUM command (or
> something of similar scope), then we might as well add a new behavior
> that is reasonably exact -- something that (say) only *starts* a
> VACUUM for those tables whose relfrozenxid age currently exceeds half
> the autovacuum_freeze_max_age for the table (usually taken from the
> GUC, sometimes taken from the reloption), which also forces the
> failsafe. And with similar handling for
> relminmxid/autovacuum_multixact_freeze_max_age.
>
> In other words, while triggering the failsafe is important, simply *not
> starting* VACUUM for relations where there is really no need for it is
> at least as important. We shouldn't even think about pruning or
> freezing with these tables.

Right, not starting where not necessary is crucial for getting out of
single-user mode as quickly as possible. We're in agreement there.

> (ISTM that the only thing that might be a
> bit controversial about any of this is my definition of "safe", which
> seems like about the right trade-off to me.)

It seems reasonable to want to start back up and not immediately have
anti-wraparound vacuums kick in. On the other hand, it's not good to
do work while unable to monitor progress, and while more WAL is piling
up. I'm not sure where the right trade off is.

> This new command/facility should probably not be a new flag to the
> VACUUM command, as such. Rather, I think that it should either be an
> SQL-callable function, or a dedicated top-level command (that doesn't
> accept any tables). The only reason to have this is for scenarios
> where the user is already in a tough spot with wraparound failure,
> like that client of yours. Nobody wants to force the failsafe for one
> specific table. It's not general purpose, at all, and shouldn't claim
> to be.

Makes sense, I'll have a think about what that would look like.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2021-12-21 Thread John Naylor
On Tue, Dec 21, 2021 at 3:56 AM Masahiko Sawada  wrote:
>
> On Tue, Dec 21, 2021 at 1:53 PM Peter Geoghegan  wrote:
> >
> > On Mon, Dec 20, 2021 at 8:40 PM Masahiko Sawada  
> > wrote:
> > > BTW a vacuum automatically enters failsafe mode under the situation
> > > where the user has to run a vacuum in the single-user mode, right?
> >
> > Only for the table that had the problem. Maybe there are no other
> > tables that a database level "VACUUM" will need to spend much time on,
> > or maybe there are, and they will make it take much much longer (it
> > all depends).
> >
> > The goal of the patch is to make sure that when we're in single user
> > mode, we'll consistently trigger the failsafe, for every VACUUM
> > against every table -- not just the table (or tables) whose
> > relfrozenxid is very old. That's still naive, but much less naive than
> > simply telling users to VACUUM the whole database in single user mode
> > while vacuuming indexes, etc.
>
> I understand the patch, thank you for the explanation!
>
> I remember Simon proposed a VACUUM command option[1], called
> FAST_FREEZE, to turn off index cleanup and heap truncation. Now that
> we have failsafe mechanism probably we can have a VACUUM command
> option to turn on failsafe mode instead.

I've been thinking a bit more about this, and I see two desirable
goals of anti-wraparound vacuum in single-user mode:

1. Get out of single-user mode as quickly as possible.

2. Minimize the catch-up work we have to do once we're out.

Currently, a naive vacuum does as much work as possible and leaves a
bunch of WAL streaming and archiving work for later, so that much is
easy to improve upon and we don't have to be terribly sophisticated.
Keeping in mind Andres' point that we don't want to force possibly
unwanted behavior just because we're in single-user mode, it makes
sense to have some kind of option that has the above two goals.
Instead of a boolean, it seems like the new option should specify some
age below which VACUUM will skip the table entirely, and above which
will enter fail-safe mode. As mentioned earlier, the shutdown hint
could spell out the exact command. With this design, it would specify
the fail-safe default, or something else, to use with the option. That
seems doable for v15 -- any thoughts on that approach?

In standard operation, the above goals could be restated as "advance
xmin as quickly as possible" and "generate as little future
'work/debt' as possible, whether dirty pages or WAL". There are some
more sophisticated things we can do in this regard, but something like
the above could also be useful in normal operation. In fact, that
"normal" could be just after we restarted after doing the bare-minimum
in single-user mode, and want to continue freezing and keep things
under control.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2021-12-21 Thread John Naylor
I wrote:

> Instead of a boolean, it seems like the new option should specify some
> age below which VACUUM will skip the table entirely, and above which
> will enter fail-safe mode. As mentioned earlier, the shutdown hint
> could spell out the exact command. With this design, it would specify
> the fail-safe default, or something else, to use with the option.

On second thought, we don't really need another number here. We could
simply go by the existing failsafe parameter, and if the admin wants a
different value, it's already possible to specify
vacuum_failsafe_age/vacuum_multixact_failsafe_age in a session,
including in single-user mode. Perhaps a new boolean called
FAILSAFE_ONLY. If no table is specified, then when generating the list
of tables, include only those with relfrozenxid/relminmxid greater
than their failsafe thresholds.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode

2022-01-03 Thread John Naylor
e can be used on both platforms.

> I have updated the makefile to include the nasm command and the nasm flags, 
> but I need help to make these based on configure.
>
> I also have no knowledge on other operating systems (MAC-OS etc.)
>
> The calling conventions can be easily adopted if they differ but somebody 
> else should jump in for testing.

As I implied earlier, this is way too low-level. If we have to worry
about obj formats and calling conventions, we'd better be getting
something *really* amazing in return.

> But I really need help by an expert to integrate it in the perl building 
> process.

> I would much appreciate if someone else could jump in for a patch to 
> configure-integration and another patch for .vcxobj integration.

It's a bit presumptuous to enlist others for specific help without
general agreement on the design, especially on the most tedious parts.
Also, here's a general engineering tip: If the non-fun part is too
complex for you to figure out, that might indicate the fun part is too
ambitious. I suggest starting with a simple patch with SSE2 (always
present on x86-64) intrinsics, one that anyone can apply and test
without any additional work. Then we can evaluate if the speed-up in
the hex encoding case is worth some additional complexity. As part of
that work, it might be good to see if some portable improved algorithm
is already available somewhere.

> There is much room for other implementations (checksum verification/setting, 
> aggregation, numeric datatype, merging, generate_series, integer and floating 
> point output …) which could be addressed later on.

Float output has already been pretty well optimized. CRC checksums
already have a hardware implementation on x86 and Arm. I don't know of
any practical workload where generate_series() is too slow.
Aggregation is an interesting case, but I'm not sure what the current
bottlenecks are.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up verifying UTF-8

2021-12-20 Thread John Naylor
On Fri, Dec 17, 2021 at 9:29 AM John Naylor
 wrote:
>
> I plan to push v25 early next week, unless there are further comments.

Pushed, thanks everyone!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-11-23 Thread John Naylor
Hi,

I've looked over this patch set and email thread a couple times, and I
don't see anything amiss, but I'm also not terribly familiar with the
subsystems this part of the code relies on. I haven't yet tried to stress
test with a large database, but it seems like a good idea to do so.

I have a couple comments and questions:

0006:

+ * XXX We can optimize RelationMapOidToFileenodeForDatabase API
+ * so that instead of reading the relmap file every time, it can
+ * save it in a temporary variable and use it for subsequent
+ * calls.  Then later reset it once we're done or at the
+ * transaction end.

Do we really need to consider optimizing here? Only a handful of relations
will be found in the relmap, right?

+ * Once we start copying files from the source database, we need to be able
+ * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+ * happens.  (This is not a 100% solution, because of the possibility of
+ * failure during transaction commit after we leave this routine, but it
+ * should handle most scenarios.)

This comment in master started with

- * Once we start copying subdirectories, we need to be able to clean 'em

Is the distinction important enough to change this comment? Also, is "most
scenarios" still true with the patch? I haven't read into how ENSURE works.

Same with this comment change, seems fine the way it was:

- * Use an ENSURE block to make sure we remove the debris if the copy fails
- * (eg, due to out-of-disk-space).  This is not a 100% solution, because
- * of the possibility of failure during transaction commit, but it should
- * handle most scenarios.
+ * Use an ENSURE block to make sure we remove the debris if the copy fails.
+ * This is not a 100% solution, because of the possibility of failure
+ * during transaction commit, but it should handle most scenarios.

And do we need additional tests? Maybe we don't, but it seems good to make
sure.

I haven't looked at 0007, and I have no opinion on which approach is better.

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Parallel vacuum workers prevent the oldest xmin from advancing

2021-11-24 Thread John Naylor
On Wed, Nov 17, 2021 at 7:28 AM Amit Kapila  wrote:
>
> > The patch looks good to me. But I can't come up with a stable test for
> > this. It seems to be hard without stopping and resuming parallel
> > vacuum workers. Do you have any good idea?
> >
>
> No, let's wait for a day or so to see if anybody else has any ideas to
> write a test for this case, otherwise, I'll check these once again and
> push.

I set this "committed" in the CF app.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Non-decimal integer literals

2021-11-25 Thread John Naylor
Hi Peter,

0001

-/* we no longer allow unary minus in numbers.
- * instead we pass it separately to parser. there it gets
- * coerced via doNegate() -- Leon aug 20 1999
+/*
+ * Numbers
+ *
+ * Unary minus is not part of a number here.  Instead we pass it separately to
+ * parser, and there it gets coerced via doNegate().

If we're going to change the comment anyway, "the parser" sounds more
natural. Aside from that, 0001 and 0002 can probably be pushed now, if you
like. I don't have any good ideas about 0003 at the moment.

0005

--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -365,6 +365,10 @@ real ({integer}|{decimal})[Ee][-+]?{digit}+
 realfail1 ({integer}|{decimal})[Ee]
 realfail2 ({integer}|{decimal})[Ee][-+]

+integer_junk {integer}{ident_start}
+decimal_junk {decimal}{ident_start}
+real_junk {real}{ident_start}

A comment might be good here to explain these are only in ECPG for
consistency with the other scanners. Not really important, though.

0006

+{hexfail} {
+ yyerror("invalid hexadecimal integer");
+ }
+{octfail} {
+ yyerror("invalid octal integer");
  }
-{decimal} {
+{binfail} {
+ yyerror("invalid binary integer");
+ }

It seems these could use SET_YYLLOC(), since the error cursor doesn't match
other failure states:

+SELECT 0b;
+ERROR:  invalid binary integer at or near "SELECT 0b"
+LINE 1: SELECT 0b;
+^
+SELECT 1b;
+ERROR:  trailing junk after numeric literal at or near "1b"
+LINE 1: SELECT 1b;
+   ^

We might consider some tests for ECPG since lack of coverage has been a
problem.

Also, I'm curious: how does the spec work as far as deciding the year of
release, or feature-freezing of new items?
--
John Naylor
EDB: http://www.enterprisedb.com


do only critical work during single-user vacuum?

2021-12-09 Thread John Naylor
When a user must shut down and restart in single-user mode to run
vacuum on an entire database, that does a lot of work that's
unnecessary for getting the system online again, even without
index_cleanup. We had a recent case where a single-user vacuum took
around 3 days to complete.

Now that we have a concept of a fail-safe vacuum, maybe it would be
beneficial to skip a vacuum in single-user mode if the fail-safe
criteria were not met at the beginning of vacuuming a relation. This
is not without risk, of course, but it should be much faster than
today and once up and running the admin would have a chance to get a
handle on things. Thoughts?

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up verifying UTF-8

2021-12-08 Thread John Naylor
It occurred to me that the DFA + ascii quick check approach could also
be adapted to speed up some cases where we currently walk a string
counting characters, like this snippet in
text_position_get_match_pos():

/* Convert the byte position to char position. */
while (state->refpoint < state->last_match)
{
state->refpoint += pg_mblen(state->refpoint);
state->refpos++;
}

This coding changed in 9556aa01c69 (Use single-byte
Boyer-Moore-Horspool search even with multibyte encodings), in which I
found the majority of cases were faster, but some were slower. It
would be nice to regain the speed lost and do even better.

In the case of UTF-8, we could just run it through the DFA,
incrementing a count of the states found. The number of END states
should be the number of characters. The ascii quick check would still
be applicable as well. I think all that is needed is to export some
symbols and add the counting function. That wouldn't materially affect
the current patch for input verification, and would be separate, but
it would be nice to get the symbol visibility right up front. I've set
this to waiting on author while I experiment with that.
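
As a simpler point of comparison while experimenting: on input already known
to be valid UTF-8, the same count can be obtained by counting non-continuation
bytes, since every character contributes exactly one byte that is not of the
form 10xxxxxx. This is a different technique than the DFA state counting
described above, and the name here is invented:

```c
/* Count characters in a valid UTF-8 buffer by counting bytes that are
 * not continuation bytes (continuation bytes match 10xxxxxx). */
static int
utf8_char_count(const unsigned char *s, int len)
{
	int			count = 0;

	for (int i = 0; i < len; i++)
	{
		if ((s[i] & 0xC0) != 0x80)
			count++;
	}
	return count;
}
```

The loop body is branch-light and easy for compilers to vectorize, so it could
serve as a baseline against which to measure the DFA-based counter.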

--
John Naylor
EDB: http://www.enterprisedb.com




Re: cutting down the TODO list thread

2021-12-08 Thread John Naylor
It's been a while, but here are a few more suggested
removals/edits/additions to the TODO list. Any objections or new
information, let me know:

- Auto-fill the free space map by scanning the buffer cache or by
checking pages written by the background writer
http://archives.postgresql.org/pgsql-hackers/2006-02/msg01125.php
https://www.postgresql.org/message-id/200603011716.16984.pete...@gmx.net

Both these threads are from 2006, so have nothing to do with the current FSM.

- Allow concurrent inserts to use recently created pages rather than
creating new ones
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00853.php

Skimming the first few messages, I believe this has been covered by
commit 719c84c1b? (Extend relations multiple blocks at a time to
improve scalability.)

- Allow VACUUM FULL and CLUSTER to update the visibility map

This topic has a current CF entry which seems to have stalled, so that
newer thread would be better to list here than the one from 2013.

- Bias FSM towards returning free space near the beginning of the heap
file, in hopes that empty pages at the end can be truncated by VACUUM
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01124.php
https://www.postgresql.org/message-id/20150424190403.gp4...@alvh.no-ip.org

I'm not sure what to think of this, but independently of that, the
second thread is actually talking about bringing back something like
the pre-9.0 vacuum full, so maybe it should be its own entry?

- Consider a more compact data representation for dead tuple locations
within VACUUM
http://archives.postgresql.org/pgsql-patches/2007-05/msg00143.php

Great, but let's link to this more recent thread instead:
https://www.postgresql.org/message-id/flat/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com

- Improve autovacuum tuning
http://www.postgresql.org/message-id/5078ad6b.8060...@agliodbs.com
http://www.postgresql.org/message-id/20130124215715.ge4...@alvh.no-ip.org

I'm kind of on the fence about these. The title is way too broad, and
I doubt we are going to forget to keep improving this area.

It seems the first thread is really about auto-analyze thresholds, so
maybe it should be in a separate entry if we want to do anything
mentioned in the thread?

The second thread is really about autovacuum launcher scheduling.
Probably still relevant, but the thread is very long and doesn't seem
terribly helpful to someone trying to get up to speed on the issues
that are still relevant. I don't see any more recent discussion,
either. Thoughts?

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2021-12-09 Thread John Naylor
On Thu, Dec 9, 2021 at 5:13 PM Peter Geoghegan  wrote:
> Oh, I think I misunderstood. Your concern is for the case where the
> DBA runs a simple "VACUUM" in single-user mode; you want to skip over
> tables that don't really need to advance relfrozenxid, automatically.

Right.

> I can see an argument for something like that, but I think that it
> should be a variant of VACUUM. Or maybe it could be addressed with a
> better user interface;

On Thu, Dec 9, 2021 at 6:08 PM Andres Freund  wrote:
> What if the user tried to reclaim space by vacuuming (via truncation)? Or is
> working around some corruption or such?  I think this is too much magic.
>
> That said, having a VACUUM "selector" that selects the oldest tables could be
> quite useful. And address this usecase both for single-user and normal
> operation.

All good points.

[Peter again]
> single-user mode should prompt the user about
> what exact VACUUM command they ought to run to get things going.

The current message is particularly bad in its vagueness because some
users immediately reach for VACUUM FULL, which quite logically seems
like the most complete thing to do.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up verifying UTF-8

2021-12-13 Thread John Naylor
> > + * whose bit pattern encodes the state transitions. To compute the current
> > + * state, we simply right-shift the integer by the current state and apply a
> > + * mask. In this scheme, the address of the transition only depends on the
> > + * input byte, so there is better pipelining.
>
> Should be "To compute the *next* state, ...", I think.

Fixed.

> The way the state transition table works is pretty inscrutable. That's
> understandable, because the values were found by an SMT solver, so I'm
> not sure if anything can be done about it.

Do you mean in general, or just the state values?

Like any state machine, the code is simple, and the complexity is
hidden in the data. Hopefully the first link I included in the comment
is helpful.

The SMT solver was only needed to allow 32-bit (instead of 64-bit)
entries in the transition table, so it's not strictly necessary. A
lookup table that fits in 1kB is nice from a cache perspective,
however.

With 64-bit, the state values are less weird-looking but they're still
just arbitrary numbers. As long as ERR = 0 and the largest is at most
9, it doesn't matter what they are, so I'm not sure it's much less
mysterious. You can see the difference between 32-bit and 64-bit in
[1].
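
For anyone unfamiliar with the scheme, here is a toy two-state shift-based
DFA that tracks the parity of 'a' bytes in a string. It is purely
illustrative (the real transition constants for UTF-8 are different), but the
per-byte step is exactly the one used in the patch: look up a word by input
byte, shift by the current state, mask:

```c
#include <stdint.h>

/* States are encoded as shift amounts; 6 bits per state slot. */
enum
{
	S_EVEN = 0,					/* even number of 'a' bytes seen */
	S_ODD = 6					/* odd number of 'a' bytes seen */
};

static int
run_toy_dfa(const unsigned char *s, int len)
{
	/* Per-byte transition words; bits [state, state+5] hold the next
	 * state.  A real implementation indexes a 256-entry table instead. */
	const uint64_t t_a = ((uint64_t) S_ODD << S_EVEN) |
		((uint64_t) S_EVEN << S_ODD);
	const uint64_t t_other = ((uint64_t) S_EVEN << S_EVEN) |
		((uint64_t) S_ODD << S_ODD);
	uint64_t	state = S_EVEN;

	for (int i = 0; i < len; i++)
		state = ((s[i] == 'a' ? t_a : t_other) >> state) & 63;
	return (int) state;
}
```

Note that the shift amount depends only on the previous state and the word
only on the input byte, so the load can issue before the previous state is
known, which is where the pipelining benefit comes from.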

--
In addition to Heikki's review points, I've made a couple of small
additional changes from v24: I rewrote this part, so we don't need
these macros anymore:

-   if (!IS_HIGHBIT_SET(*s) ||
-   IS_UTF8_2B_LEAD(*s) ||
-   IS_UTF8_3B_LEAD(*s) ||
-   IS_UTF8_4B_LEAD(*s))
+   if (!IS_HIGHBIT_SET(*s) || pg_utf_mblen(s) > 1)

And I moved is_valid_ascii() to pg_wchar.h so it can be used
elsewhere. I'm not sure there's a better place to put it. I tried
using this for text_position(), for which I'll start a new thread.
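
The spirit of such an ascii quick check can be sketched as below; this is an
illustrative stand-in for a 16-byte chunk, not the actual is_valid_ascii():

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a word-at-a-time ASCII check over a 16-byte chunk: OR the
 * two 8-byte halves together and test the high bit of every byte. */
static bool
is_ascii_16(const unsigned char *s)
{
	uint64_t	half1,
				half2;

	memcpy(&half1, s, sizeof(half1));
	memcpy(&half2, s + 8, sizeof(half2));

	/* any byte with its high bit set fails the mask test */
	return ((half1 | half2) & UINT64_C(0x8080808080808080)) == 0;
}
```

When the check succeeds, the whole chunk can be accepted (or counted) without
entering the DFA at all, which is why keeping the helper visible in a header
is convenient for other callers.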

[1] 
https://www.postgresql.org/message-id/attachment/125672/v22-addendum-32-bit-transitions.txt



--
John Naylor
EDB: http://www.enterprisedb.com


From 0c20e37ac47a5759e6ec674bb4184d835c562af8 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 19 Oct 2021 16:43:14 -0400
Subject: [PATCH v25] Add fast path for validating UTF-8 text

Our previous validator used a traditional algorithm that performed
comparison and branching one byte at a time. It's useful in that
we always know exactly how many bytes we have validated, but that
precision comes at a cost. Input validation can show up prominently
in profiles of COPY FROM, and future improvements to COPY FROM such
as parallelism or faster line parsing will put more pressure on input
validation. Hence, add fast paths for both ASCII and multibyte UTF-8:

Use bitwise operations to check 16 bytes at a time for ASCII. If
that fails, use a "shift-based" DFA on those bytes to handle the
general case, including multibyte. These paths are relatively free
of branches and thus robust against all kinds of byte patterns. With
these algorithms, UTF-8 validation is several times faster, depending
on platform and the input byte distribution.

The previous coding in pg_utf8_verifystr() is retained for short
strings and for when the fast path returns an error.

Review, performance testing, and additional hacking by: Heikki
Linnakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and
Greg Stark

Discussion:
https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com
---
 src/common/wchar.c   | 215 +++
 src/include/mb/pg_wchar.h|  53 ++
 src/test/regress/expected/conversion.out | 169 ++
 src/test/regress/sql/conversion.sql  | 133 ++
 4 files changed, 570 insertions(+)

diff --git a/src/common/wchar.c b/src/common/wchar.c
index a6bffd0642..be931c5e92 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -1750,11 +1750,226 @@ pg_utf8_verifychar(const unsigned char *s, int len)
 	return l;
 }
 
+/*
+ * The fast path of the UTF-8 verifier uses a deterministic finite automaton
+ * (DFA) for multibyte characters. In a traditional table-driven DFA, the
+ * input byte and current state are used to compute an index into an array of
+ * state transitions. Since the address of the next transition is dependent
+ * on this computation, there is latency in executing the load instruction,
+ * and the CPU is not kept busy.
+ *
+ * Instead, we use a "shift-based" DFA as described by Per Vognsen:
+ *
+ * https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725
+ *
+ * In a shift-based DFA, the input byte is an index into array of integers
+ * whose bit pattern encodes the state transitions. To compute the next
+ * state, we simply right-shift the integer by the current state and apply a
+ * mask. In this scheme, the address of the transition only depends on the
+ * input byte.
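To illustrate the transition step, here is a toy example of the shift-based technique on a trivial two-state automaton (accepting byte strings that end in 'b'). Everything here -- the states, the 6-bit packing, the function name -- is invented for illustration; it shows the technique, not PostgreSQL's actual UTF-8 tables:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy shift-based DFA. States are bit offsets: S0 = 0, S1 = 6.
 * transitions[c] packs, at bit offset s, the 6-bit next state for
 * input byte c.
 */
enum { S0 = 0, S1 = 6 };

bool
ends_in_b(const char *str)
{
	uint64_t	transitions[256] = {0};	/* default: every byte goes to S0 */
	uint64_t	state = S0;

	/* on 'b', both states move to S1 */
	transitions['b'] = ((uint64_t) S1 << S0) | ((uint64_t) S1 << S1);

	/* one step per byte: a load, a shift, and a mask -- no branches */
	for (; *str; str++)
		state = (transitions[(unsigned char) *str] >> state) & 63;

	return state == S1;
}
```

Note that the next-state computation involves no branch and no state-dependent address calculation, which is the whole point of the scheme.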

speed up text_position() for utf-8

2021-12-13 Thread John Naylor
Hi,

Commit 9556aa01c69 (Use single-byte Boyer-Moore-Horspool search even
with multibyte encodings), was a speed improvement for the majority of
cases, but when the match was toward the end of the string, the
slowdown in text_position_get_match_pos() was noticeable. It was found
that there was a lot of overhead in pg_mblen(). [1]

The attached exploratory PoC improves this for utf-8. It applies on
top v25 of my utf-8 verification patch in [2], since one approach
relies on the DFA from it. The other three approaches are:
- a version of pg_utf_mblen() that uses a lookup table [3]
- an inlined copy of pg_utf_mblen()
- an ascii fast path with a fallback to the inlined copy of pg_utf_mblen()

The test is attached and the test function is part of the patch. It's
based on the test used in the commit above. The test searches for a
string that's at the end of a ~1 million byte string. This is on gcc
11 with 2-3 runs to ensure repeatability, but I didn't bother with
statistics because the differences are pretty big:

  patch  | no match | ascii | multibyte
-+--+---+---
 PG11| 1120 |  1100 |   900
 master  |  381 |  2350 |  1900
 DFA |  386 |  1640 |  1640
 branchless utf mblen|  387 |  4100 |  2600
 inline pg_utf_mblen()   |  380 |  1080 |   920
 inline pg_utf_mblen() + ascii fast path |  382 |   470 |   918

Neither of the branchless approaches worked well. The DFA can't work
as well here as in verification because it must do additional work.
Inlining pg_utf_mblen() restores worst-case performance to PG11
levels. The ascii fast path is a nice improvement on top of that. A
similar approach could work for pg_mbstrlen() as well, but I haven't
looked into that yet. There are other callers of pg_mblen(), but I
haven't looked into whether they are performance-sensitive. A more
general application would be preferable to a targeted one.
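To make the third approach concrete, here is a hedged sketch of an ASCII fast path falling back to an inlined multibyte length computation. The function name is hypothetical, and it assumes the input has already been validated as UTF-8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: count code points in already-validated UTF-8. */
size_t
count_utf8_chars(const unsigned char *s, size_t len)
{
	size_t		n = 0;

	while (len > 0)
	{
		uint64_t	chunk;
		int			l;

		/* fast path: skip 8 bytes at a time while they are all ASCII */
		if (len >= 8)
		{
			memcpy(&chunk, s, 8);
			if ((chunk & UINT64_C(0x8080808080808080)) == 0)
			{
				s += 8;
				len -= 8;
				n += 8;
				continue;
			}
		}

		/* slow path: inlined equivalent of pg_utf_mblen() */
		if ((*s & 0x80) == 0)
			l = 1;
		else if ((*s & 0xe0) == 0xc0)
			l = 2;
		else if ((*s & 0xf0) == 0xe0)
			l = 3;
		else
			l = 4;

		if ((size_t) l > len)
			l = (int) len;		/* guard against a truncated tail */
		s += l;
		len -= l;
		n++;
	}
	return n;
}
```

The repeated 0x80 mask tests all eight high bits at once, so the common all-ASCII case advances the count with a single load, AND, and compare.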

[1] 
https://www.postgresql.org/message-id/b65df3d8-1f59-3bd7-ebbe-68b81d5a76a4%40iki.fi
[2] 
https://www.postgresql.org/message-id/CAFBsxsHG%3Dg6W8Mie%2B_NO8dV6O0pO2stxrnS%3Dme5ZmGqk--fd5g%40mail.gmail.com
[3] https://github.com/skeeto/branchless-utf8/blob/master/utf8.h

--
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/utils/adt/varlena.c | 112 
 src/common/wchar.c  |  90 ++--
 src/include/mb/pg_wchar.h   |  53 ---
 src/test/regress/regress.c  |  18 +++
 4 files changed, 133 insertions(+), 140 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index bd3091bbfb..cc93818007 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -26,6 +26,7 @@
 #include "common/unicode_norm.h"
 #include "lib/hyperloglog.h"
 #include "libpq/pqformat.h"
+#include "mb/pg_utf8.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "parser/scansup.h"
@@ -1468,6 +1469,116 @@ text_position_get_match_pos(TextPositionState *state)
 {
 	if (!state->is_multibyte)
 		return state->last_match - state->str1 + 1;
+	else if (GetDatabaseEncoding() == PG_UTF8)
+	{
+#if 0	/* dfa */
+
+		int			utf8_state_count[UTF8_LAST_STATE + 1] = {0};
+		uint32		dfa_state = BGN;
+
+		/*
+		 * See pg_utf8.h for an explanation of the state machine. Unlike in
+		 * pg_wchar.c, we must calculate the exact state so we can use it as
+		 * an array index.
+		 */
+		while (state->refpoint < state->last_match)
+		{
+			const unsigned char *s = (const unsigned char *) state->refpoint++;
+
+			dfa_state = (Utf8Transition[*s] >> dfa_state) & 31;
+			utf8_state_count[dfa_state]++;
+		}
+		Assert(state->refpoint == state->last_match);
+		state->refpos += utf8_state_count[END];
+		return state->refpos + 1;
+
+#endif
+#if 0	/* like pg_utf_mblen(), but different algorithm: */
+
+		static const char lengths[] = {
+			1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+			1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 1
+		};
+
+		while (state->refpoint < state->last_match)
+		{
+			const unsigned char *s = (const unsigned char *) state->refpoint;
+
+			state->refpoint += lengths[s[0] >> 3];
+			state->refpos++;
+		}
+
+		Assert(state->refpoint == state->last_match);
+		return state->refpos + 1;
+#endif
+#if 0	/* inline pg_utf_mblen()*/
+
+		int len = 0;
+
+		while (state->refpoint < state->last_match)
+		{
+			const unsigned char *s = (const unsigned char *) state->refpoint;
+
+			if ((*s & 0x80) == 0)
+				len = 1;
+			else if ((*s & 0xe0) == 0xc0)
+				len = 2;
+			else if ((*s & 0xf0) == 0xe0)
+				len = 3;
+			e

Re: cutting down the TODO list thread

2021-12-14 Thread John Naylor
On Wed, Dec 8, 2021 at 1:40 PM John Naylor  wrote:
>
> It's been a while, but here are a few more suggested
> removals/edits/additions to the TODO list. Any objections or new
> information, let me know:
>
> - Auto-fill the free space map by scanning the buffer cache or by
> checking pages written by the background writer
> http://archives.postgresql.org/pgsql-hackers/2006-02/msg01125.php
> https://www.postgresql.org/message-id/200603011716.16984.pete...@gmx.net
>
> Both these threads are from 2006, so have nothing to do with the current FSM.

Moved to the Not Worth Doing list.

> - Allow concurrent inserts to use recently created pages rather than
> creating new ones
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg00853.php
>
> Skimming the first few messages, I believe this has been covered by
> commit 719c84c1b? (Extend relations multiple blocks at a time to
> improve scalability.)

Removed.

> - Allow VACUUM FULL and CLUSTER to update the visibility map
>
> This topic has a current CF entry which seems to have stalled, so that
> newer thread would be better to list here than the one from 2013.

Added.

> - Bias FSM towards returning free space near the beginning of the heap
> file, in hopes that empty pages at the end can be truncated by VACUUM
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01124.php
> https://www.postgresql.org/message-id/20150424190403.gp4...@alvh.no-ip.org
>
> I'm not sure what to think of this, but independently of that, the
> second thread is actually talking about bringing back something like
> the pre-9.0 vacuum full, so maybe it should be its own entry?

Done.

> - Consider a more compact data representation for dead tuple locations
> within VACUUM
> http://archives.postgresql.org/pgsql-patches/2007-05/msg00143.php
>
> Great, but let's link to this more recent thread instead:
> https://www.postgresql.org/message-id/flat/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com

Done.

> > The second thread is really about autovacuum launcher scheduling.
> > Probably still relevant, but the thread is very long and doesn't seem
> > terribly helpful to someone trying to get up to speed on the issues
> > that are still relevant. I don't see any more recent discussion,
> > either. Thoughts?

Split into two entries.

On Wed, Dec 8, 2021 at 8:12 PM Masahiko Sawada  wrote:
> There is another discussion on autovacuum scheduling in 2018 here:
>
> https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8A4DC6%40G01JPEXMBYT05
>
> Some algorithms were proposed there and I implemented a PoC patch:
>
> https://www.postgresql.org/message-id/CAD21AoBUaSRBypA6pd9ZD%3DU-2TJCHtbyZRmrS91Nq0eVQ0B3BA%40mail.gmail.com

Added, thanks!

--
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up text_position() for utf-8

2021-12-15 Thread John Naylor
I wrote:

> The test is attached and the test function is part of the patch. It's
> based on the test used in the commit above. The test searches for a
> string that's at the end of a ~1 million byte string. This is on gcc
> 11 with 2-3 runs to ensure repeatability, but I didn't bother with
> statistics because the differences are pretty big:
>
>   patch  | no match | ascii | multibyte
> -+--+---+---
>  PG11| 1120 |  1100 |   900
>  master  |  381 |  2350 |  1900
>  DFA |  386 |  1640 |  1640
>  branchless utf mblen|  387 |  4100 |  2600
>  inline pg_utf_mblen()   |  380 |  1080 |   920
>  inline pg_utf_mblen() + ascii fast path |  382 |   470 |   918

I failed to mention that the above numbers are milliseconds, so
smaller is better.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up verifying UTF-8

2021-07-20 Thread John Naylor
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:

> > An alternative idea: should we optimize for validation of **valid**
> > inputs rather than optimizing the worst case?
> > In other words, what if the implementation processes all characters
> > always and uses a slower method in case of validation failure?
> > I would guess it is more important to be faster with accepting valid
> > input rather than "faster to reject invalid input".
>
> > static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> > if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid,
> > then just accept it
> > return s + len;
> > }
> > // slow path: the string is not valid, perform a slower analysis
> > return s + ;
> > }

This turned out to be a really good idea (v19 attached):

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
2944 |  1523 |   871 |1473 |   1509

v14:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 888 |   607 |   179 | 777 |   1328

v19:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 809 |   472 |   223 | 558 |805

x86 Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 974 |   691 |   370 | 456 |526

v14:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 674 |   346 |78 | 309 |504

v19:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 379 |   181 |94 | 219 |376

Note that the branchy code's worst case (mixed8) is here the same speed as
multibyte. With Vladimir's idea*, we call check_ascii only every 8 bytes
of input, not every time we verify one multibyte character. Also, we only
have to check the DFA state every time we loop over 8 bytes, not every time
we step through the DFA. That means we have to walk backwards at the end to
find the last leading byte, but the SSE code already knew how to do that,
so I used that logic here in the caller, which will allow some
simplification of how the SSE code returns.
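The backward walk can be sketched as follows; this is an illustration of the idea (skipping continuation bytes, which always match 0b10xxxxxx), not the actual SSE helper:

```c
#include <assert.h>

/* Hypothetical sketch: back up from 'end' to the last UTF-8 lead byte. */
const unsigned char *
last_lead_byte(const unsigned char *start, const unsigned char *end)
{
	const unsigned char *p = end;

	/* skip over trailing continuation bytes (0b10xxxxxx) */
	while (p > start && (p[-1] & 0xC0) == 0x80)
		p--;

	return (p > start) ? p - 1 : start;
}
```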

The state check is likely why the ascii case is slightly slower than v14.
We could go back to checking ascii 16 bytes at a time, since there's little
penalty for doing so.

* (Greg was thinking the same thing upthread, but I don't think the branchy
code I posted at the time could have taken advantage of this)

I'm pretty confident this improvement is architecture-independent. Next
month I'll clean this up and rebase the SSE patch over this.

I wrote:

> + /*
> + * NB: This check must be strictly greater-than, otherwise an invalid
byte
> + * at the end might not get detected.
> + */
> + while (len > sizeof(__m128i))

Note to self: I actually think this isn't needed anymore since I changed
how the SSE code deals with remainder sequences at the end.

--
John Naylor
EDB: http://www.enterprisedb.com


v19-rewrite-pg_verify_str-for-speed.patch
Description: Binary data


Re: add 'noError' to euc_tw_and_big5.c

2021-07-21 Thread John Naylor
On Tue, Jul 20, 2021 at 10:35 PM Michael Paquier wrote:
>
> On Wed, Jul 21, 2021 at 02:15:14AM +, wangyu...@fujitsu.com wrote:
> > 'noError' argument was added at commit ea1b99a661,
> > but it seems to be neglected in euc_tw_and_big5.c Line 289.
> > please see the attachment.
>
> That sounds right to me.  Double-checking the area, I am not seeing
> another portion of the code to fix.

Agreed, will push.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: add 'noError' to euc_tw_and_big5.c

2021-07-21 Thread John Naylor
On Tue, Jul 20, 2021 at 10:35 PM Michael Paquier wrote:
>
> On Wed, Jul 21, 2021 at 02:15:14AM +, wangyu...@fujitsu.com wrote:
> > 'noError' argument was added at commit ea1b99a661,
> > but it seems to be neglected in euc_tw_and_big5.c Line 289.
> > please see the attachment.
>
> That sounds right to me.  Double-checking the area, I am not seeing
> another portion of the code to fix.

Pushed, but I forgot to give you review credit, sorry about that. Thanks
for taking a look!

--
John Naylor
EDB: http://www.enterprisedb.com


Re: postgresql.conf.sample missing units

2021-07-21 Thread John Naylor
On Mon, Jul 19, 2021 at 10:31 AM John Naylor wrote:
>
> On Mon, Jul 19, 2021 at 5:44 AM Pavel Luzanov wrote:
> >
> > Hello,
> >
> > I found that the start section of the postgresql.conf file is missing a
> > description of two units: bytes (appeared in version 11) and
> > microseconds (in version 12).
> >
> The attached patch makes these changes to the postgresql.conf.sample file.
>
> Seems like an oversight. I'll commit this soon barring objections.

I pushed this and backpatched to v12. I also extracted just the "bytes"
part and applied it to v11.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-18 Thread John Naylor
I wrote:

> On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
> >
> > Have you considered shift-based DFA for a portable implementation
> > https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
>
> I did consider some kind of DFA a while back and it was too slow.

I took a closer look at this "shift-based DFA", and it seemed pretty
straightforward to implement this on top of my DFA attempt from some months
ago. The DFA technique is not a great fit with our API, since we need to
return how many bytes we found valid. On x86 (not our target for the
fallback, but convenient to test) all my attempts were either worse than
HEAD in multiple cases, or showed no improvement for the important ASCII
case. On Power8, it's more compelling, and competitive with v14, so I'll
characterize it on that platform as I describe the patch series:

0001 is a pure DFA, and has decent performance on multibyte, but terrible
on ascii.
0002 dispatches on the leading byte category, unrolls the DFA loop
according to how many valid bytes we need, and only checks the DFA state
afterwards. It's good on multibyte (3-byte, at least) but still terrible on
ascii.
0003 adds a 1-byte ascii fast path -- while robust on all inputs, it still
regresses a bit on ascii.
0004 uses the same 8-byte ascii check as previous patches do.
0005 and 0006 use combinations of 1- and 8-byte ascii checks similar to in
v17.

0005 seems the best on Power8, and is very close to v4. FWIW, v14's
measurements seem lucky and fragile -- if I change any little thing, even

- return -1;
+ return 0;

it easily loses 100-200ms on non-pure-ascii tests. That said, v14 still
seems the logical choice, unless there is some further tweak on top of v17
or v18 that gives some non-x86 platform a significant boost.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
2944 |  1523 |   871 |1473 |   1509

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
1257 |  1681 |  1385 |1744 |   2018

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 951 |  1381 |  1217 |1469 |   1172

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 911 |   |   942 |1112 |865

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 987 |   730 |   222 |1325 |   2306

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 962 |   664 |   180 | 928 |   1179

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 908 |   663 |   244 |1026 |   1464

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 888 |   607 |   179 | 777 |   1328

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
1205 |   662 |   180 | 767 |   1256


Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 974 |   691 |   370 | 456 |526

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
1334 |  2713 |  2802 |2665 |   2541

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 733 |  1212 |  1064 |1034 |   1007

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 653 |   560 |   370 | 420 |465

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 574 |   402 |88 | 584 |   1033

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
1345 |   730 |   334 | 578 |909

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 674 |   485 |   153 | 594 |989

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 674 |   346 |78 | 309 |504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
-+---+---+-+
 516 |   324 |78 | 331 |544

-- 
John Naylor
EDB: http://www.enterprisedb.com


v18-0001-Use-pure-DFA.patch
Description: Binary data


v18-0002-Unroll-loop-in-DFA.patch
Description: Binary data


v18-0004-Check-ascii-8-bytes-at-a-time-with-bitwise-opera.patch
Description: Binary data


v18-0003-Add-ascii-fast-path-before-resorting-to-DFA.patch
Description: Binary data


v18-0005-Do-1-byte-ascii-check-if-8-byte-check-fails.patch
Description: Binary data


v18-0006-Do-8-byte-check-only-if-1-byte-check-succeeds.patch
Description: Binary data


Re: O_DIRECT on macOS

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 12:55 AM Thomas Munro wrote:
>
> On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
> > prairiedog thinks that Assert is too optimistic about whether all
> > those flags exist.
>
> Fixed.
>
> (Huh, I received no -committers email for 2dbe8905.)

It didn't show up in the archives, either. Neither did your follow-up
04cad8f7bc.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: postgresql.conf.sample missing units

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 5:44 AM Pavel Luzanov wrote:
>
> Hello,
>
> I found that the start section of the postgresql.conf file is missing a
> description of two units: bytes (appeared in version 11) and
> microseconds (in version 12).
>
> The attached patch makes these changes to the postgresql.conf.sample file.

Seems like an oversight. I'll commit this soon barring objections.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:

> It looks like it is important to have shrx for x86 which appears only
> when -march=x86-64-v3 is used (see
> https://github.com/golang/go/issues/47120#issuecomment-877629712 ).
> Just in case: I know x86 would not use the fallback implementation; however,
> the sole purpose of shift-based DFA is to fold all the data-dependent ops
> into a single instruction.

I saw mention of that instruction, but didn't understand how important it
was, thanks.

> An alternative idea: should we optimize for validation of **valid**
> inputs rather than optimizing the worst case?
> In other words, what if the implementation processes all characters
> always and uses a slower method in case of validation failure?
> I would guess it is more important to be faster with accepting valid
> input rather than "faster to reject invalid input".

> static int pg_utf8_verifystr2(const unsigned char *s, int len) {
> if (pg_is_valid_utf8(s, s+len)) { // fast path: if string is valid,
> then just accept it
> return s + len;
> }
> // slow path: the string is not valid, perform a slower analysis
> return s + ;
> }

That might be workable. We have to be careful because in COPY FROM,
validation is performed on 64kB chunks, and the boundary could fall in the
middle of a multibyte sequence. In the SSE version, there is this comment:

+ /*
+ * NB: This check must be strictly greater-than, otherwise an invalid byte
+ * at the end might not get detected.
+ */
+ while (len > sizeof(__m128i))

...which should have more detail on this.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-21 Thread John Naylor
On Wed, Jul 21, 2021 at 12:13 PM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
> It looks like the same utf8_advance function is good for both fast-path
and for the slow path.
> Then pg_utf8_verifychar could be removed altogether along with the
corresponding IS_*_BYTE_LEAD macros.

pg_utf8_verifychar() is a public function usually called
through pg_wchar_table[], so it needs to remain in any case.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: truncating timestamps on arbitrary intervals

2021-07-22 Thread John Naylor
On Thu, Jul 22, 2021 at 4:49 PM Bauyrzhan Sakhariyev <baurzhansahar...@gmail.com> wrote:
> Not related to negative interval - I created a PR for adding a zero check
> for stride https://github.com/postgres/postgres/pull/67 and after getting
> it closed I stopped right there - a 1-line check isn't worth going through
> the patching process I'm not familiar with.

Thanks for the pull request! I added tests and reworded the error message
slightly to match current style, and pushed.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread John Naylor
On Wed, Jul 21, 2021 at 11:29 AM Thomas Munro wrote:

> Just for fun/experimentation, here's a quick (and probably too naive)
> translation of those helper functions to NEON, on top of the v15
> patch.

Neat! It's good to make it more architecture-agnostic, and I'm sure we can
use quite a bit of this. I don't know enough about NEON to comment
intelligently, but a quick glance through the simdjson source show a couple
differences that might be worth a look:

 to_bool(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return vmaxvq_u32((uint32x4_t) v) != 0;

--> return vmaxvq_u8(*this) != 0;

 vzero()
 {
+#if defined(USE_NEON)
+ return vmovq_n_u8(0);

--> return vdupq_n_u8(0); // or equivalently, splat(0)

is_highbit_set(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));

--> return vmaxq_u8(v) > 0x7F

(Technically, their convention is: is_ascii(v) { return vmaxq_u8(v) < 0x80;
} , but same effect)

+#if defined(USE_NEON)
+static pg_attribute_always_inline pg_u8x16_t
+vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+ uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+ uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+ uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+ uint8 pg_attribute_aligned(16) values[16] = {
+ v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
+ };
+ return vld1q_u8(values);
+}

--> They have this strange beast instead:

  // Doing a load like so end ups generating worse code.
  // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
  // x9, x10,x11,x12,x13,x14,x15,x16};
  // return vld1q_u8(array);
  uint8x16_t x{};
  // incredibly, Visual Studio does not allow x[0] = x1
  x = vsetq_lane_u8(x1, x, 0);
  x = vsetq_lane_u8(x2, x, 1);
  x = vsetq_lane_u8(x3, x, 2);
...
  x = vsetq_lane_u8(x15, x, 14);
  x = vsetq_lane_u8(x16, x, 15);
  return x;

Since you aligned the array, that might not have the problem alluded to
above, and it looks nicer.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-22 Thread John Naylor
On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro  wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor wrote:

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte
SIMD can operate on 32-byte chunks with a bit of repetition:

-   __m128i input;
+   __m128i input1;
+   __m128i input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

while (len >= SIMD_STRIDE_LENGTH)
{
-   input = vload(s);
+   input1 = vload(s);
+   input2 = vload(s + sizeof(input1));

-   check_for_zeros(input, );
+   check_for_zeros(input1, );
+   check_for_zeros(input2, );

/*
 * If the chunk is all ASCII, we can skip the full UTF-8
check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int
len)
 * sequences at the end. We only update prev_incomplete if
the chunk
 * contains non-ASCII, since the error is cumulative.
 */
-   if (is_highbit_set(input))
+   if (is_highbit_set(bitwise_or(input1, input2)))
{
-   check_utf8_bytes(prev, input, );
-   prev_incomplete = is_incomplete(input);
+   check_utf8_bytes(prev, input1, );
+   check_utf8_bytes(input1, input2, );
+   prev_incomplete = is_incomplete(input2);
}
else
{
error = bitwise_or(error, prev_incomplete);
}

-   prev = input;
+   prev = input2;
s += SIMD_STRIDE_LENGTH;
len -= SIMD_STRIDE_LENGTH;
}

So with a few #ifdefs, we can accommodate two sizes if we like.

For another, the prevN() functions would need to change, at least on x86 --
that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8()
plus _mm256_permute2x128_si256(). Also, we might have to do something with
the vector typedef.

That said, I think we can punt on that until we have an application that's
much more compute-intensive. As it is with SSE4, COPY FROM WHERE  already pushes the utf8 validation way down in profiles.

> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> -+---+---
> 4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> -+---+---
> 2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> -+---+---
>  909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress
testing would be welcome on any platform.

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on.  I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't mind
having some type of server-class hardware that relies on the recent nifty
DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Use quick select instead of qsort to get median

2021-07-22 Thread John Naylor
On Thu, Jul 22, 2021 at 8:07 AM houzj.f...@fujitsu.com <houzj.f...@fujitsu.com> wrote:
>
> Hi,
>
> When I was writing an extension which needs to get the median of an array,
> I tried to find out if postgres provides some API that can do that. I
> found that all the places in postgres invoke qsort() and then get the
> median. I was wondering whether we can do better by using "quick select"
> and if it is worth it.

> Attached is a POC patch about this idea. I did some simple performance
> tests, and I can see about a 10% performance gain in this test[2].
>
> Thoughts ?

> 1.
> entry_dealloc
> ...
> /* Record the (approximate) median usage */
> if (i > 0)
> pgss->cur_median_usage = entries[i / 2]->counters.usage;

It might be useful to be more precise here, but it seems it would be
slower, too?

> -test create index
> drop index quad_box_tbl_idx;
> CREATE INDEX quad_box_tbl_idx ON quad_box_tbl USING spgist(b);
>
> ---test results
> PATCH:
> Time: 2609.664 ms (00:02.610)
>
> HEAD:
> Time: 2903.765 ms (00:02.944)

That index type is pretty rare, as far as I know. That doesn't seem to be
quite enough motivation to change the qsort template. If the goal was to
improve the speed of "create spgist index", would this still be the best
approach? Also, there are other things under consideration that would add
complexity to the qsort template [1], and this would add even more.

Looking in the docs [2], we don't have a MEDIAN aggregate, but we do have
percentile_disc(), and quick select might help there, but I haven't looked.

[1] https://commitfest.postgresql.org/33/3038/
[2] https://www.postgresql.org/docs/14/functions-aggregate.html
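For reference, here is a minimal in-place quickselect using Lomuto partitioning -- a generic sketch of the technique under discussion, not code from the proposed patch. Note that, like qsort, it reorders the input array:

```c
#include <assert.h>

static void
swap_int(int *a, int *b)
{
	int			t = *a;

	*a = *b;
	*b = t;
}

/* Return the k-th smallest element (0-based) of a[0..n-1], reordering a. */
int
quickselect(int *a, int n, int k)
{
	int			lo = 0;
	int			hi = n - 1;

	while (lo < hi)
	{
		int			pivot = a[hi];
		int			i = lo;

		/* Lomuto partition around a[hi] */
		for (int j = lo; j < hi; j++)
			if (a[j] < pivot)
				swap_int(&a[i++], &a[j]);
		swap_int(&a[i], &a[hi]);

		/* recurse (iteratively) only into the side containing k */
		if (k == i)
			return a[k];
		else if (k < i)
			hi = i - 1;
		else
			lo = i + 1;
	}
	return a[k];
}
```

A median is then quickselect(a, n, n / 2), found in O(n) expected time instead of qsort's O(n log n), at the cost of leaving the array only partially ordered.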
--
John Naylor
EDB: http://www.enterprisedb.com


Re: truncating timestamps on arbitrary intervals

2021-07-22 Thread John Naylor
On Thu, Jul 22, 2021 at 12:24 PM Bauyrzhan Sakhariyev <baurzhansahar...@gmail.com> wrote:
>
> Is date_bin supposed to return the beginning of the bin?

Thanks for testing! And yes.

> And does the sign of an interval define the "direction" of the bin?

No, the boundary is intentionally the earlier one:

/*
 * Make sure the returned timestamp is at the start of the bin, even if
 * the origin is in the future.
 */
if (origin > timestamp && stride_usecs > 1)
tm_delta -= stride_usecs;

I wonder if we should just disallow negative intervals here.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-16 Thread John Naylor
On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Have you considered shift-based DFA for a portable implementation
> https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?

I did consider some kind of DFA a while back and it was too slow.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-16 Thread John Naylor
Forgot the attachments...

--
John Naylor
EDB: http://www.enterprisedb.com


v17-0001-Rewrite-pg_utf8_verifystr-for-speed.patch
Description: Binary data


v17-0004-Second-attempt-at-addressing-performance-regress.patch
Description: Binary data


v17-0002-Use-integer-chunk-for-fast-path-multibyte-check.patch
Description: Binary data


v17-0003-First-attempt-at-addressing-performance-regressi.patch
Description: Binary data


Re: speed up verifying UTF-8

2021-07-16 Thread John Naylor
My v16 experimental patches were a bit messy, so I've organized an
experimental series that applies cumulatively, to try to trace the effects
of various things.

v17-0001 is the same as v14. 0002 is a stripped-down implementation of
Amit's chunk idea for multibyte, and it's pretty good on x86. On Power8,
not so much. 0003 and 0004 are shot-in-the-dark guesses to improve it on
Power8, with some success, but end up making x86 weirdly slow, so I'm
afraid that could happen on other platforms as well.

v14 still looks like the safe bet for now. It also has the advantage of
using the same function both in and out of the fastpath, which will come in
handy when moving it to src/port as the fallback for SSE.

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v17-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1017 |   718 |   156 |    1213 |   2138

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256

v17-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   660 |   224 |     868 |   1369


Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v17-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     621 |   537 |   323 |     413 |    602

v17-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     576 |   439 |   154 |     557 |    915

--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-15 Thread John Naylor
On Thu, Jul 15, 2021 at 1:10 AM Amit Khandekar 
wrote:

> - check_ascii() seems to be used only for 64-bit chunks. So why not
> remove the len argument and the len <= sizeof(int64) checks inside the
> function. We can rename it to check_ascii64() for clarity.

Thanks for taking a look!

Well yes, but there's nothing so intrinsic to 64 bits that the name needs
to reflect that. Earlier versions worked on 16 bytes at a time. The compiler
will optimize away the len check, but we could replace it with an assert
instead.

> - I was thinking, why not have a pg_utf8_verify64() that processes
> 64-bit chunks (or a 32-bit version). In check_ascii(), we anyway
> extract a 64-bit chunk from the string. We can use the same chunk to
> extract the required bits from a two byte char or a 4 byte char. This
> way we can avoid extraction of separate bytes like b1 = *s; b2 = s[1]
> etc.

Loading bytes from L1 is really fast -- I wouldn't even call it
"extraction".

> More importantly, we can avoid the separate continuation-char
> checks for each individual byte.

On a pipelined superscalar CPU, I wouldn't expect it to matter in the
slightest.

> Additionally, we can try to simplify
> the subsequent overlong or surrogate char checks. Something like this

My recent experience with itemptrs has made me skeptical of this kind of
thing, but the idea was interesting enough that I couldn't resist trying it
out. I have two attempts, which are attached as v16*.txt and apply
independently. They are rough, and some comments are now lies. To simplify
the constants, I do shift down to uint32, and I didn't bother working
around that. v16alpha regressed on worst-case input, so for v16beta I went
back to earlier coding for the one-byte ascii check. That helped, but it's
still slower than v14.

That was not unexpected, but I was mildly shocked to find out that v15 is
also slower than the v14 that Heikki posted. The only non-cosmetic
difference is using pg_utf8_verifychar_internal within pg_utf8_verifychar.
I'm not sure why it would make such a big difference here. The numbers on
Power8 / gcc 4.8 (little endian):

HEAD:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2951 |  1521 |   871 |    1474 |   1508

v14:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     885 |   607 |   179 |     774 |   1325

v15:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1085 |   671 |   180 |    1032 |   1799

v16alpha:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1268 |   822 |   180 |    1410 |   2518

v16beta:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1096 |   654 |   182 |     814 |   1403


As it stands now, for v17 I'm inclined to go back to v15, but without the
attempt at being clever that seems to have slowed it down from v14.

Any interest in testing on 64-bit Arm?

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 0636b8765b..177c607be6 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -13,8 +13,47 @@
 #include "c.h"
 
 #include "mb/pg_wchar.h"
+#include "port/pg_bswap.h"
 
 
+/* for UTF-8 */
+#define IS_CONTINUATION_BYTE(c)	(((c) & 0xC0) == 0x80)
+#define IS_TWO_BYTE_LEAD(c)		(((c) & 0xE0) == 0xC0)
+#define IS_THREE_BYTE_LEAD(c)	(((c) & 0xF0) == 0xE0)
+#define IS_FOUR_BYTE_LEAD(c)	(((c) & 0xF8) == 0xF0)
+
+/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
+static inline int
+check_ascii(const uint64 chunk)
+{
+   uint64
+   highbits_set,
+   highbit_carry;
+
+   /* Check if any bytes in this chunk have the high bit set. */
+   highbits_set = chunk & UINT64CONST(0x8080808080808080);
+   if (highbits_set)
+   return 0;
+
+   /*
+* Check if there are any zero bytes in this chunk.
+*
+* First, add 0x7f to each byte. This sets the high bit in each byte,
+* unless it was a zero. We already checked that none of the bytes had the
+* high bit set previously, so the max value each byte can have after the
+* addition is 0x7f + 0x7f = 0xfe, and we don't need to worry about
+* carrying over to the next byte.
+*/
+   highbit_carry = chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f);
+
+   /* Then check that the high bit is set in each byte. */
+   highbit_carry &= UINT64CONST(0x8080808080808080);
+   if (highbit_carry == UINT64CONST(0x8080808080808080))
+   return sizeof(chunk);
+   else
+   return 0;
+}
+
 /*
  * Operations on multi-byte encodings are driven by a table of helper
  * functions.
@@ -1728,6 +1767,97 @@ pg_gb18030_verifyst

Re: truncating timestamps on arbitrary intervals

2021-07-23 Thread John Naylor
On Thu, Jul 22, 2021 at 4:49 PM Bauyrzhan Sakhariyev <
baurzhansahar...@gmail.com> wrote:
>
> > No, the boundary is intentionally the earlier one:
>
> I found that commit in GitHub, thanks for pointing it out.
> When I test locally origin_in_the_future case I get different results for
positive and negative intervals (see queries #1 and #2 from above, they
have same timestamp, origin and interval magnitude, difference is only in
interval sign) - can it be that the version I downloaded from
https://www.enterprisedb.com/postgresql-early-experience doesn't include
commit with that improvement?

Sorry, I wasn't clear. The intention is that the boundary is on the lower
side, but query #1 doesn't follow that, so that's a bug in my view. I found
while developing the feature that the sign of the stride didn't seem to
matter, but evidently I didn't try with the origin in the future.

> >  I wonder if we should just disallow negative intervals here.
>
> I cannot imagine somebody using negative as a constant argument but users
can pass another column as a first argument date or some function(ts) - not
likely but possible. A line in docs about the leftmost point of interval as
start of the bin could be helpful.

In recent years there have been at least two attempts to add an absolute
value function for intervals, and both stalled over semantics, so I'd
rather just side-step the issue, especially as we're in beta.

> >In the case of full units (1 minute, 1 hour, etc.), it gives the same
result as the analogous date_trunc call,
> Was not obvious to me that we need to supply Monday origin to make
date_bin(1 week, ts) produce same result with date_trunc

The docs for date_trunc() don't mention this explicitly, but it might be
worth mentioning ISO weeks. There is a nearby mention for EXTRACT():

https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

"The number of the ISO 8601 week-numbering week of the year. By definition,
ISO weeks start on Mondays and the first week of a year contains January 4
of that year. In other words, the first Thursday of a year is in week 1 of
that year."

> Sorry for the verbose report and thanks for the nice function -  I know
it's not yet released, was just playing around with beta as I want to align
CrateDB date_bin with Postgresql

Thanks again for testing! This is good feedback.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: truncating timestamps on arbitrary intervals

2021-07-27 Thread John Naylor
I wrote:

> On Thu, Jul 22, 2021 at 4:49 PM Bauyrzhan Sakhariyev <
baurzhansahar...@gmail.com> wrote:
> >
> > > No, the boundary is intentionally the earlier one:
> >
> > I found that commit in GitHub, thanks for pointing it out.
> > When I test locally origin_in_the_future case I get different results
for positive and negative intervals (see queries #1 and #2 from above, they
have same timestamp, origin and interval magnitude, difference is only in
interval sign) - can it be that the version I downloaded from
https://www.enterprisedb.com/postgresql-early-experience doesn't include
commit with that improvement?
>
> Sorry, I wasn't clear. The intention is that the boundary is on the lower
side, but query #1 doesn't follow that, so that's a bug in my view. I found
while developing the feature that the sign of the stride didn't seem to
matter, but evidently I didn't try with the origin in the future.
>
> > >  I wonder if we should just disallow negative intervals here.
> >
> > I cannot imagine somebody using negative as a constant argument but
users can pass another column as a first argument date or some function(ts)
- not likely but possible. A line in docs about the leftmost point of
interval as start of the bin could be helpful.
>
> In recent years there have been at least two attempts to add an absolute
value function for intervals, and both stalled over semantics, so I'd
rather just side-step the issue, especially as we're in beta.

Concretely, I propose to push the attached on master and v14. Since we're
in beta 2 and this thread might not get much attention, I've CC'd the RMT.

--
John Naylor
EDB: http://www.enterprisedb.com


0001-Disallow-negative-strides-in-date_bin.patch
Description: Binary data


Re: speed up verifying UTF-8

2021-07-26 Thread John Naylor
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> >+ utf8_advance(s, state, len);
> >+
> >+ /*
> >+ * If we saw an error during the loop, let the caller handle it. We
treat
> >+ * all other states as success.
> >+ */
> >+ if (state == ERR)
> >+ return 0;
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state
variable)

Yep, that's a bug, thanks for catching!

> >I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve
validation of the shorter strings (the ones less than 32 bytes)?

Right. Also, the 32 byte threshold was just a temporary need for testing
32-byte stride -- testing different thresholds wouldn't hurt.  I'm not
terribly concerned about short strings, though, as long as we don't
regress.  That said, Heikki had something in his v14 [1] that we could use:

+/*
+ * Subroutine of pg_utf8_verifystr() to check on char. Returns the length
of the
+ * character at *s in bytes, or 0 on invalid input or premature end of
input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It would be easy to replace pg_utf8_verifychar with this. It might even
speed up the SQL function length_in_encoding() -- that would be a better
reason to do it.

[1]
https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-26 Thread John Naylor
Attached is v20, which has a number of improvements:

1. Cleaned up and explained DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to
produce an optimized loop, using branch-free accumulators. That way, it
doesn't need to be rewritten for different input lengths. I also think it's
a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.

With #2 above in place, I wanted to try different strides for the DFA, so
more measurements (hopefully not much more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance.
Other platforms may differ, but I don't expect it to make a huge amount of
difference.

Just for fun, I was also a bit curious about what Vladimir mentioned
upthread about x86-64-v3 offering a different shift instruction. Somehow,
clang 12 refused to build with that target, even though the release notes
say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2" :
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And, gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
  53: c4 e2 eb f7 d1   shrxq %rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16
shifts in the loop, which gcc won't do. To be clear, it's irrelevant, since
x86-64-v3 includes AVX2, and if we had that we would just use it with the
SIMD algorithm.

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com
From c82cbcf342f986396152a743a552626757b0a2b3 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sun, 25 Jul 2021 20:41:41 -0400
Subject: [PATCH v20] Add a fast path for validating UTF-8 text

Our previous validator is a traditional one that performs comparisons
and branching on one byte at a time. It's useful in that we always know
exactly how many bytes we have validated, but that precision comes at
a cost. Input validation can show up prominently in profiles of COPY
FROM, and future improvements to COPY FROM such as parallelism or line
and field parsing will put more pressure on input validation. Hence,
supplement with two fast path implementations:

On machines that support SSE4, use an algorithm described in the
paper "Validating UTF-8 In Less Than One Instruction Per Byte" by
John Keiser and Daniel Lemire. The authors have made available an
open source implementation within the simdjson library (Apache 2.0
license). The lookup tables and naming conventions were adopted from
this library, but the code was written from scratch.

On other hardware, use a "shift-based" DFA.

Both implementations are heavily optimized for blocks of ASCII text,
are relatively free of branching and thus robust against many kinds
of byte patterns, and delay error checking to the very end. With these
algorithms, UTF-8 validation is anywhere from two to seven times
faster, depending on platform and the distribution of byte sequences
in the input.

The previous coding in pg_utf8_verifystr() is retained for short
strings and for when the fast path returns an error.

Review, performance testing, and additional hacking by: Heikki
Linnakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and
Greg Stark

Discussion:
https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail

Re: speed up verifying UTF-8

2021-07-26 Thread John Naylor
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?

Sorry, I didn't see this earlier. No, I don't.
--
John Naylor
EDB: http://www.enterprisedb.com


Re: speed up verifying UTF-8

2021-07-15 Thread John Naylor
I wrote:

> To simplify the constants, I do shift down to uint32, and I didn't bother
working around that. v16alpha regressed on worst-case input, so for v16beta
I went back to earlier coding for the one-byte ascii check. That helped,
but it's still slower than v14.

It occurred to me that I could rewrite the switch test into simple
comparisons, like I already had for the 2- and 4-byte lead cases. While at
it, I folded the leading byte and continuation tests into a single
operation, like this:

/* 3-byte lead with two continuation bytes */
else if ((chunk & 0xF0C0C000) == 0xE0808000)

...and also tried using 64-bit constants to avoid shifting. Still didn't
quite beat v14, but got pretty close:

> The numbers on Power8 / gcc 4.8 (little endian):
>
> HEAD:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>     2951 |  1521 |   871 |    1474 |   1508
>
> v14:
>
>  chinese | mixed | ascii | mixed16 | mixed8
> ---------+-------+-------+---------+--------
>      885 |   607 |   179 |     774 |   1325

v16gamma:

 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     952 |   632 |   180 |     800 |   1333

A big-endian 64-bit platform just might shave enough cycles to beat v14
this way... or not.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 0636b8765b..f48d79638c 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -13,8 +13,41 @@
 #include "c.h"
 
 #include "mb/pg_wchar.h"
+#include "port/pg_bswap.h"
 
 
+/* Verify a chunk of bytes for valid ASCII including a zero-byte check. */
+static inline int
+check_ascii(const uint64 chunk)
+{
+   uint64
+   highbits_set,
+   highbit_carry;
+
+   /* Check if any bytes in this chunk have the high bit set. */
+   highbits_set = chunk & UINT64CONST(0x8080808080808080);
+   if (highbits_set)
+   return 0;
+
+   /*
+* Check if there are any zero bytes in this chunk.
+*
+* First, add 0x7f to each byte. This sets the high bit in each byte,
+* unless it was a zero. We already checked that none of the bytes had the
+* high bit set previously, so the max value each byte can have after the
+* addition is 0x7f + 0x7f = 0xfe, and we don't need to worry about
+* carrying over to the next byte.
+*/
+   highbit_carry = chunk + UINT64CONST(0x7f7f7f7f7f7f7f7f);
+
+   /* Then check that the high bit is set in each byte. */
+   highbit_carry &= UINT64CONST(0x8080808080808080);
+   if (highbit_carry == UINT64CONST(0x8080808080808080))
+   return sizeof(chunk);
+   else
+   return 0;
+}
+
 /*
  * Operations on multi-byte encodings are driven by a table of helper
  * functions.
@@ -1728,6 +1761,67 @@ pg_gb18030_verifystr(const unsigned char *s, int len)
return s - start;
 }
 
+/*
+ * Workhorse for pg_utf8_verifychar(). Returns the length of the character
+ * at *s in bytes, or -1 on invalid input or premature end of input.
+ * Static inline for the benefit of pg_utf8_verifystr().
+ */
+static inline int
+pg_utf8_verifychar_internal(const uint64 chunk_orig)
+{
+   const uint64 chunk = (pg_hton64(chunk_orig));
+
+   /* high bit should be set */
+   Assert((chunk & 0x8000) != 0);
+
+   /* 2-byte lead with one continuation byte */
+   if ((chunk & 0xE0C0) == 0xC080)
+   {
+   /* check 2-byte overlong: 1100.000x.10xx.xxxx */
+   if (chunk < 0xC200)
+   return -1;
+
+   /* found valid sequence for code points U+0080 through U+07FF */
+   return 2;
+   }
+   /* 3-byte lead with two continuation bytes */
+   else if ((chunk & 0xF0C0C000) == 0xE0808000)
+   {
+   /* check 3-byte overlong: 1110.0000 100x.xxxx 10xx.xxxx */
+   if (chunk < 0xE0A0)
+   return -1;
+
+   /* check surrogate: 1110.1101 101x.xxxx 10xx.xxxx */
+   if (chunk > 0xED9FBFFF && chunk < 0xEE00)
+   return -1;
+
+   /*
+* found valid sequence for code points U+0800 through U+D7FF or
+* U+E000 through U+FFFF
+*/
+   return 3;
+   }
+   /* 4-byte lead with three continuation bytes */
+   else if ((chunk & 0xF8C0C0C0) == 0xF0808080)
+   {
+   /*
+* check 4-byte overlong: 1111.0000 1000.xxxx 10xx.xxxx 10xx.xxxx
+*/
+   if (chunk < 0xF090)
+   return -1;
+
+ 

Re: Unused function parameter in get_qual_from_partbound()

2021-07-14 Thread John Naylor
On Mon, Jul 12, 2021 at 8:46 AM John Naylor 
wrote:
>
> On Tue, Jun 8, 2021 at 10:50 PM Michael Paquier 
wrote:
> >
> > At first glance, this looked to me like breaking something just for
> > sake of breaking it, but removing the rel argument could be helpful
> > to simplify any external code calling it as there would be no need for
> > this extra Relation.  So that looks like a good idea, no need to rush
> > that into 14 though.
>
> I found no external references in codesearch.debian.net. I plan to commit
this in the next couple of days unless there are objections.

This has been committed.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: do only critical work during single-user vacuum?

2022-01-13 Thread John Naylor
On Wed, Jan 12, 2022 at 12:26 PM Bossart, Nathan  wrote:
>
> > - For the general case, we would now have the ability to vacuum a
> > table, and possibly have no effect at all. That seems out of place
> > with the other options.
>
> Perhaps a message would be emitted when tables are specified but
> skipped due to the min-xid-age option.
>
> As I've stated upthread, Sawada-san's suggested approach was my
> initial reaction to this thread.  I'm not wedded to the idea of adding
> new options, but I think there are a couple of advantages.  For both
> single-user mode and normal operation (which may be in imminent
> wraparound danger), you could use the same command:
>
> VACUUM (MIN_XID_AGE 16, ...);

My proposed top-level statement can also be used in normal operation,
so the only possible advantage is configurability. But I don't really
see any advantage in that -- I don't think we should be moving in the
direction of adding more-intricate ways to paper over the deficiencies
in autovacuum scheduling. (It could be argued that I'm doing exactly
that in this whole thread, but [imminent] shutdown situations have
other causes besides deficient scheduling.)

> (As an aside, we'd need to figure out how XID and MXID options would
> work together.  Presumably most users would want to OR them.)
>
> This doesn't really tie in super nicely with the failsafe mechanism,
> but adding something like a FAILSAFE option doesn't seem right to me,

I agree -- it would be awkward and messy as an option. However, I see
the same problem with xid/mxid -- I would actually argue they are not
even proper options; they are "selectors". Your comments above about
1) needing to OR them and 2) emitting a message when a VACUUM command
doesn't actually do anything are evidence of that fact.

> The other advantage I see with age-related options is that it can be
> useful for non-imminent-wraparound situations as well.  For example,
> maybe a user just wants to manually vacuum everything (including
> indexes) with an age above 500M on the weekends.

There is already vacuumdb for that, and I think its method of
selecting tables is sound -- I'm not convinced that pushing table
selection to the server command as "options" is an improvement.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-01-12 Thread John Naylor
On Tue, Jan 11, 2022 at 9:20 PM Justin Pryzby  wrote:
>
> On Tue, Jan 11, 2022 at 07:58:56PM -0500, John Naylor wrote:
> > + // FIXME: also check reloption
> > + // WIP: 95% is a starting point for discussion
> > + if ((table_xid_age < autovacuum_freeze_max_age * 0.95) ||
> > + (table_mxid_age < autovacuum_multixact_freeze_max_age 
> > * 0.95))
> > + continue;
>
> Should be &&

Thanks! Will fix.

> Should this emergency vacuum "order by age() DESC" ?

That would add complexity and only save a marginal amount of time.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-01-12 Thread John Naylor
On Wed, Jan 12, 2022 at 1:49 AM Masahiko Sawada  wrote:
> It seems to me that adding new syntax instead of a new option is less
> flexible. In the future, for instance, when we support parallel heap
> scan for VACUUM, we may want to add a parallel-related option to both
> VACUUM statement and VACUUM LIMIT statement. VACUUM LIMIT statement
> would end up becoming like VACUUM statement?

This is intended for single-user mode, so parallelism is not relevant.

> As another idea, we might be able to add a new option that takes an
> optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
> VACUUM (MIN_XID 50). We vacuum only tables whose age is older than
> the given value. If the value is omitted, we vacuum only tables whose
> age exceeds a threshold (say autovacuum_freeze_max_age * 0.95), which
> can be used in an emergency case and output in GetNewTransactionID()
> WARNINGs output. vacuumdb’s --min-xid-age and --min-mxid-age can use
> this option instead of fetching the list of tables from the server.

That could work, and maybe also have general purpose, but I see two
problems if I understand you correctly:

- If we have a default threshold when the values are omitted, that
implies we need to special-case single-user mode with non-obvious
behavior, which is not ideal, as Andres mentioned upthread. (Or, now
manual VACUUM by default would not do anything, except in extreme
cases, which is worse.)

- In the single-user case, the admin would still need to add
INDEX_CLEANUP = off for minimum downtime, and it should be really
simple.

- For the general case, we would now have the ability to vacuum a
table, and possibly have no effect at all. That seems out of place
with the other options.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-01-12 Thread John Naylor
On Tue, Jan 11, 2022 at 8:57 PM Peter Geoghegan  wrote:
> On Tue, Jan 11, 2022 at 4:59 PM John Naylor
> > For the PoC I wanted to try re-using existing keywords. I went with
> > "VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
> > a table name. It also brings "wraparound limit" to mind. We could add
> > a single-use unreserved keyword (such as VACUUM_MINIMAL or
> > VACUUM_FAST), but that doesn't seem great.
>
> This seems reasonable, but you could add a new option instead, without
> much downside. While INDEX_CLEANUP kind of looks like a keyword, it
> isn't really a keyword. (Perhaps you knew this already.)
>
> Making this a new option is a little awkward, admittedly. It's not
> clear what it means to "VACUUM (LIMIT) my_table" -- do you just throw
> an error for stuff like that? So perhaps your approach of adding
> VacuumMinimalStmt (a minimal variant of the VACUUM command) is better.

We'd also have to do some checks to either ignore other options or
throw an error, which seems undesirable for code maintenance. For that
reason, I prefer the separate top-level statement, but I'm open to
bike-shedding on the actual syntax. I also briefly looked into a SQL
function, but the transaction management would make that more
difficult.

> > I'm not sure what the right trade-off is, but as written I used 95% of
> > max age. It might be undesirable to end up so close to kicking off
> > uninterruptible vacuums, but the point is to get out of single-user
> > mode and back to streaming WAL as quickly as possible. It might also
> > be worth overriding the min ages as well, but haven't done so here.
>
> I wonder if we should keep autovacuum_freeze_max_age out of it -- its
> default is too conservative in general. I'm concerned that applying
> this autovacuum_freeze_max_age test during VACUUM LIMIT doesn't go far
> enough -- it may require VACUUM LIMIT to do significantly more work
> than is needed to get the system back online (while leaving a sensible
> amount of headroom). Also seems like it might be a good idea to avoid
> relying on the user configuration, given that VACUUM LIMIT is only run
> when everything is already in disarray. (Besides, it's not clear that
> it's okay to use the autovacuum_freeze_max_age GUC without also using
> the reloption of the same name.)
>
> What do you think of applying a similar test using a generic 1 billion
> XID (and 1 billion MXID) age cutoff?

I like that a lot, actually. It's simple and insulates us from
wondering about corner cases in configuration.

> The GetNewTransactionId() WARNINGs ought to be changed to reference
> VACUUM LIMIT. (You probably just didn't get around to that in this
> POC, but couldn't hurt to remind you.)

I'll do that as well as documentation after we have agreement (or at
least lack of objection) on the syntax.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-01-12 Thread John Naylor
On Mon, Jun 28, 2021 at 8:16 PM Thomas Munro  wrote:

[v4 patchset]

Hi Thomas,

(Sorry for the delay -- I have some time to put into this now.)

> The lower numbered patches are all things that are reused in many
> places, and in my humble opinion improve the notation and type safety
> and code deduplication generally when working with common types

I think 0001-0003 have had enough review previously to commit them, as
they are mostly notational. There's a small amount of bitrot, but not
enough to change the conclusions any. Also 0011 with the missing
#undef.

On Thu, Aug 5, 2021 at 7:18 PM Peter Geoghegan  wrote:
>
> If somebody wants to get a sense of what the size hit is from all of
> these specializations, I can recommend the diff feature of bloaty:
>
> https://github.com/google/bloaty/blob/master/doc/using.md#size-diffs
>
> Obviously you'd approach this by building postgres without the patch,
> and diffing that baseline to postgres with the patch. And possibly
> variations of the patch, with less or more sort specializations.

Thanks, that's a neat feature! For 0001-0003, the diff shows +700
bytes in memory, so pretty small:

$ bloaty -s vm src/backend/postgres -- src/backend/postgres.orig
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.0%    +608   +0.0%    +608    .text
  +0.0%     +64   +0.0%     +64    .eh_frame
  +0.0%     +24   +0.0%     +24    .dynsym
  +0.0%     +14   +0.0%     +14    .dynstr
  +0.0%      +2   +0.0%      +2    .gnu.version
  +0.0%     +58   [ = ]       0    .debug_abbrev
  +0.1%     +48   [ = ]       0    .debug_aranges
  +0.0% +1.65Ki   [ = ]       0    .debug_info
  +0.0%    +942   [ = ]       0    .debug_line
  +0.1%     +26   [ = ]       0    .debug_line_str
  +0.0%    +333   [ = ]       0    .debug_loclists
  -0.0%     -23   [ = ]       0    .debug_rnglists
  +0.0%     +73   [ = ]       0    .debug_str
  -1.0%      -4   [ = ]       0    .shstrtab
  +0.0%     +20   [ = ]       0    .strtab
  +0.0%     +24   [ = ]       0    .symtab
   +131% +3.30Ki  [ = ]       0    [Unmapped]
  +0.0% +7.11Ki   +0.0%    +712    TOTAL

[back to Thomas]

> I tried to measure a speedup in vacuum, but so far I have not.  I did
> learn some things though:  While doing that with an uncorrelated index
> and a lot of deleted tuples, I found that adding more
> maintenance_work_mem doesn't help beyond a few MB, because then cache
> misses dominate to the point where it's not better than doing multiple
> passes (and this is familiar to me from work on hash joins).  If I
> turned on huge pages on Linux and set min_dynamic_shared_memory so
> that the parallel DSM used by vacuum lives in huge pages, then
> parallel vacuum with a large maintenance_work_mem starts to do much
> better than non-parallel vacuum by improving the TLB misses (as with
> hash joins).  I thought that was quite interesting!  Perhaps
> bsearch_itemptr might help with correlated indexes with a lot of
> deleted tuples (so not dominated by cache misses), though?
>
> (I wouldn't be surprised if someone comes up with a much better idea
> than bsearch for that anyway...  a few ideas have been suggested.)

That's interesting about the (un)correlated index having such a large
effect on cache hit rate! By now there has been some discussion and a
benchmark for dead tuple storage [1], but there doesn't seem to be
recent activity on that thread. We might consider adding the ItemPtr
comparator work I did in [2] for v15 if we don't have any of the other
proposals in place by feature freeze. My concern there is that the
speedups I observed occurred when the values were comfortably in L2
cache, IIRC. That would need wider testing.

That said, I think what I'll do next is test the v3-0001 standalone
patch with tuplesort specializations for more data types. I already
have a decent test script that I can build on for this. (this is the
one currently in CI)

Then, I want to do at least preliminary testing of the qsort boundary
parameters.

Those two things should be doable for this commitfest.

[1] 
https://www.postgresql.org/message-id/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFBsxsG_c24CHKA3cWrOP1HynWGLOkLb8hyZfsD9db5g-ZPagA%40mail.gmail.com

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2022-01-11 Thread John Naylor
On Tue, Dec 7, 2021 at 10:51 PM Bossart, Nathan  wrote:
>
> On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
>  wrote:
> > On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
> >> I agree with Tom.  I would just s/server/backend/ (as per the
> >> attached) and call it a day.
> >
> > Thanks. v5 patch looks good to me.
>
> I've marked the commitfest entry as ready-for-committer.

I pushed this with one small change -- I felt the comment didn't need
to explain the warning message, since it now simply matches the coding
more exactly. Also, v5 was a big enough change from v4 that I put
Nathan as the first author.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-01-11 Thread John Naylor
On Tue, Dec 21, 2021 at 4:56 PM Peter Geoghegan  wrote:

> But if we're going to add a new option to the VACUUM command (or
> something of similar scope), then we might as well add a new behavior
> that is reasonably exact -- something that (say) only *starts* a
> VACUUM for those tables whose relfrozenxid age currently exceeds half
> the autovacuum_freeze_max_age for the table (usually taken from the
> GUC, sometimes taken from the reloption), which also forces the
> failsafe. And with similar handling for
> relminmxid/autovacuum_multixact_freeze_max_age.

> This new command/facility should probably not be a new flag to the
> VACUUM command, as such. Rather, I think that it should either be an
> SQL-callable function, or a dedicated top-level command (that doesn't
> accept any tables). The only reason to have this is for scenarios
> where the user is already in a tough spot with wraparound failure,
> like that client of yours. Nobody wants to force the failsafe for one
> specific table. It's not general purpose, at all, and shouldn't claim
> to be.

I've attached a PoC *untested* patch to show what it would look like
as a top-level statement. If the "shape" is uncontroversial, I'll put
work into testing it and fleshing it out.

For the PoC I wanted to try re-using existing keywords. I went with
"VACUUM LIMIT" since LIMIT is already a keyword that cannot be used as
a table name. It also brings "wraparound limit" to mind. We could add
a single-use unreserved keyword (such as VACUUM_MINIMAL or
VACUUM_FAST), but that doesn't seem great.

> In other words, while triggering the failsafe is important, simply *not
> starting* VACUUM for relations where there is really no need for it is
> at least as important. We shouldn't even think about pruning or
> freezing with these tables. (ISTM that the only thing that might be a
> bit controversial about any of this is my definition of "safe", which
> seems like about the right trade-off to me.)

I'm not sure what the right trade-off is, but as written I used 95% of
max age. It might be undesirable to end up so close to kicking off
uninterruptible vacuums, but the point is to get out of single-user
mode and back to streaming WAL as quickly as possible. It might also
be worth overriding the min ages as well, but haven't done so here.

It can be executed in normal mode (although it's not expected to be),
which makes testing easier and allows for a future possibility of not
requiring shutdown at all, by e.g. terminating non-superuser
connections.

-- 
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/commands/vacuum.c  | 133 +
 src/backend/nodes/copyfuncs.c  |   3 +
 src/backend/nodes/equalfuncs.c |   3 +
 src/backend/parser/gram.y  |  10 +++-
 src/backend/tcop/utility.c |  13 
 src/include/commands/vacuum.h  |   1 +
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/parsenodes.h |  12 
 8 files changed, 165 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 287098e4d0..d1c59a78e9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -52,6 +52,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -271,10 +272,132 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	/* user-invoked vacuum never uses this parameter */
 	params.log_min_duration = -1;
 
+	/*
+	 * Create special memory context for cross-transaction storage.
+	 *
+	 * Since it is a child of PortalContext, it will go away eventually even
+	 * if we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	vac_context = AllocSetContextCreate(PortalContext,
+		"Vacuum",
+		ALLOCSET_DEFAULT_SIZES);
+
 	/* Now go through the common routine */
 	vacuum(vacstmt->rels, &params, NULL, isTopLevel);
 }
 
+/*
+ * Like ExecVacuum, but specialized for recovering quickly from anti-wraparound
+ * shutdown.
+ */
+void
+ExecVacuumMinimal(VacuumMinimalStmt *fmstmt, bool isTopLevel)
+{
+	VacuumParams params;
+	List	   *vacrels = NIL;
+	Relation	pgclass;
+	TableScanDesc scan;
+	HeapTuple	tuple;
+	int32 table_xid_age;
+	int32 table_mxid_age;
+	int32 save_VacuumCostDelay;
+
+	/* use defaults */
+	// WIP: It might be worth trying to do less work here
+	params.freeze_min_age = -1;
+	params.multixact_freeze_min_age = -1;
+
+	/* it's unlikely any table we choose will not be eligible for aggressive vacuum, but make sure */
+	params.freeze_table_age = 0;
+	params.multixact_freeze_table_age = 0;
+
+	/* skip unnecessary work, as in failsafe mode */
+	params.index_cleanup = VACOPTVALUE_DIS

Re: Non-decimal integer literals

2022-02-13 Thread John Naylor
On Wed, Jan 26, 2022 at 10:10 PM Peter Eisentraut
 wrote:
> [v8 patch]

0001-0004 seem pretty straightforward.

0005:

 {realfail1} {
- /*
- * throw back the [Ee], and figure out whether what
- * remains is an {integer} or {decimal}.
- */
- yyless(yyleng - 1);
  SET_YYLLOC();
- return process_integer_literal(yytext, yylval);
+ yyerror("trailing junk after numeric literal");
  }

realfail1 has been subsumed by integer_junk and decimal_junk, so that
pattern can be removed.

 {
+/*
+ * Note that some trailing junk is valid in C (such as 100LL), so we contain
+ * this to SQL mode.
+ */

It seems Flex doesn't like C comments after the "%%", so this stanza
was indented in 0006. If these are to be committed separately, that
fix should happen here.

0006:

Minor nit -- the s/decimal/numeric/ change doesn't seem to have any
notational advantage, but it's not worse, either.

0007:

I've attached an addendum to restore the no-backtrack property.

Will the underscore syntax need treatment in the input routines as well?

-- 
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/backend/parser/Makefile b/src/backend/parser/Makefile
index 827bc4c189..5ddb9a92f0 100644
--- a/src/backend/parser/Makefile
+++ b/src/backend/parser/Makefile
@@ -56,7 +56,7 @@ gram.c: BISON_CHECK_CMD = $(PERL) $(srcdir)/check_keywords.pl 
$< $(top_srcdir)/s
 
 
 scan.c: FLEXFLAGS = -CF -p -p
-#scan.c: FLEX_NO_BACKUP=yes
+scan.c: FLEX_NO_BACKUP=yes
 scan.c: FLEX_FIX_WARNING=yes
 
 
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 5b574c4233..3b311ac2dd 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -400,9 +400,9 @@ hexinteger  0[xX](_?{hexdigit})+
 octinteger 0[oO](_?{octdigit})+
 bininteger 0[bB](_?{bindigit})+
 
-hexfail		0[xX]
-octfail		0[oO]
-binfail		0[bB]
+hexfail		0[xX]_?
+octfail		0[oO]_?
+binfail		0[bB]_?
 
numeric			(({decinteger}\.{decinteger}?)|(\.{decinteger}))
numericfail		{decdigit}+\.\.
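
As an aside for readers less fluent in flex: the hexinteger pattern
0[xX](_?{hexdigit})+ accepts at most one underscore before each hex
digit (including right after the radix prefix), so doubled or trailing
underscores are rejected. A throwaway C check with the same acceptance
behavior, purely for illustration (this is not scanner code):

```c
#include <ctype.h>

/*
 * Illustrative only -- mirrors the acceptance behavior of the flex
 * pattern 0[xX](_?{hexdigit})+ : at most one underscore may precede
 * each hex digit, so "0x", "0x_", "0xF_", and "0xF__F" are rejected.
 */
static int
is_hexinteger(const char *s)
{
	if (s[0] != '0' || (s[1] != 'x' && s[1] != 'X'))
		return 0;
	s += 2;
	if (*s == '\0')
		return 0;				/* bare prefix: the hexfail case */
	do
	{
		if (*s == '_')
			s++;				/* optional underscore before a digit */
		if (!isxdigit((unsigned char) *s))
			return 0;
		s++;
	} while (*s != '\0');
	return 1;
}
```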


Re: Consistently use "startup process"/"WAL sender" instead of "Startup process"/"WAL Sender" in comments and docs.

2022-02-13 Thread John Naylor
On Sun, Feb 13, 2022 at 3:13 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> In the code comments, it is being used as "Startup process" in the
> middle of the sentences whereas in most of the places "startup
> process" is used. Also, "WAL Sender" is being used instead of "WAL
> sender". Let's be consistent across the docs and code comments.

FWIW, docs need to hold to a higher standard than code comments.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Consistently use "startup process"/"WAL sender" instead of "Startup process"/"WAL Sender" in comments and docs.

2022-02-14 Thread John Naylor
On Mon, Feb 14, 2022 at 11:12 AM Bharath Rupireddy
 wrote:

> > FWIW, docs need to hold to a higher standard than code comments.
>
> Thanks. In general, I agree that the docs and error/log messages
> (user-facing) need to be of higher standard, but at the same time code
> comments too IMHO. Because many hackers/developers consider code
> comments a great place to learn the internals, being consistent there
> does no harm.

git log and git blame are more useful when they are free of minor
stylistic changes. Of course, corrections and clarity are different
matters that are worth fixing in the comments. I've pushed the doc
fixes, thanks for the patch!


--
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-14 Thread John Naylor
On Tue, Feb 15, 2022 at 11:22 AM Peter Geoghegan  wrote:
>
> On Mon, Feb 14, 2022 at 8:04 PM John Naylor
>  wrote:
> > The failsafe mode does disable truncation as of v14:
> >
> > commit 60f1f09ff44308667ef6c72fbafd68235e55ae27
> > Author: Peter Geoghegan 
> > Date:   Tue Apr 13 12:58:31 2021 -0700
> >
> > Don't truncate heap when VACUUM's failsafe is in effect.
>
> That's true, but bear in mind that it only does so when the specific
> table being vacuumed actually triggers the failsafe. I believe that
> VACUUM(EMERGENCY) doesn't just limit itself to vacuuming tables where
> this is guaranteed (or even likely). If I'm not mistaken, it's
> possible (even likely) that there will be a table whose
> age(relfrozenxid) is high enough for VACUUM(EMERGENCY) to target the
> table, and yet not so high that the failsafe will kick in at the
> earliest opportunity.

Well, the reason for inventing this new vacuum mode was that I
thought that upon reaching xidStopLimit, we couldn't issue commands,
period, under the postmaster. If it were easier to get a test instance
to xidStopLimit, I certainly would have discovered this sooner. When
Andres wondered about getting away from single-user mode, I assumed
that would involve getting into areas too deep to tackle for v15. As
Robert pointed out, lazy_truncate_heap is the only thing that can't
happen for vacuum at this point, which fully explains why, in versions
< 14, our client's attempts to vacuum resulted in errors. Since the
failsafe mode turns off truncation, vacuum should now *just work* near
wraparound. If there is any doubt, we can tighten the check for
entering failsafe.

Now, it's certainly possible that autovacuum is either not working at
all because of something broken, or is not working on the oldest
tables at the moment, so one thing we could do is to make VACUUM [with
no tables listed] get the tables from pg_class in reverse order of
max(xid age, mxid age). That way, the horizon will eventually pull
back over time and the admin can optionally cancel the vacuum at some
point. Since the order is harmless when it's not needed, we can do
that unconditionally.
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-14 Thread John Naylor
On Fri, Feb 4, 2022 at 4:58 AM Robert Haas  wrote:

> As I said
> before, I know TRUNCATE has been an issue in the past, and if that's
> not already fixed in v14, we should. If there's other stuff, we should
> fix that too.

The failsafe mode does disable truncation as of v14:

commit 60f1f09ff44308667ef6c72fbafd68235e55ae27
Author: Peter Geoghegan 
Date:   Tue Apr 13 12:58:31 2021 -0700

Don't truncate heap when VACUUM's failsafe is in effect.
--

To demonstrate to myself, I tried a few vacuums in a debugger session
with a breakpoint at GetNewTransactionId(). I've only seen it reached
when heap truncation happens (or with FULL and ANALYZE, which are not
relevant in wraparound situations).

With the maximum allowable setting of autovacuum_freeze_max_age of 2
billion, the highest allowable vacuum_failsafe_age is 2.1 billion, so
heap truncation will be shut off before the warnings start.

> And then we should KILL WITH FIRE
> the message telling people to use single user mode -- and once we do
> that, the question of what the behavior ought to be when someone does
> run VACUUM in single user mode becomes a lot less important.

Okay, so it sounds like changing the message is enough for v15? The
other two things mentioned are nice-to-haves, but wouldn't need to
hold back this minimal change, it seems:

On Fri, Feb 4, 2022 at 4:50 AM Andres Freund  wrote:

> I wonder if we shouldn't add some exceptions to the xid allocation
> prevention. It makes sense that we don't allow random DML. But it's e.g. often
> more realistic to drop / truncate a few tables with unimportant content,
> rather than spend the time vacuuming those.  We could e.g. allow xid
> consumption within VACUUM, TRUNCATE, DROP TABLE / INDEX when run at the top
> level for longer than we allow it for anything else.

It seems like this would require having access to "nodetag(parsetree)"
of the statement available in GetNewTransactionId. I don't immediately
see an easy way to do that...is a global var within the realm of
acceptability?

On Fri, Feb 4, 2022 at 8:35 AM Andres Freund  wrote:

> we'd actually tell the user a bit more what about what is causing the
> problem.
>
> We can compute the:
> 1) oldest slot by xmin, with name
> 2) oldest walsender by xmin, with pid
> 3) oldest prepared transaction id by xid / xmin, with name
> 4) oldest in-progress transaction id by xid / xmin, with name
> 5) oldest database datfrozenxid, with database name
[...]
> Also, adding an SRF providing the above in a useful format would be great for
> monitoring and for "remote debugging" of problems.

I concur it sounds very useful, and not terribly hard, but probably a
v16 project.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-16 Thread John Naylor
On Wed, Feb 16, 2022 at 6:17 AM Peter Geoghegan  wrote:
>
> On Tue, Feb 15, 2022 at 9:28 AM Peter Geoghegan  wrote:

> > I did notice from my own testing of the failsafe (by artificially
> > inducing wraparound failure using an XID burning C function) that
> > autovacuum seemed to totally correct the problem, even when the system
> > had already crossed xidStopLimit - it came back on its own. I wasn't
> > completely sure of how robust this effect was, though.

I'll put some effort into finding any way that it might not be robust.
After that, changing the message and docs is trivial.

> It seemed worth noting this in comments above
> should_attempt_truncation(). Pushed a commit to do that just now.

Thanks for that.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: minor code correction in typecmds.c

2022-02-16 Thread John Naylor
On Wed, Feb 16, 2022 at 7:17 PM Amul Sul  wrote:
>
> Hi,
>
> The attached patch replaces the hard-coded type alignment value with
> the defined macro for the same.

Pushed, thanks!

-- 
John Naylor
EDB: http://www.enterprisedb.com




use rotate macro in more places

2022-02-19 Thread John Naylor
We've accumulated a few bit-twiddling hacks to get the compiler to
emit a rotate instruction. Since we have a macro for that, let's use
it, as in the attached. I thought the new call sites would look better
with a "left" version, so I added a new macro for that. That's not
necessary, however.

Some comments now look a bit too obvious to keep around, but maybe
they should be replaced with a "why", instead of a "what":

/* rotate hashkey left 1 bit at each step */
-   hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+   hashkey = pg_rotate_left32(hashkey, 1);
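
To spell out the equivalence (a standalone sketch, not PostgreSQL code;
the helper below is a stand-in for the proposed pg_rotate_left32 in
pg_bitutils.h): the open-coded idiom and a rotate-left helper agree
bit-for-bit.

```c
#include <stdint.h>

/* Stand-in for the proposed pg_rotate_left32 macro, for illustration. */
static inline uint32_t
rotate_left32(uint32_t word, int n)
{
	return (word << n) | (word >> (32 - n));
}

/* The open-coded "rotate left 1 bit" idiom being replaced. */
static inline uint32_t
rotate_left1_open_coded(uint32_t hashkey)
{
	return (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
}
```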


-- 
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index af6e9c42d8..ce8f6cd047 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -460,7 +460,7 @@ TupleHashTableHash_internal(struct tuplehash_hash *tb,
 		bool		isNull;
 
 		/* rotate hashkey left 1 bit at each step */
-		hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+		hashkey = pg_rotate_left32(hashkey, 1);
 
 		attr = slot_getattr(slot, att, &isNull);
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 4d68a8b97b..6f9d0f9487 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1841,7 +1841,7 @@ ExecHashGetHashValue(HashJoinTable hashtable,
 		bool		isNull;
 
 		/* rotate hashkey left 1 bit at each step */
-		hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+		hashkey = pg_rotate_left32(hashkey, 1);
 
 		/*
 		 * Get the join attribute value of the tuple
diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index 55cdd5c4d9..e32e4c04e6 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -167,7 +167,7 @@ MemoizeHash_hash(struct memoize_hash *tb, const MemoizeKey *key)
 		for (int i = 0; i < numkeys; i++)
 		{
 			/* rotate hashkey left 1 bit at each step */
-			hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+			hashkey = pg_rotate_left32(hashkey, 1);
 
 			if (!pslot->tts_isnull[i])	/* treat nulls as having hash key 0 */
 			{
@@ -190,7 +190,7 @@ MemoizeHash_hash(struct memoize_hash *tb, const MemoizeKey *key)
 		for (int i = 0; i < numkeys; i++)
 		{
 			/* rotate hashkey left 1 bit at each step */
-			hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+			hashkey = pg_rotate_left32(hashkey, 1);
 
 			if (!pslot->tts_isnull[i])	/* treat nulls as having hash key 0 */
 			{
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index eb83088089..ec073e1ed0 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -26,6 +26,7 @@
 #include "catalog/pg_type.h"
 #include "common/hashfn.h"
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"		/* for on_proc_exit */
 #endif
@@ -281,25 +282,18 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
 	{
 		case 4:
 			oneHash = (cc_hashfunc[3]) (v4);
-
-			hashValue ^= oneHash << 24;
-			hashValue ^= oneHash >> 8;
+			hashValue ^= pg_rotate_left32(oneHash, 24);
 			/* FALLTHROUGH */
 		case 3:
 			oneHash = (cc_hashfunc[2]) (v3);
-
-			hashValue ^= oneHash << 16;
-			hashValue ^= oneHash >> 16;
+			hashValue ^= pg_rotate_left32(oneHash, 16);
 			/* FALLTHROUGH */
 		case 2:
 			oneHash = (cc_hashfunc[1]) (v2);
-
-			hashValue ^= oneHash << 8;
-			hashValue ^= oneHash >> 24;
+			hashValue ^= pg_rotate_left32(oneHash, 8);
 			/* FALLTHROUGH */
 		case 1:
 			oneHash = (cc_hashfunc[0]) (v1);
-
 			hashValue ^= oneHash;
 			break;
 		default:
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 44c74fb974..65aa5ec29d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -285,12 +285,18 @@ extern int	pg_popcount64(uint64 word);
 extern uint64 pg_popcount(const char *buf, int bytes);
 
 /*
- * Rotate the bits of "word" to the right by n bits.
+ * Rotate the bits of word to the right/left by n bits.
  */
 static inline uint32
 pg_rotate_right32(uint32 word, int n)
 {
-	return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
+	return (word >> n) | (word << (32 - n));
+}
+
+static inline uint32
+pg_rotate_left32(uint32 word, int n)
+{
+	return (word << n) | (word >> (32 - n));
 }
 
 #endif			/* PG_BITUTILS_H */
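
The catcache hunk is the least obvious rewrite in the patch: it replaces
the pair "hashValue ^= oneHash << 24; hashValue ^= oneHash >> 8;" with a
single rotate. They are equivalent because the two shifted halves occupy
disjoint bits, so XORing them in separately equals XORing in their OR,
which is exactly the rotation. A standalone sketch (illustration only,
not PostgreSQL code):

```c
#include <stdint.h>

/* Plain rotate-left, matching the pg_rotate_left32 added by the patch. */
static inline uint32_t
rotl32(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* The original two-statement form from CatalogCacheComputeHashValue. */
static inline uint32_t
xor_shift_pair(uint32_t hashValue, uint32_t oneHash)
{
	hashValue ^= oneHash << 24;
	hashValue ^= oneHash >> 8;	/* disjoint bits from the line above */
	return hashValue;
}
```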


Re: use rotate macro in more places

2022-02-19 Thread John Naylor
On Sun, Feb 20, 2022 at 1:03 AM Yugo NAGATA  wrote:
> I think we can use this macro also in hash_multirange, hash_range,
> and JsonbHashScalarValue as in the attached patch. How about replacing
> them with the macro, too.

Good find. I also found one more in hashfn.c.

On Sat, Feb 19, 2022 at 11:28 PM Tom Lane  wrote:
> > Some comments now look a bit too obvious to keep around, but maybe
> > they should be replaced with a "why", instead of a "what":
>
> Yeah.  Maybe like "combine successive hashkeys by rotating"?

Done that way.

Pushed, thank you both for looking!
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: initdb / bootstrap design

2022-02-15 Thread John Naylor
On Wed, Feb 16, 2022 at 9:12 AM Andres Freund  wrote:

> To me the division of labor between initdb and bootstrap doesn't make much
> sense anymore:
[...]
> I don't plan to work on this immediately, but I thought it's worth bringing up
> anyway.

Added TODO item "Rationalize division of labor between initdb and bootstrap"

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Mark all GUC variable as PGDLLIMPORT

2022-02-15 Thread John Naylor
On Tue, Feb 15, 2022 at 11:07 PM Andres Freund  wrote:

> > diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
> > index a4b5dc853b..13849a3790 100644
> > --- a/src/include/common/relpath.h
> > +++ b/src/include/common/relpath.h
> > @@ -56,7 +56,7 @@ typedef enum ForkNumber
> >
> >  #define FORKNAMECHARS4   /* max chars for a fork name 
> > */
> >
> > -extern const char *const forkNames[];
> > +extern PGDLLIMPORT const char *const forkNames[];
> >
> >  extern ForkNumber forkname_to_number(const char *forkName);
> >  extern int   forkname_chars(const char *str, ForkNumber *fork);
>
> I think we might need a new form of PGDLLIMPORT to mark these correctly - I
> don't think they should be marked PGDLLIMPORT in frontend environment.

That has been taken care of already, IIUC:

commit e04a8059a74c125a8af94acdcb7b15b92188470a
Author: Tom Lane 
Date:   Mon Nov 29 11:00:00 2021 -0500

Simplify declaring variables exported from libpgcommon and libpgport.

This reverts commits c2d1eea9e and 11b500072, as well as similar hacks
elsewhere, in favor of setting up the PGDLLIMPORT macro so that it can
just be used unconditionally.  That can work because in frontend code,
we need no marking in either the defining or consuming files for a
variable exported from these libraries; and frontend code has no need
to access variables exported from the core backend, either.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: some aspects of our qsort might not be ideal

2022-02-15 Thread John Naylor
> [3] 
> https://www.postgresql.org/message-id/flat/87F42982BF2B434F831FCEF4C45FC33E5BD369DF%40EXCHANGE.corporate.connx.com#e69718293c45d89555020bd0922ad055

The top of that thread is where I meant to point to:

https://www.postgresql.org/message-id/flat/CAEYLb_Xn4-6f1ofsf2qduf24dDCVHbQidt7JPpdL_RiT1zBJ6A%40mail.gmail.com

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: some aspects of our qsort might not be ideal

2022-02-15 Thread John Naylor
> This way, we *dynamically*
> decide to optimistically start an insertion sort, in the hopes that it
> will occasionally prevent a recursion, and worst case the input is
> slightly more sorted for the next recursion.

I should mention one detail -- we put a limit on how many iterations
we attempt before we give up and partition/recurse. The author's
implementation chooses 8 for this limit. The paper mentions this
technique in section 5.2, but is not the origin of it.

-- 
John Naylor
EDB: http://www.enterprisedb.com




some aspects of our qsort might not be ideal

2022-02-15 Thread John Naylor
While reviewing Thomas Munro's patchset to consider expanding the uses
of specialized qsort [1], I wondered about some aspects of the current
qsort implementation. For background, ours is based on Bentley &
McIlroy "Engineering a Sort Function" [2], which is a classic paper
worth studying.

1) the check for pre-sorted input at the start of each and every recursion

This has been discussed before [3]. This is great if the input is
sorted, and mostly harmless if the current partition is badly sorted
enough that this check fails quickly, but it's not hard to imagine
that this is mostly a waste of cycles. There have been proposals to
base the pre-sorted check on input size [4] or to do it only once at
the beginning, but that strikes me as looking for the keys under the
lamppost because that's where the light is. The logical criterion for
proceeding to check if the input is sorted is whether we *think* the
input could be sorted. That may sound like a tautology, but consider
the case where the partitioning phase didn't perform any swaps. Then,
it makes sense to check, but we can go further. What if we do the
check, but toward the end the check fails? If just a couple of elements
are out of place, does it really make sense to give up, partition, and
recurse? If just a few iterations of insertion sort are all that is
needed to finish, why not just do that? This way, we *dynamically*
decide to optimistically start an insertion sort, in the hopes that it
will occasionally prevent a recursion, and worst case the input is
slightly more sorted for the next recursion. All we need is a
lightweight way to detect that the partitioning phase didn't do any
swaps. More on this later.
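
To make the idea concrete, here is a rough standalone sketch -- not
PostgreSQL code; the name and structure are invented for illustration,
and the limit of 8 is the value the pdqsort implementation uses. The
limit is checked before each element so the array is left in a valid,
partially sorted state even when we give up:

```c
#include <stddef.h>

/* Bail-out threshold on total element moves; 8 follows pdqsort. */
#define PARTIAL_INSERTION_SORT_LIMIT 8

/*
 * Optimistically try to finish a partition with insertion sort.
 * Returns 1 if the array ended up fully sorted, 0 if we gave up
 * early (leaving the input slightly more sorted for the recursion
 * that follows).
 */
static int
partial_insertion_sort(int *a, size_t n)
{
	size_t		moves = 0;

	for (size_t i = 1; i < n; i++)
	{
		int			tmp;
		size_t		j;

		if (moves > PARTIAL_INSERTION_SORT_LIMIT)
			return 0;			/* too unsorted: partition and recurse instead */

		tmp = a[i];
		j = i;
		while (j > 0 && a[j - 1] > tmp)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = tmp;
		moves += i - j;			/* distance this element was shifted */
	}
	return 1;
}
```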

2) cardinality issues can cancel abbreviated keys, but our qsort is
not optimized for that

Since in this case comparison is very expensive, it stands to reason
that qsort could profitably be optimized for a low number of unique
keys. The Bentley & McIlroy scheme does take great pains to prevent
quadratic behavior from lots of duplicates, but I've found something
that might have stronger guarantees: Pattern-defeating quicksort
(paper: [5] C++ implementation: [6]) claims worst-case runtime of
O(nk) for inputs with k distinct elements. This is achieved via an
asymmetric partitioning scheme. It's not complex, but it is subtle, so
I won't try to describe it here. I recommend reading section 3 of the
paper for details. Rust and Boost use PDQuicksort in some form, so it
has some production use.

The partitioning scheme just mentioned also provides an easy way to
detect when no swaps have been done, thus solving #1 above.

There is one requirement that is a double-edged sword: Recursion to
the partitions must happen in order. This has an additional advantage
that in every case but one, insertion sort doesn't need to guard
against running off the beginning of the array. The downside for us is
that we currently recurse to a partition based on its size as a
stack-guarding measure, so that guard would have to be replaced. The
C++ implementation of PDQuicksort falls back to heap sort as a last
resort to bound runtime, but it would be just as effective at guarding
the stack. That's a bit of a snag for making it production-ready, but
not enough to prevent testing the actual qsort part.

Does anyone see a reason not to put in the necessary work to try it out?

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com
[2] https://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[3] 
https://www.postgresql.org/message-id/flat/87F42982BF2B434F831FCEF4C45FC33E5BD369DF%40EXCHANGE.corporate.connx.com#e69718293c45d89555020bd0922ad055
[4] 
https://www.postgresql.org/message-id/CA%2BTgmobKFcb6ajVEN8eSnBa78sB3FSwuEWTHXd2x9JC5DOh5OA%40mail.gmail.com
[5] https://arxiv.org/pdf/2106.05123.pdf
[6] https://github.com/orlp/pdqsort

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-03-31 Thread John Naylor
On Wed, Mar 16, 2022 at 4:48 AM Peter Geoghegan  wrote:
>
> On Wed, Feb 16, 2022 at 12:43 AM John Naylor
>  wrote:
> > I'll put some effort in finding any way that it might not be robust.
> > After that, changing the message and docs is trivial.
>
> It would be great to be able to totally drop the idea of using
> single-user mode before Postgres 15 feature freeze. How's that going?

Unfortunately, I was distracted from this work for a time, and just as
I had intended to focus on it during March, I was out sick for 2-3
weeks. I gather from subsequent discussion that a full solution goes
beyond just a new warning message and documentation. Either way I'm
not quite prepared to address this in time for v15.

> I suggest that we apply the following patch as part of that work. It
> adds one last final failsafe check at the point that VACUUM makes a
> final decision on rel truncation.

That is one thing that was in the back of my mind, and it seems
reasonable to me.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-03-31 Thread John Naylor
In a couple days I'm going to commit the v3 patch "accelerate tuple
sorting for common types" as-is after giving it one more look, barring
objections.

I started towards incorporating the change in insertion sort threshold
(part of 0010), but that caused regression test failures, so that will
have to wait for a bit of analysis and retesting. (My earlier tests
were done in a separate module.)

The rest in this series that I looked at closely were either
refactoring or could use some minor tweaks so likely v16 material.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> It looks like UBsan sees a problem, per BF animal kestrel:
>
> /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51:
> runtime error: load of value 96, which is not a valid value for type
> 'bool'

Yeah, same with tamandua. Then, skink (a Valgrind animal) shows:

==1940791== VALGRINDERROR-BEGIN
==1940791== Conditional jump or move depends on uninitialised value(s)
==1940791==at 0x73D394: ApplyInt32SortComparator (sortsupport.h:311)
==1940791==by 0x73D394: qsort_tuple_int32_compare (tuplesort.c:722)
==1940791==by 0x73D394: qsort_tuple_int32 (sort_template.h:313)
==1940791==by 0x7409BC: tuplesort_sort_memtuples (tuplesort.c:3613)
==1940791==by 0x742806: tuplesort_performsort (tuplesort.c:2154)
==1940791==by 0x23C109: heapam_relation_copy_for_cluster
(heapam_handler.c:955)
==1940791==by 0x35799A: table_relation_copy_for_cluster (tableam.h:1658)
==1940791==by 0x35799A: copy_table_data (cluster.c:913)
==1940791==by 0x359016: rebuild_relation (cluster.c:606)
==1940791==by 0x35914E: cluster_rel (cluster.c:427)
==1940791==by 0x3594EB: cluster (cluster.c:195)
==1940791==by 0x5C73FF: standard_ProcessUtility (utility.c:862)
==1940791==by 0x5C78D0: ProcessUtility (utility.c:530)
==1940791==by 0x5C4C7B: PortalRunUtility (pquery.c:1158)
==1940791==by 0x5C4F78: PortalRunMulti (pquery.c:1315)
==1940791==  Uninitialised value was created by a stack allocation
==1940791==at 0x74224E: tuplesort_putheaptuple (tuplesort.c:1800)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro  wrote:
>
> On Thu, Mar 31, 2022 at 11:09 PM John Naylor
>  wrote:
> > In a couple days I'm going to commit the v3 patch "accelerate tuple
> > sorting for common types" as-is after giving it one more look, barring
> > objections.

Pushed.

> Hi John,
>
> Thanks so much for all the work you've done here!  I feel bad that I
> lobbed so many experimental patches in here and then ran away due to
> lack of cycles.  That particular patch (the one cfbot has been chewing
> on all this time) does indeed seem committable, despite the
> deficiencies/opportunities listed in comments.  It's nice to reduce
> code duplication, it gives the right answers, and it goes faster.

Thanks for chiming in! It gives me more confidence that there wasn't
anything amiss that may have gone unnoticed. And no worries -- my own
review efforts here have been sporadic. ;-)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
I wrote:

> I started towards incorporating the change in insertion sort threshold
> (part of 0010), but that caused regression test failures, so that will
> have to wait for a bit of analysis and retesting. (My earlier tests
> were done in a separate module.)

The failures seem to be where sort order is partially specified. E.g.
ORDER BY col_a, where there are duplicates there and other columns are
different. Insertion sort is stable IIRC, so moving the threshold
caused different orders in these cases. Some cases can be conveniently
fixed with additional columns in the ORDER BY clause. I'll go through
the failures and see how much can be cleaned up as a preparatory
refactoring.
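As a sketch of the kind of fix involved (table and column names hypothetical, not taken from the actual failing tests):

```sql
-- Rows with duplicate values in "a" can legally come back in any order
-- among equal keys, and a change in sort stability (insertion sort is
-- stable, quicksort is not) changes which order we happen to get:
SELECT a, b FROM tbl ORDER BY a;

-- Adding a unique tiebreaker column pins down the expected output:
SELECT a, b FROM tbl ORDER BY a, b;
```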

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> Reproduced locally, using the same few lines from the cluster.sql
> test.  I'll try to dig more tomorrow...

Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: CLUSTER sort on abbreviated expressions is broken

2022-04-03 Thread John Naylor
On Sun, Apr 3, 2022 at 11:05 AM Thomas Munro  wrote:
>
> Hi,
>
> Independently of a problem with a recent commit, it seems that
> $SUBJECT in all releases (well, I only tested as far back as 11).

I can confirm the problem on v10 as well.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-01 Thread John Naylor
On Thu, Jan 27, 2022 at 8:28 PM Justin Pryzby  wrote:

> I'm sure you meant "&" here (fixed in attached patch to appease the cfbot):
> +   if (options | VACOPT_MINIMAL)

Thanks for catching that! That copy-pasto was also masking my failure
to process the option properly -- fixed in the attached as v5.
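To illustrate why the original test was wrong, here is a minimal sketch (the flag values and function names are illustrative, not the real VACOPT_* definitions): with `|` the condition is always true whenever the flag constant itself is nonzero, regardless of what the caller passed, whereas `&` tests whether the bit is actually set.

```c
/* Illustrative option bits in the style of VACOPT_*; values are made up. */
#define VACOPT_VERBOSE  (1 << 0)
#define VACOPT_MINIMAL  (1 << 5)

/* Buggy form: "options | VACOPT_MINIMAL" is nonzero for any input. */
static int
minimal_requested_buggy(int options)
{
	return (options | VACOPT_MINIMAL) != 0;
}

/* Correct form: true only when the MINIMAL bit is actually set. */
static int
minimal_requested(int options)
{
	return (options & VACOPT_MINIMAL) != 0;
}
```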

> It should either refuse to run if a list of tables is specified with MINIMAL,
> or it should filter that list by XID condition.

I went with the former for simplicity. As a single-purpose option, it
makes sense.

> As for the name, it could be MINIMAL or FAILSAFE or EMERGENCY or ??
> I think the name should actually be a bit more descriptive, and maybe say XID,
> like MINIMAL_XID or XID_EMERGENCY...

I went with EMERGENCY in this version to reinforce its purpose in the
mind of the user (and reader of this code).

> Normally, options are independent, but VACUUM (MINIMAL) is a "shortcut" to a
> hardcoded set of options: freeze on, truncate off, cleanup off.  So it refuses
> to be combined with other options - good.
>
> This is effectively a shortcut to hypothetical parameters for selecting tables
> by XID/MXID age.  In the future, someone could debate adding user-facing knobs
> for table selection by age.

I used the params struct in v5 for the emergency cutoff ages. Even
with the values hard-coded, it seems cleaner to keep them here.

> I still wonder if the relations should be processed in order of decreasing 
> age.
> An admin might have increased autovacuum_freeze_max_age up to 2e9, and your
> query might return thousands of tables, with a wide range of sizes and ages.
>
> Processing them in order of decreasing age would allow the admin to quickly
> vacuum the oldest tables, and optionally interrupt vacuum to get out of single
> user mode ASAP - even if they just want to run VACUUM(MINIMAL) in a normal

> backend when services aren't offline.  Processing them out of order might be
> pretty surprising - they might run vacuum for an hour (or overnight), cancel
> it, attempt to start the DB in normal mode, and conclude that it made no
> visible progress.

While that seems like a nice property to have, it does complicate
things, so can be left for follow-on work.

Also in v5:

- It mentions the new command in the error hint in
GetNewTransactionId(). I'm not sure if multi-word commands should be
quoted like this.
- A first draft of documentation

--
John Naylor
EDB: http://www.enterprisedb.com
 doc/src/sgml/maintenance.sgml   |  12 ++--
 doc/src/sgml/ref/vacuum.sgml|  22 
 src/backend/access/transam/varsup.c |   4 +-
 src/backend/commands/vacuum.c   | 107 +---
 src/include/commands/vacuum.h   |   5 ++
 5 files changed, 134 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 36f975b1e5..5c36049950 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -629,17 +629,19 @@ HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that data
 
 
 ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
+HINT:  Stop the postmaster and run "VACUUM (EMERGENCY)" in that database in single-user mode.
 
 
 The three-million-transaction safety margin exists to let the
 administrator recover without data loss, by manually executing the
-required VACUUM commands.  However, since the system will not
+required VACUUM command.  However, since the system will not
 execute commands once it has gone into the safety shutdown mode,
 the only way to do this is to stop the server and start the server in single-user
-mode to execute VACUUM.  The shutdown mode is not enforced
-in single-user mode.  See the  reference
-page for details about using single-user mode.
+mode to execute VACUUM (EMERGENCY). The EMERGENCY option
+is recommended, since it enables the vacuum to complete as quickly as possible
+while still leaving a safety margin for when the system comes back online again.
+The shutdown mode is not enforced in single-user mode.
+See the  reference page for details about using single-user mode.

 

diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 3df32b58ee..2dab01ff37 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -295,6 +295,28 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean
 
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 748120a012..ee9d33dba3 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -128,14 +128,14 @@ GetNewTransactionId(bool isSubXact)
 		(errcode(ERRCODE_

Re: A qsort template

2022-02-02 Thread John Naylor
I wrote:

> > 0010 - Thresholds on my TODO list.
>
> I did some basic tests on the insertion sort thresholds, and it looks
> like we could safely and profitably increase the current value from 7
> to 20 or so, in line with other more recent implementations. I've
> attached an addendum on top of 0012 and the full test results on an
> Intel Coffee Lake machine with gcc 11.1. I found that the object test
> setup in 0012 had some kind of bug that was comparing the pointer of
> the object array. Rather than fix that, I decided to use Datums, but
> with the two extremes in comparator: simple branching with machine
> instructions vs. a SQL-callable function. The papers I've read
> indicate the results for Datum sizes would not be much different for
> small structs. The largest existing sort element is SortTuple, but
> that's only 24 bytes and has a bulky comparator as well.
>
> The first thing to note is that I rejected outright any testing of a
> "middle value" where the pivot is simply the middle of the array. Even
> the Bentley and McIlroy paper which is the reference for our
> implementation says "The range that consists of the single integer 7
> could be eliminated, but has been left adjustable because on some
> machines larger ranges are a few percent better".
>
> I tested thresholds up to 64, which is where I guessed results to get
> worse (most implementations are smaller than that). Here are the best
> thresholds at a quick glance:
>
> - elementary comparator:
>
> random: 16 or greater
> decreasing, rotate: get noticeably better all the way up to 64
> organ: little difference, but seems to get better all the way up to 64
> 0/1: seems to get worse above 20
>
> - SQL-callable comparator:
>
> random: between 12 and 20, but slight differences until 32
> decreasing, rotate: get noticeably better all the way up to 64
> organ: seems best at 12, but slight differences until 32
> 0/1: slight differences
>
> Based on these tests and this machine, it seems 20 is a good default
> value. I'll repeat this test on one older Intel and one non-Intel
> platform with older compilers.

The above was an Intel Comet Lake / gcc 11, and I've run the same test
on a Haswell-era Xeon / gcc 8 and a Power8 machine / gcc 4.8. The
results on those machines are pretty close to the above (full results
attached). The noticeable exception is the Power8 on random input with
a slow comparator -- those measurements there are more random than
others so we can't draw conclusions from them, but the deviations are
small in any case. I'm still thinking 20 or so is about right.
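For reference, the structure being tuned looks roughly like this (a self-contained sketch, not the actual sort_template.h code): quicksort recurses until a partition has at most INSERTION_THRESHOLD elements, then switches to insertion sort. 7 is the historical Bentley and McIlroy value; the measurements above suggest something near 20 on current hardware.

```c
#define INSERTION_THRESHOLD 20	/* the tunable under discussion */

/* Sort a[lo..hi] (inclusive) by insertion sort; cheap for small n. */
static void
insertion_sort(int *a, int lo, int hi)
{
	for (int i = lo + 1; i <= hi; i++)
	{
		int		tmp = a[i];
		int		j = i - 1;

		while (j >= lo && a[j] > tmp)
		{
			a[j + 1] = a[j];
			j--;
		}
		a[j + 1] = tmp;
	}
}

/* Hybrid quicksort: partition while the range is large, then fall back. */
static void
hybrid_qsort(int *a, int lo, int hi)
{
	while (hi - lo + 1 > INSERTION_THRESHOLD)
	{
		int		pivot = a[lo + (hi - lo) / 2];
		int		i = lo;
		int		j = hi;

		/* Hoare-style partition around the middle element */
		while (i <= j)
		{
			while (a[i] < pivot)
				i++;
			while (a[j] > pivot)
				j--;
			if (i <= j)
			{
				int		tmp = a[i];

				a[i] = a[j];
				a[j] = tmp;
				i++;
				j--;
			}
		}
		/* recurse into the smaller side, loop on the larger */
		if (j - lo < hi - i)
		{
			if (lo < j)
				hybrid_qsort(a, lo, j);
			lo = i;
		}
		else
		{
			if (i < hi)
				hybrid_qsort(a, i, hi);
			hi = j;
		}
	}
	insertion_sort(a, lo, hi);
}
```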

I've put a lot out here recently, so I'll take a break now and come
back in a few weeks.

(no running tally here because the conclusions haven't changed since
last message)
-- 
John Naylor
EDB: http://www.enterprisedb.com
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=0, time=0.130188
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=1, time=0.124503
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=2, time=0.124557
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=0, time=0.119103
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=1, time=0.119035
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=2, time=0.086948
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=0, time=0.082710
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=1, time=0.082771
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=2, time=0.083004
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=0, time=0.080768
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=1, time=0.080764
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=2, time=0.080635
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=0, time=0.080678
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=1, time=0.080555
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=2, time=0.080604
NOTICE:  [direct] size=8MB, order=random, threshold=28, test=0, time=0.079002
NOTICE:  [direct] size=8MB, order=random, threshold=28, test=1, time=0.078901
NOTICE:  [direct] size=8MB, order=random, threshold=28, test=2, time=0.082200
NOTICE:  [direct] size=8MB, order=random, threshold=32, test=0, time=0.079317
NOTICE:  [direct] size=8MB, order=random, threshold=32, test=1, time=0.079164
NOTICE:  [direct] size=8MB, order=random, threshold=32, test=2, time=0.079308
NOTICE:  [direct] size=8MB, order=random, threshold=64, test=0, time=0.078604
NOTICE:  [direct] size=8MB, order=random, threshold=64, test=1, time=0.078718
NOTICE:  [direct] size=8MB, order=random, threshold=64, test=2, time=0.078689
NOTICE:  [direct] size=8MB, order=decreasing, threshold=7, test=0, time=0.025097
NOTICE:  [direct] size=8MB, order=decreasing, threshold=7, test=1, time=0.025078
NO

Re: speed up text_position() for utf-8

2022-02-02 Thread John Naylor
I wrote:

> I tried this test on a newer CPU, and 0003 had no regression. Both
> systems used gcc 11.2. Rather than try to figure out why an experiment
> had unexpected behavior, I plan to test 0001 and 0002 on a couple
> different compilers/architectures and call it a day. It's also worth
> noting that 0002 by itself seemed to be decently faster on the newer
> machine, but not as fast as 0001 and 0002 together.

I tested four machines with various combinations of patches, and it
seems the only thing they all agree on is that 0002 is a decent
improvement (full results attached). The others can be faster or
slower. 0002 also simplifies things, so it has that going for it. I
plan to commit that this week unless there are objections.
-- 
John Naylor
EDB: http://www.enterprisedb.com
haswell xeon / gcc 8

master:
Time: 608.047 ms
Time: 2859.939 ms (00:02.860)
Time: 2255.402 ms (00:02.255)

0001 only:
Time: 603.026 ms
Time: 3077.163 ms (00:03.077)
Time: 2377.189 ms (00:02.377)

0002 only:
Time: 615.444 ms
Time: 2388.939 ms (00:02.389)
Time: 2227.665 ms (00:02.228)

0001-0002:
Time: 608.757 ms
Time: 2746.711 ms (00:02.747)
Time: 2156.545 ms (00:02.157)

0001-0003:
Time: 609.236 ms
Time: 1466.130 ms (00:01.466)
Time: 1353.658 ms (00:01.354)


power8 / gcc 4.8

master:
Time: 559.375 ms
Time: 7336.079 ms (00:07.336)
Time: 5049.222 ms (00:05.049)

0001 only:
Time: 559.466 ms
Time: 7272.719 ms (00:07.273)
Time: 4878.817 ms (00:04.879)

0002 only:
Time: 555.255 ms
Time: 6061.435 ms (00:06.061)
Time: 4440.829 ms (00:04.441)

0001-0002:
Time: 555.144 ms
Time: 5378.270 ms (00:05.378)
Time: 5745.331 ms (00:05.745)

0001-0003:
Time: 554.707 ms
Time: 2189.187 ms (00:02.189)
Time: 2193.870 ms (00:02.194)


comet lake / gcc 11

master:
Time: 383.196 ms
Time: 2323.875 ms (00:02.324)
Time: 1866.594 ms (00:01.867)

0001 only:
Time: 379.193 ms
Time: 2046.540 ms (00:02.047)
Time: 1720.601 ms (00:01.721)

0002 only:
Time: 374.072 ms
Time: 1769.163 ms (00:01.769)
Time: 1485.612 ms (00:01.486)

0001-0002:
Time: 374.454 ms
Time: 1208.933 ms (00:01.209)
Time: 1018.236 ms (00:01.018)

0001-0003:
Time: 376.698 ms
Time: 1278.807 ms (00:01.279)
Time: 1177.005 ms (00:01.177)


3-year old core i5 / gcc 11

master:
Time: 378.955 ms
Time: 2329.763 ms (00:02.330)
Time: 1891.931 ms (00:01.892)

0001 only:
Time: 384.547 ms
Time: 2342.837 ms (00:02.343)
Time: 1906.436 ms (00:01.906)

0002 only:
Time: 379.096 ms
Time: 2051.755 ms (00:02.052)
Time: 1641.376 ms (00:01.641)

0001-0002:
Time: 377.758 ms
Time: 1293.624 ms (00:01.294)
Time: 1096.991 ms (00:01.097)

0001-0003:
Time: 377.767 ms
Time: 2052.826 ms (00:02.053)
Time: 1329.185 ms (00:01.329)


Re: do only critical work during single-user vacuum?

2022-02-04 Thread John Naylor
On Thu, Feb 3, 2022 at 7:30 PM Robert Haas  wrote:
>
> That error comes from GetNewTransactionId(), so that function must
> either try to execute DML or do something else which causes an XID to
> be assigned. I think a plain SELECT should work just fine.

It was indeed doing writes, so that much is not a surprise anymore.

On Thu, Feb 3, 2022 at 9:08 PM Robert Haas  wrote:
>
> On Thu, Feb 3, 2022 at 8:35 PM Andres Freund  wrote:
> > Yea, I'd have no problem leaving the "hard" limit somewhere closer to 1
> > million (although 100k should be just as well), but introduce a softer "only
> > vacuum/drop/truncate" limit a good bit before that.
>
> +1.

Since there seems to be agreement on this, I can attempt a stab at it,
but it'll be another week before I can do so.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-03 Thread John Naylor
On Thu, Feb 3, 2022 at 3:14 AM Masahiko Sawada  wrote:

> +  The only other option that may be combined with
> VERBOSE, although in single-user mode no client
> messages are
> +  output.
>
> Given VERBOSE with EMERGENCY can work only in multi-user mode, why
> only VERBOSE can be specified with EMERGENCY? I think the same is true
> for other options like PARALLEL; PARALLEL can work only in multi-user
> mode.

You are right; it makes sense to allow options that would be turned
off automatically in single-user mode. Even if we don't expect it to
be used in normal mode, the restrictions should make sense. Also,
maybe documenting the allowed combinations is a distraction in the
main entry and should be put in the notes at the bottom.

> +  It performs a database-wide vacuum on tables, toast tables, and
> materialized views whose
> +  xid age or mxid age is older than 1 billion.
>
> Do we need to allow the user to specify the threshold or need a higher
> value (at least larger than 1.6 billion, default value of
> vacuum_failsafe_age)? I imagined a case where there are a few very-old
> tables (say 2 billion old) and many tables that are older than 1
> billion. In this case, VACUUM (EMERGENCY) would take a long time to
> complete.

I still don't think fine-tuning is helpful here. Shutdown vacuum
should be just as trivial to run as it is now, but with better
behavior. I believe a user knowledgeable enough to come up with the
best number is unlikely to get in this situation in the first place.
I'm also not sure a production support engineer would (or should)
immediately figure out a better number than a good default. That said,
the 1 billion figure was a suggestion from Peter G. upthread, and a
higher number could be argued.

> But to minimize the downtime, we might want to run VACUUM
> (EMERGENCY) on only the very-old tables, start the cluster in
> multi-user mode, and run vacuum on multiple tables in parallel.

That's exactly the idea. Also, back in normal mode, we can start
streaming WAL again. However, we don't want to go back online so close
to the limit that we risk shutdown again. People have a reasonable
expectation that if you fix an emergency, it's now fixed and the
application can go back online. Falling down repeatedly, or worrying
if it's possible, is very bad.

> +   if (params->options & VACOPT_EMERGENCY)
> +   {
> +   /*
> +   * Only consider relations able to hold unfrozen XIDs (anything 
> else
> +   * should have InvalidTransactionId in relfrozenxid anyway).
> +   */
> +   if (classForm->relkind != RELKIND_RELATION &&
> +   classForm->relkind != RELKIND_MATVIEW &&
> +   classForm->relkind != RELKIND_TOASTVALUE)
> +   {
> +   Assert(!TransactionIdIsValid(classForm->relfrozenxid));
> +   Assert(!MultiXactIdIsValid(classForm->relminmxid));
> +   continue;
> +   }
> +
> +   table_xid_age = DirectFunctionCall1(xid_age,
> classForm->relfrozenxid);
> +   table_mxid_age = DirectFunctionCall1(mxid_age,
> classForm->relminmxid);
> +
>
> I think that instead of calling xid_age and mxid_age for each
> relation, we can compute the thresholds for xid and mxid once, and
> then compare them to relation's relfrozenxid and relminmxid.

That sounds like a good idea if it's simple to implement, so I will
try it. If it adds complexity, I don't think it's worth it. Scanning a
few thousand rows in pg_class along with the function calls is tiny
compared to the actual vacuum work.
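A runnable sketch of what that would look like (simplified from PostgreSQL's real TransactionIdPrecedes machinery; function names here are hypothetical): XIDs live on a 2^32 circle, so "older than" is a signed 32-bit difference, and a single cutoff XID computed up front replaces a per-row call to xid_age(). Real code would also have to skip the special XIDs below FirstNormalTransactionId, which is omitted here.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* True if id1 is logically older than id2 on the XID circle. */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) < 0;
}

/* Compute the cutoff once, before scanning pg_class. */
static TransactionId
emergency_cutoff(TransactionId next_xid, uint32_t max_age)
{
	return next_xid - max_age;
}

/*
 * Per relation, the check then becomes just:
 *     if (xid_precedes(relfrozenxid, cutoff)) ... vacuum it ...
 * instead of a DirectFunctionCall1(xid_age, ...) per row.
 */
```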

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-03 Thread John Naylor
On Thu, Feb 3, 2022 at 1:06 PM Robert Haas  wrote:
>
> On Thu, Dec 9, 2021 at 8:56 PM Andres Freund  wrote:
> > I think we should move *away* from single user mode, rather than the
> > opposite. It's a substantial code burden and it's hard to use.
>
> Yes. This thread seems to be largely devoted to the topic of making
> single-user vacuum work better, but I don't see anyone asking the
> question "why do we have a message that tells people to vacuum in
> single user mode in the first place?". It's basically bad advice, with
> one small exception that I'll talk about in a minute.

The word "advice" sounds like people have a choice, rather than the
system not accepting commands anymore. It would be much less painful
if the system closed connections and forbade all but superusers to
connect, but that sounds like a lot of work. (happy to be proven
otherwise)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-03 Thread John Naylor
Thinking further about the use of emergency mode, we have this:

"If for some reason autovacuum fails to clear old XIDs from a table,
the system will begin to emit warning messages like this when the
database's oldest XIDs reach forty million transactions from the
wraparound point:

WARNING:  database "mydb" must be vacuumed within 39985967 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in
that database.
"

It seems people tend not to see these warnings if they didn't already
have some kind of monitoring which would prevent them from getting
here in the first place. But if they do, the hint should mention the
emergency option here, too. This puts Justin's idea upthread in a new
light -- if the admin does notice this warning, then emergency mode
should indeed vacuum the oldest tables first, since autovacuum is not
(yet) smart enough to do that. I'll pursue that as a follow-up.
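The table-selection query driving that ordering could look roughly like this (a sketch only; the 1-billion cutoff is the hardcoded value from the patch):

```sql
-- Find relations whose xid or mxid age exceeds the emergency cutoff,
-- oldest first, so the most urgent tables are vacuumed before the
-- remaining safety margin runs out.
SELECT c.oid::regclass AS relation,
       age(c.relfrozenxid) AS xid_age,
       mxid_age(c.relminmxid) AS mxid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')        -- plain tables, matviews, toast
  AND (age(c.relfrozenxid) > 1000000000
       OR mxid_age(c.relminmxid) > 1000000000)
ORDER BY greatest(age(c.relfrozenxid), mxid_age(c.relminmxid)) DESC;
```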

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-03 Thread John Naylor
On Thu, Feb 3, 2022 at 4:58 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-03 16:18:27 -0500, John Naylor wrote:
> > I just checked some client case notes where they tried just that
> > before getting outside help, and both SELECT and VACUUM FREEZE
> > commands were rejected.
>
> What kind of SELECT was that? Any chance it caused a write via functions, a
> view, whatnot? And what version? What was the exact error message?

Looking closer, there is a function defined by an extension. I'd have
to dig further to see if writes happen. The error is exactly what
we've been talking about:

2022-01-03 22:03:23 PST ERROR: database is not accepting commands to
avoid wraparound data loss in database ""
2022-01-03 22:03:23 PST HINT: Stop the postmaster and vacuum that
database in single-user mode. You might also need to commit or roll
back old prepared transactions.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-02-03 Thread John Naylor
On Thu, Feb 3, 2022 at 1:42 PM Robert Haas  wrote:
>
> On Thu, Feb 3, 2022 at 1:34 PM John Naylor  
> wrote:
> > The word "advice" sounds like people have a choice, rather than the
> > system not accepting commands anymore. It would be much less painful
> > if the system closed connections and forbade all but superusers to
> > connect, but that sounds like a lot of work. (happy to be proven
> > otherwise)
>
> They *do* have a choice. They can continue to operate the system in
> multi-user mode, they can have read access to their data, and they can
> run VACUUM and other non-XID-allocating commands to fix the issue.
> Sure, their application can't run commands that allocate XIDs, but
> it's not going to be able to do that if they go to single-user mode
> either.

I just checked some client case notes where they tried just that
before getting outside help, and both SELECT and VACUUM FREEZE
commands were rejected. The failure is clearly indicated in the log.
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Stats collector's idx_blks_hit value is highly misleading in practice

2022-02-03 Thread John Naylor
On Fri, Oct 30, 2020 at 9:46 PM Tomas Vondra 
wrote:
>
> On Fri, Oct 16, 2020 at 03:35:51PM -0700, Peter Geoghegan wrote:

> >I suppose that a change like this could end up affecting other things,
> >such as EXPLAIN ANALYZE statistics. OTOH we only break out index pages
> >separately for bitmap scans at the moment, so maybe it could be fairly
> >well targeted. And, maybe this is unappealing given the current
> >statistics collector limitations. I'm not volunteering to work on it
> >right now, but it would be nice to fix this. Please don't wait for me
> >to do it.
> >
>
> It seems to me this should not be a particularly difficult patch in
> principle, so suitable for new contributors. The main challenge would be
> passing information about what page we're dealing with (internal/leaf)
> to the place actually calling pgstat_count_buffer_(read|hit). That
> happens in ReadBufferExtended, which just has no idea what page it's
> dealing with. Not sure how to do that cleanly ...

Is this a TODO candidate? What would be a succinct title for it?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Stats collector's idx_blks_hit value is highly misleading in practice

2022-02-04 Thread John Naylor
On Fri, Feb 4, 2022 at 11:19 AM Peter Geoghegan  wrote:
>
> On Thu, Feb 3, 2022 at 7:08 PM John Naylor  
> wrote:
> > Is this a TODO candidate? What would be a succinct title for it?
>
> I definitely think that it's worth working on. I suppose it follows
> that it should go on the TODO list.

Added TODO item "Teach stats collector to differentiate between
internal and leaf index pages"

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

2022-02-04 Thread John Naylor
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan  wrote:
>
> On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan  wrote:
> > I do see some value in that, too. Though it's not going to be a way of
> > turning off the early freezing stuff, which seems unnecessary (though
> > I do still have work to do on getting the overhead for that down).
>
> Attached is v7, a revision that overhauls the algorithm that decides
> what to freeze. I'm now calling it block-driven freezing in the commit
> message. Also included is a new patch, that makes VACUUM record zero
> free space in the FSM for an all-visible page, unless the total amount
> of free space happens to be greater than one half of BLCKSZ.
>
> The fact that I am now including this new FSM patch (v7-0006-*patch)
> may seem like a case of expanding the scope of something that could
> well do without it. But hear me out! It's true that the new FSM patch
> isn't essential. I'm including it now because it seems relevant to the
> approach taken with block-driven freezing -- it may even make my
> general approach easier to understand.

Without having looked at the latest patches, there was something in
the back of my mind while following the discussion upthread -- the
proposed opportunistic freezing made a lot more sense if the
earlier-proposed open/closed pages concept was already available.

> Freezing whole pages
> 

> It's possible that a higher cutoff (for example a cutoff of 80% of
> BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
> addition to the downsides from fragmentation -- it's far from a simple
> trade-off. (Not that you should believe that 50% is special, it's just
> a starting point for me.)

How was the space utilization with the 50% cutoff in the TPC-C test?

> TPC-C raw numbers
> =
>
> The single most important number for the patch might be the decrease
> in both buffer misses and buffer hits, which I believe is caused by
> the patch being able to use index-only scans much more effectively
> (with modifications to BenchmarkSQL to improve the indexing strategy
> [1]). This is quite clear from pg_stat_database state at the end.
>
> Patch:

> blks_hit | 174,551,067,731
> tup_fetched  | 124,797,772,450

> Here is the same pg_stat_database info for master:

> blks_hit | 283,015,966,386
> tup_fetched  | 237,052,965,901

That's impressive.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-01-27 Thread John Naylor
Hi,

I've run a few tests to get some feel for the effects of various
comparators on Datums containing int32. I've attached the full
results, as well as the (messy) patch which applies on top of 0012 to
run the tests. I'll excerpt some of those results as I go through them
here. For now, I only ran input orders of sorted, random, and
reversed.

1) Specializing

This is a win in all cases, including SQL-callable comparators (the
case here is for _bt_sort_array_elements).

NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=2,
time=0.140526
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=0, time=0.085023

NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=2, time=0.256708
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=0,
time=0.192063

2) Branchless operations

The int case is for how to perform the comparison, and the SQL case is
referring to how to reverse the sort order. Surprisingly, they don't
seem to help for direct comparisons, and in fact they seem worse. I'll
have to dig a bit deeper to be sure, but it's not looking good now.

NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=2, time=0.084781
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=0,
time=0.091837

NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=2,
time=0.192018
NOTICE:  [SQL inlined reverse] size=8MB, order=random,
cmp=SQL-inline-rev, test=0, time=0.190797

When the effect is reversing a list, the direct comparisons seem much
worse, and the SQL ones aren't helped.

NOTICE:  [inlined] size=8MB, order=decreasing, cmp=inline, test=2, time=0.024963
NOTICE:  [branchless] size=8MB, order=decreasing, cmp=branchless,
test=0, time=0.036423

NOTICE:  [SQL inlined] size=8MB, order=decreasing, cmp=SQL-inline,
test=0, time=0.125182
NOTICE:  [SQL inlined reverse] size=8MB, order=increasing,
cmp=SQL-inline-rev, test=0, time=0.127051

--
Since I have a couple more planned tests, I'll keep a running tally on
the current state of the patch set so that summaries are not scattered
over many emails:

0001 - bsearch and unique is good to have, and we can keep the return
type pending further tests
0002/3 - I've yet to see a case where branchless comparators win, but
other than that, these are good. Notational improvement and not
performance sensitive.

0004/5 - Computing the arguments slows it down, but accessing the
underlying int16s gives an improvement. [1] Haven't done an in-situ
test on VACUUM. Could be worth it for pg15, since I imagine the
proposals for dead tuple storage won't be ready this cycle.
0006 - I expect this to be slower too. I also wonder if this could
also use the global function in 0004 once it's improved.

0007 - untested

0008 - Good performance in microbenchmarks, no in-situ testing.
Inlined reversal is not worth the binary space or notational overhead.

0009 - Based on 0004, I would guess that computing the arguments is
too slow. Not sure how to test in-situ to see if specializing helps.

0010 - Thresholds on my TODO list.

0011 - A simple correction -- I'll go ahead and commit this.

v3-0001 comparators for abbreviated keys - Clearly a win, especially
for the "unsigned" case [2]. There are still possible improvements,
but they seem like pg16 projects.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BS5SMoG8Z2PHj0bsK70CxVLgqQR1orQJq6Cjgibu26vA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFBsxsEFGAJ9eBpQVb5a86BE93WER3497zn2OT5wbjm1HHcqgA%40mail.gmail.com
(I just realized in that message I didn't attach the script for that,
and also attached an extra draft spreadsheet. I'll improve the tests
and rerun later)

-- 
John Naylor
EDB: http://www.enterprisedb.com
NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=0, 
time=0.144545
NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=1, 
time=0.140534
NOTICE:  [traditional qsort] size=8MB, order=random, cmp=arg, test=2, 
time=0.140526
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=0, time=0.085023
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=1, time=0.086370
NOTICE:  [inlined] size=8MB, order=random, cmp=inline, test=2, time=0.084781
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=0, 
time=0.091837
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=1, 
time=0.090426
NOTICE:  [branchless] size=8MB, order=random, cmp=branchless, test=2, 
time=0.091245
NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=0, time=0.257024
NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=1, time=0.256487
NOTICE:  [SQL arg] size=8MB, order=random, cmp=SQL-arg, test=2, time=0.256708
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=0, 
time=0.192063
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=1, 
time=0.191919
NOTICE:  [SQL inlined] size=8MB, order=random, cmp=SQL-inline, test=2, 
time=0.192018
NOTICE:  [SQL arg reverse

Re: A qsort template

2022-01-31 Thread John Naylor
I wrote:

> 0010 - Thresholds on my TODO list.

I did some basic tests on the insertion sort thresholds, and it looks
like we could safely and profitably increase the current value from 7
to 20 or so, in line with other more recent implementations. I've
attached an addendum on top of 0012 and the full test results on an
Intel Coffee Lake machine with gcc 11.1. I found that the object test
setup in 0012 had some kind of bug that was comparing the pointer of
the object array. Rather than fix that, I decided to use Datums, but
with the two extremes in comparator: simple branching with machine
instructions vs. a SQL-callable function. The papers I've read
indicate the results for Datum sizes would not be much different for
small structs. The largest existing sort element is SortTuple, but
that's only 24 bytes and has a bulky comparator as well.

The first thing to note is that I rejected outright any testing of a
"middle value" where the pivot is simply the middle of the array. Even
the Bentley and McIlroy paper which is the reference for our
implementation says "The range that consists of the single integer 7
could be eliminated, but has been left adjustable because on some
machines larger ranges are a few percent better".

I tested thresholds up to 64, which is where I guessed results to get
worse (most implementations are smaller than that). Here are the best
thresholds at a quick glance:

- elementary comparator:

random: 16 or greater
decreasing, rotate: get noticeably better all the way up to 64
organ: little difference, but seems to get better all the way up to 64
0/1: seems to get worse above 20

- SQL-callable comparator:

random: between 12 and 20, but slight differences until 32
decreasing, rotate: get noticeably better all the way up to 64
organ: seems best at 12, but slight differences until 32
0/1: slight differences

Based on these tests and this machine, it seems 20 is a good default
value. I'll repeat this test on one older Intel and one non-Intel
platform with older compilers.
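To make the knob concrete, here is a toy standalone sketch (not the actual
sort_template.h code; all names here are made up for illustration) of a
quicksort that drops to insertion sort below a cutoff -- this is the shape
of the thing being tuned:

```c
#include <stddef.h>

static void
swap_int(int *a, int *b)
{
	int		t = *a;

	*a = *b;
	*b = t;
}

static void
insertion_sort(int *a, size_t n)
{
	for (size_t i = 1; i < n; i++)
		for (size_t j = i; j > 0 && a[j - 1] > a[j]; j--)
			swap_int(&a[j - 1], &a[j]);
}

/* threshold must be >= 3 so median-of-three has room to work */
static void
qsort_threshold(int *a, size_t n, size_t threshold)
{
	while (n >= threshold)
	{
		size_t	mid = n / 2;
		size_t	i = 0,
				j = n - 1;
		int		pivot;

		/* median-of-three pivot selection */
		if (a[0] > a[mid])
			swap_int(&a[0], &a[mid]);
		if (a[mid] > a[n - 1])
			swap_int(&a[mid], &a[n - 1]);
		if (a[0] > a[mid])
			swap_int(&a[0], &a[mid]);
		pivot = a[mid];

		/* Hoare partition around the pivot value */
		for (;;)
		{
			while (a[i] < pivot)
				i++;
			while (a[j] > pivot)
				j--;
			if (i >= j)
				break;
			swap_int(&a[i], &a[j]);
			i++;
			j--;
		}

		/* recurse into the smaller side, loop on the larger */
		if (j + 1 <= n - j - 1)
		{
			qsort_threshold(a, j + 1, threshold);
			a += j + 1;
			n -= j + 1;
		}
		else
		{
			qsort_threshold(a + j + 1, n - j - 1, threshold);
			n = j + 1;
		}
	}
	insertion_sort(a, n);
}
```

Raising the threshold trades a few more insertion-sort comparisons for
fewer partitioning passes and better branch behavior on short runs, which
is why the sweet spot moves with hardware.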

--
Running tally of patchset:

0001 - bsearch and unique is good to have, and we can keep the return
type pending further tests -- if none happen this cycle, suggest
committing this without the return type symbol.
0002/3 - I've yet to see a case where branchless comparators win, but
other than that, these are good. Notational improvement and not
performance sensitive.

0004/5 - Computing the arguments slows it down, but accessing the
underlying int16s gives an improvement. [1] Haven't done an in-situ
test on VACUUM. Could be worth it for pg15, since I imagine the
proposals for dead tuple storage won't be ready this cycle.
0006 - I expect this to be slower too. I also wonder if this could
also use the global function in 0004 once it's improved.

0007 - untested

0008 - Good performance in microbenchmarks, no in-situ testing.
Inlined reversal is not worth the binary space or notational overhead.

0009 - Based on 0004, I would guess that computing the arguments is
too slow. Not sure how to test in-situ to see if specializing helps.

0010 - Suggest leaving out the middle threshold and setting the
insertion sort threshold to ~20. Might also name them
ST_INSERTION_SORT_THRESHOLD and ST_NINTHER_THRESHOLD. (TODO: test on
other platforms)

0011 - Committed.

v3-0001 comparators for abbreviated keys - Clearly a win in this state
already, especially
for the "unsigned" case [2]. (gist untested) There are additional
possible improvements mentioned,
but they seem like a PG16 project(s).

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BS5SMoG8Z2PHj0bsK70CxVLgqQR1orQJq6Cjgibu26vA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAFBsxsEFGAJ9eBpQVb5a86BE93WER3497zn2OT5wbjm1HHcqgA%40mail.gmail.com
(TODO: refine test)

-- 
John Naylor
EDB: http://www.enterprisedb.com
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=0, time=0.084503
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=1, time=0.087980
NOTICE:  [direct] size=8MB, order=random, threshold=7, test=2, time=0.084299
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=0, time=0.081893
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=1, time=0.081907
NOTICE:  [direct] size=8MB, order=random, threshold=12, test=2, time=0.081943
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=0, time=0.080810
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=1, time=0.080793
NOTICE:  [direct] size=8MB, order=random, threshold=16, test=2, time=0.080922
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=0, time=0.080399
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=1, time=0.080433
NOTICE:  [direct] size=8MB, order=random, threshold=20, test=2, time=0.080413
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=0, time=0.080342
NOTICE:  [direct] size=8MB, order=random, threshold=24, test=1, time=0.080446
NOTICE:  [direct] size=8MB, order=random, threshold

Re: do only critical work during single-user vacuum?

2022-01-14 Thread John Naylor
I see a CF entry has been created already, and the cfbot doesn't like
my PoC. To prevent confusion, I've taken the liberty of switching the
author to myself and set to Waiting on Author. FWIW, my local build
passed make check-world after applying Justin's fix and changing a
couple other things.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-01-18 Thread John Naylor
I wrote:

> That said, I think what I'll do next is test the v3-0001 standalone
> patch with tuplesort specializations for more data types. I already
> have a decent test script that I can build on for this.

I've run a test with 10 million records using all types found in the
v3 patch "accelerate tuple sorting for common types", using a variety
of initial orderings, covering index build (btree only, no gist) and
queries (both single value and whole record). Attached is the test
script and a spreadsheet with the raw data as well as comparisons of
the min runtimes in seconds from 5 runs. This is using gcc 11.1 on
fairly recent Intel hardware.

Overall, this shows a good improvement for these types. One exception
is the "0/1" ordering, which is two values in random order. I'm
guessing it's because of the cardinality detector, but some runs have
what look like regressions. The differences are a bit large and sporadic
to dismiss as noise, but this case might not be telling us anything
useful.

Other notes:

- The inet type seems unnaturally fast in some places, meaning faster
than int or date. That's suspicious, but I haven't yet dug deeper into
that.

- With the patch, the VM binary size increases by ~9kB.

I have some hunches on the "future research" comments:

XXX Can we avoid repeating the null-handling logic?

More templating? ;-)

XXX Is it worth specializing for reverse sort?

I'll run a limited test on DESC to see if anything stands out, but I
wonder if the use case is not common -- I seem to remember seeing DESC
less often on the first sort key column.

XXX Is it worth specializing for nulls first, nulls last, not null?

Editorializing the null position in queries is not very common in my
experience. Not null is interesting since it'd be trivial to pass
constant false to the same Apply[XYZ]SortComparator() and let the
compiler remove all those branches for us. On the other hand, those
branches would be otherwise predicted well, so it might make little or
no difference.
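The constant-propagation idea can be sketched like so. This is a
simplified stand-in, not the real sortsupport code -- the names
apply_int_comparator/compare_not_null are made up, though
ApplySortComparator() in sortsupport.h works along these lines:

```c
#include <stdbool.h>
#include <stdint.h>

static inline int
apply_int_comparator(int64_t a, bool a_isnull,
					 int64_t b, bool b_isnull,
					 bool nulls_first)
{
	if (a_isnull)
	{
		if (b_isnull)
			return 0;
		return nulls_first ? -1 : 1;
	}
	if (b_isnull)
		return nulls_first ? 1 : -1;
	return (a > b) - (a < b);
}

/*
 * For a known NOT NULL column, passing constant false lets the compiler
 * delete every null branch above after inlining.
 */
static int
compare_not_null(int64_t a, int64_t b)
{
	return apply_int_comparator(a, false, b, false, /* nulls_first */ false);
}
```

Whether that's worth a specialization is exactly the open question: the
deleted branches are well predicted anyway.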

XXX Should we have separate cases for "result is authoritative", "need
XXX tiebreaker for atts 1..n (= abbrev case)", "need tie breaker for
XXX atts 2..n"?

The first one seems to be the only case where the SortTuple could be
smaller, since the tuple pointer is null. That sounds like a good
avenue to explore. Less memory usage is always good.

Not sure what you mean by the third case -- there are 2+ sort keys,
but the first is authoritative from the datum, so the full comparison
can skip the first key?

-- 
John Naylor
EDB: http://www.enterprisedb.com


qsort-specialize-types-jcn1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


test-tuplesort-20220113.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: do only critical work during single-user vacuum?

2022-01-21 Thread John Naylor
On Wed, Jan 19, 2022 at 5:26 PM Michael Paquier  wrote:
>
> Could you avoid introducing a new grammar pattern in VACUUM?  Any new
> option had better be within the parenthesized part as it is extensible
> at will with its set of DefElems.

This new behavior is not an option that one can sensibly mix with
other options as the user sees fit, but rather hard-codes the
parameters for its single purpose. That said, I do understand your
objection.

[*thinks*]

How about the attached patch (and test script)? It still needs polish,
but it could work. It allows "verbose" to coexist, although that's
really only for testing normal mode. While testing in single-user
mode, I was sad to find out that it not only doesn't emit messages
(not a client), but also doesn't log. That would have been a decent
way to monitor progress...

In this form, I'm no longer a fan of calling the option "wraparound",
because it's too close to the "is_wraparound" param member.
Internally, at least, we can use "emergency" or "minimal". (In fact
the bit symbol is VACOPT_MINIMAL for this draft). That can be worked
out later.

On Fri, Jan 21, 2022 at 12:59 AM Masahiko Sawada  wrote:
>
> The purpose of this thread is to provide a way for users to run vacuum
> only very old tables (while skipping index cleanup, etc.),

Ah, thank you Sawada-san, now I understand why we have been talking
past each other. The purpose is actually:

- to have a simple, easy to type, command
- intended for single-user mode, but not limited to it (so it's easy to test)
- to get out of single user mode as quickly as possible

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 283ffaea77..6183f412d3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -52,6 +52,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -114,6 +115,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	bool		full = false;
 	bool		disable_page_skipping = false;
 	bool		process_toast = true;
+	bool		wraparound = false;
 	ListCell   *lc;
 
 	/* index_cleanup and truncate values unspecified for now */
@@ -200,6 +202,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	params.nworkers = nworkers;
 			}
 		}
+		else if (strcmp(opt->defname, "wraparound") == 0)
+			wraparound = defGetBoolean(opt);
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
@@ -246,17 +250,51 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		}
 	}
 
+	if (wraparound)
+	{
+		/* exclude incompatible options */
+		foreach(lc, vacstmt->options)
+		{
+			DefElem*opt = (DefElem *) lfirst(lc);
+
+			// WIP is there a better way?
+			if (strcmp(opt->defname, "wraparound") != 0 &&
+				strcmp(opt->defname, "verbose") != 0 &&
+				defGetBoolean(opt))
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("option \"%s\" is incompatible with WRAPAROUND", opt->defname),
+						 parser_errposition(pstate, opt->location)));
+		}
+
+		/* skip unnecessary work, as in failsafe mode */
+		params.index_cleanup = VACOPTVALUE_DISABLED;
+		params.truncate = VACOPTVALUE_DISABLED;
+	}
+
 	/*
-	 * All freeze ages are zero if the FREEZE option is given; otherwise pass
-	 * them as -1 which means to use the default values.
+	 * Set freeze ages to zero where appropriate; otherwise pass
+	 * them as -1 which means to use the configured values.
 	 */
 	if (params.options & VACOPT_FREEZE)
 	{
+		/* All freeze ages are zero if the FREEZE option is given */
 		params.freeze_min_age = 0;
 		params.freeze_table_age = 0;
 		params.multixact_freeze_min_age = 0;
 		params.multixact_freeze_table_age = 0;
 	}
+	else if (params.options & VACOPT_MINIMAL)
+	{
+		/* it's unlikely any selected table will not be eligible for aggressive vacuum, but make sure */
+		params.freeze_table_age = 0;
+		params.multixact_freeze_table_age = 0;
+
+		// WIP: It might be worth trying to do less work here, or at least hard-coding the default values
+		params.freeze_min_age = -1;
+		params.multixact_freeze_min_age = -1;
+	}
 	else
 	{
 		params.freeze_min_age = -1;
@@ -894,6 +932,8 @@ get_all_vacuum_rels(int options)
 	Relation	pgclass;
 	TableScanDesc scan;
 	HeapTuple	tuple;
+	int32 		table_xid_age,
+				table_mxid_age;
 
 	pgclass = table_open(RelationRelationId, AccessShareLock);
 
@@ -909,12 +949,42 @@ get_all_vacuum_rels(int options)
 		if (!vacuum_is_relation_owner(relid, classForm, options))
 			continue;
 
+		if (options & VACOPT_MINIMAL)
+		{
+			/*
+			* Only consider relations

Re: Time to increase hash_mem_multiplier default?

2022-01-19 Thread John Naylor
On Sun, Jan 16, 2022 at 7:28 PM Peter Geoghegan  wrote:
>
> The current hash_mem_multiplier default is 1.0, which is a fairly
> conservative default: it preserves the historic behavior, which is
> that hash-based executor nodes receive the same work_mem budget as
> sort-based nodes. I propose that the default be increased to 2.0 for
> Postgres 15.

I don't have anything really profound to say here, but in the last
year I did on a couple occasions recommend clients to raise
hash_mem_multiplier to 2.0 to fix performance problems.

During this cycle, we also got a small speedup in the external sorting
code. Also, if the "generation context" idea gets traction, that might
be another reason to consider differentiating the mem settings.
--
John Naylor
EDB: http://www.enterprisedb.com




Re: speed up text_position() for utf-8

2022-01-19 Thread John Naylor
I wrote:

> 0001 puts the main implementation of pg_utf_mblen() into an inline
> function and uses this in pg_mblen(). This is somewhat faster in the
> strpos tests, so that gives some measure of the speedup expected for
> other callers. Text search seems to call this a lot, so this might
> have noticeable benefit.
>
> 0002 refactors text_position_get_match_pos() to use
> pg_mbstrlen_with_len(). This itself is significantly faster when
> combined with 0001, likely because the latter can inline the call to
> pg_mblen(). The intention is to speed up more than just text_position.
>
> 0003 explicitly specializes for the inline version of pg_utf_mblen()
> into pg_mbstrlen_with_len(), but turns out to be almost as slow as
> master for ascii. It doesn't help if I undo the previous change in
> pg_mblen(), and I haven't investigated why yet.
>
> 0002 looks good now, but the experience with 0003 makes me hesitant to
> propose this seriously until I can figure out what's going on there.
>
> The test is as earlier, a worst-case substring search, times in milliseconds.
>
>  patch  | no match | ascii | multibyte
> +--+---+---
>  PG11   | 1220 |  1220 |  1150
>  master |  385 |  2420 |  1980
>  0001   |  390 |  2180 |  1670
>  0002   |  389 |  1330 |  1100
>  0003   |  391 |  2100 |  1360

I tried this test on a newer CPU, and 0003 had no regression. Both
systems used gcc 11.2. Rather than try to figure out why an experiment
had unexpected behavior, I plan to test 0001 and 0002 on a couple
different compilers/architectures and call it a day. It's also worth
noting that 0002 by itself seemed to be decently faster on the newer
machine, but not as fast as 0001 and 0002 together.

Looking at the assembly, pg_mblen is inlined into
pg_mbstrlen_[with_len] and pg_mbcliplen, so the specialization for
utf-8 in 0001 would be inlined in the other 3 as well. That's only a
few bytes, so I think it's fine.
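The reason pg_utf_mblen() inlines so cheaply is that the sequence length
is fully determined by the first byte. A standalone sketch of that
dispatch (utf8_seq_len is a made-up name; pg_utf_mblen's exact handling
of invalid lead bytes differs slightly, and validation happens
elsewhere):

```c
#include <stdint.h>

static inline int
utf8_seq_len(uint8_t first)
{
	if (first < 0x80)
		return 1;				/* 0xxxxxxx: ASCII */
	if (first < 0xC0)
		return 1;				/* continuation byte; caller validates */
	if (first < 0xE0)
		return 2;				/* 110xxxxx */
	if (first < 0xF0)
		return 3;				/* 1110xxxx */
	return 4;					/* 11110xxx; >= 0xF8 is invalid and
								 * also left to validation */
}
```

A handful of compares on one byte is why inlining it into the
per-character loops of pg_mbstrlen_with_len() and friends pays off.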


--
John Naylor
EDB: http://www.enterprisedb.com




Re: do only critical work during single-user vacuum?

2022-01-19 Thread John Naylor
On Wed, Jan 19, 2022 at 12:46 AM Masahiko Sawada  wrote:
>
> On Fri, Jan 14, 2022 at 7:04 AM Bossart, Nathan  wrote:
> >
> > I guess I'm ultimately imagining the new options as replacing the
> > vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
> > behind the scenes (as would a new top-level command).
>
> I had the same idea.

This seems to be the motivating reason for wanting new configurability
on the server side. In any case, new knobs are out of scope for this
thread. If the use case is compelling enough, may I suggest starting a
new thread?

Regarding the thread subject, I've been playing with the grammar, and
found it's quite easy to have

VACUUM FOR WRAPAROUND
or
VACUUM FOR EMERGENCY

since FOR is a reserved word (and following that can be an IDENT plus
a strcmp check) and cannot conflict with table names. This sounds a
bit more natural than VACUUM LIMIT. Opinions?

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-01-19 Thread John Naylor
On Tue, Jan 18, 2022 at 9:58 PM Peter Geoghegan  wrote:
>
> On Tue, Jan 18, 2022 at 6:39 PM John Naylor
>  wrote:
> > Editorializing the null position in queries is not very common in my
> > experience. Not null is interesting since it'd be trivial to pass
> > constant false to the same Apply[XYZ]SortComparator() and let the
> > compiler remove all those branches for us. On the other hand, those
> > branches would be otherwise predicted well, so it might make little or
> > no difference.
>
> If you were going to do this, maybe you could encode NULL directly in
> an abbreviated key. I think that that could be made to work if it was
> limited to opclasses with abbreviated keys encoded as unsigned
> integers. Just a thought.

Now that you mention that, I do remember reading about this technique
in the context of b-tree access, so it does make sense. If we had that
capability, it would be trivial to order the nulls how we want while
building the sort tuple datums, and the not-null case would be handled
automatically. And have a smaller code footprint, I think.
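For unsigned-integer abbreviated keys compared with plain <, the encoding
could look something like this rough sketch (abbrev_with_null is
hypothetical, not from any patch): sacrifice the low bit of resolution
and fold the NULL position into the key itself, here for NULLS FIRST.

```c
#include <stdbool.h>
#include <stdint.h>

static inline uint64_t
abbrev_with_null(uint64_t abbrev, bool isnull)
{
	if (isnull)
		return 0;				/* NULL sorts below every non-null key */

	/*
	 * Shift non-null keys into [1, 2^63], preserving their relative
	 * order. Adjacent keys can collide, but abbreviated keys already
	 * require an authoritative tiebreak comparison on equality.
	 */
	return (abbrev >> 1) + 1;
}
```

NULLS LAST would reserve the top of the range instead of the bottom.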

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: autovacuum prioritization

2022-01-25 Thread John Naylor
On Tue, Jan 25, 2022 at 2:30 PM Robert Haas  wrote:
>
> On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar  wrote:

> > I think we need to make different priority
> > queues based on different factors, for example 1 queue for wraparound
> > risk and another for bloat risk.
>
> I don't see why we want multiple queues. We have to answer the
> question "what should we do next?" which requires us, in some way, to
> funnel everything into a single prioritization.

I was thinking along the same lines as Dilip: If the anti-wraparound
risk is really far in the future, there might not be much eligible
freezing work to do. Dead tuples can be removed as soon as visibility
rules allow it. With a separate bloat queue, there might always be
some work to do. Maybe "bloat queue" is too specific, because
insert-only tables can use more vacuuming for the VM even if they have
not reached the configured threshold.

So a worker would check the wraparound queue, and if nothing's there
grab something from the other queue. Maybe low-priority work would
have a low cost limit.
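For concreteness, the worker-side selection could be shaped like the
following sketch. Nothing like this exists; every type, name, and number
here is hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct TaskQueue
{
	int		items[16];			/* table OIDs, say */
	size_t	count;
} TaskQueue;

static bool
queue_pop(TaskQueue *q, int *item)
{
	if (q->count == 0)
		return false;
	*item = q->items[--q->count];
	return true;
}

/*
 * Drain the wraparound queue first at the full cost budget; fall back
 * to bloat work at a reduced budget.
 */
static bool
next_vacuum_task(TaskQueue *wraparound, TaskQueue *bloat,
				 int *table, int *cost_limit)
{
	if (queue_pop(wraparound, table))
	{
		*cost_limit = 200;
		return true;
	}
	if (queue_pop(bloat, table))
	{
		*cost_limit = 50;		/* low-priority work, low cost limit */
		return true;
	}
	return false;
}
```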

Probably the best way to schedule, at least at first, is whatever is
least complex. I'm not yet sure what that is...
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-05 Thread John Naylor
On Tue, Apr 5, 2022 at 10:06 PM Robert Haas  wrote:
>
> On Tue, Apr 5, 2022 at 10:07 AM Tom Lane  wrote:
> > Yeah, the frontend error message rework in [1].  That has exactly
> > the same constraint that it's likely to break other open patches,
> > so it'd be better to do it after the CF cutoff.  I think that doing
> > that concurrently with Robert's thing shouldn't be too risky, because
> > it only affects frontend code while his patch should touch only backend.
>
> So when *exactly* do we want to land these patches? None of the
> calendar programs I use support "anywhere on earth" as a time zone,
> but I think that feature freeze is 8am on Friday on the East coast of
> the United States.

I understand it to be noon UTC on Friday.

> If I commit the PGDLLIMPORT change within a few
> hours after that time, is that good? Should I try to do it earlier,
> before we technically hit 8am? Should I do it the night before, last
> thing before I go to bed on Thursday? Do you care whether your commit
> or mine goes in first?

For these two patches, I'd say a day or two after feature freeze is a
reasonable goal.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-11 Thread John Naylor
On Mon, Apr 11, 2022 at 5:34 AM David Rowley  wrote:

> With this particular test, v15 is about 15% *slower* than v14.  I
> didn't know what to blame at first, so I tried commenting out the sort
> specialisations and got the results in the red bars in the graph. This
> made it about 7.5% *faster* than v14. So looks like this patch is to
> blame.  I then hacked the comparator function that's used in the
> specialisations for BIGINT to comment out the tiebreak to remove the
> indirect function call, which happens to do nothing in this 1 column
> sort case.  The aim here was to get an idea what the performance would
> be if there was a specialisation for single column sorts. That's the
> yellow bars, which show about 10% *faster* than master.

Thanks for investigating! (I assume you meant 10% faster than v14?)

On Mon, Apr 11, 2022 at 4:55 AM Peter Geoghegan  wrote:

> The B&M quicksort implementation that we adopted is generally
> extremely fast for that case, since it uses 3 way partitioning (based
> on the Dutch National Flag algorithm). This essentially makes sorting
> large groups of duplicates take only linear time (not linearithmic
> time).

In the below thread, I wondered if it still counts as extremely fast
nowadays. I hope to give an answer to that during next cycle. Relevant
to the open item, the paper linked there has a variety of
low-cardinality cases. I'll incorporate them in a round of tests soon.

https://www.postgresql.org/message-id/cafbsxshanjtsx9dnjppxjxwg3bu+yq6pnmsfpm0uvyuafdw...@mail.gmail.com
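For reference, the three-way partitioning step Peter describes (this is
the Dutch National Flag scheme in its plain form; B&M arrange it
differently, swapping equal keys to the array ends and back, but the
effect on duplicates is the same -- partition3 here is an illustrative
name):

```c
#include <stddef.h>

/*
 * Partition a[0..n) into < pivot, == pivot, > pivot. On return,
 * a[0..*eq_lo) < pivot, a[*eq_lo..*eq_hi) == pivot, a[*eq_hi..n) > pivot.
 */
static void
partition3(int *a, size_t n, int pivot, size_t *eq_lo, size_t *eq_hi)
{
	size_t	lo = 0,
			i = 0,
			hi = n;

	while (i < hi)
	{
		if (a[i] < pivot)
		{
			int		t = a[lo];

			a[lo] = a[i];
			a[i] = t;
			lo++;
			i++;
		}
		else if (a[i] > pivot)
		{
			int		t;

			hi--;
			t = a[i];
			a[i] = a[hi];
			a[hi] = t;
		}
		else
			i++;
	}
	*eq_lo = lo;
	*eq_hi = hi;
}
```

Since the equal range in the middle is already in place, recursion only
descends into the strict < and > parts -- a column of duplicates
partitions once and is done, hence the linear time.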

On Mon, Apr 11, 2022 at 4:44 AM Thomas Munro  wrote:

> Upthread we were discussing which variations it'd be worth investing
> extra text segment space on to gain speedup and we put those hard
> decisions off for future work, but on reflection, we probably should
> tackle this particular point to avoid a regression.  I think something
> like the attached achieves that (draft, not tested much yet, could
> perhaps find a tidier way to code the decision tree).  In short:
> variants qsort_tuple_{int32,signed,unsigned}() no longer fall back,
> but new variants qsort_tuple_{int32,signed,unsigned}_tiebreak() do.

Looks good at a glance, I will get some numbers after modifying my test scripts.

> We should perhaps also reconsider the other XXX comment about finding
> a way to skip the retest of column 1 in the tiebreak comparator.
> Perhaps you'd just install a different comparetup function, eg
> comparetup_index_btree_tail (which would sharing code), so no need to
> multiply specialisations for that.

If we need to add these cases to avoid regression, it makes sense to
make them work as well as we reasonably can.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-13 Thread John Naylor
As promised, I've done another round of tests (script and spreadsheet
attached) with

- v15 with 6974924347 and cc58eecc5d reverted
- v15 with Thomas' patch
- v15 with David's patch
- v15 as is ("std")

...where v15 is at 7b735f8b52ad. This time I limited it to int,
bigint, and text types.

Since more cases now use random distributions, I also took some
measures to tighten up the measurements:

- Reuse the same random distribution for all tests where the input is
randomized, by invoking the script with/without a second parameter
- For the text case, use lpadded ints so that lexicographic order is
the same as numeric order.

I verified David's mod100 test case and added most test cases from the
Orson Peters paper I mentioned above. I won't explain all of them
here, but the low cardinality ones are randomized sets of:

- mod8
- dupsq: x mod sqrt(n), for 10 million about 3 thousand distinct values
- dup8: (x**8 + n/2) mod n, for 10 million about 80 thousand distinct
values, about 80% with 64 duplicates and 20% with 256 duplicates

All the clear regressions I can see in v15 are in the above for one or
more query types / data types, and both Thomas and David's patches
restore performance for those.

Beyond the regression itself, Thomas's patch is very often the fastest
of all, at the cost of more binary size. David's is occasionally slower
than v15 or v15-with-revert, but mostly by slight differences, and some
of that is probably noise.

-- 
John Naylor
EDB: http://www.enterprisedb.com


qsort-fix-regression-jcn1.ods
Description: application/vnd.oasis.opendocument.spreadsheet


sort-bench-int-text-plus-low-card.sh
Description: application/shellscript


Re: A qsort template

2022-04-06 Thread John Naylor
Here is the updated insertion sort threshold patch based on Thomas'
experimental v4 0010, with adjusted regression test output. I only
found a couple places where it could make sense to add sort keys to
test queries, but 1) not enough to make a big difference and 2) the
adjustments looked out of place, so I decided to just update all the
regression tests in one go. Since the patch here is a bit more (and
less) involved than Thomas' 0010, I'm going to refrain from committing
until it gets review. If not in the next couple days, I will bring it
up at the beginning of the v16 cycle.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 74b934bc5ed8f6733c064c0ef832e1aa9949f216 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 6 Apr 2022 16:38:28 +0700
Subject: [PATCH v5] Raise qsort insertion sort threshold

Our qsort algorithm is from NetBSD and is described in the 1993 paper
"Engineering a Sort Function". It includes three hard coded thresholds that
control the point at which we drop to insertion sort, and whether we use 1,
3 or 9 samples for choosing a pivot. The insertion sort threshold of 7 was
chosen by testing on hardware available in 1993.  Our testing has shown
that hardware from 2010 to 2020 prefers a larger threshold. The ideal value
varies with input distribution, but 20 seems to be a good compromise that
works well with all of them. This is more in line with thresholds found
in other software as well.

Since the new insertion sort threshold is larger than the singleton range
where a single sample is chosen for the pivot, get rid of the middle
range. There are now two thresholds.

Also use macros instead of hard-coded values. This improves readability
and enables specializing the thresholds if desired.

The changes in the regression tests are needed for the change in sort
stability when the sort key contains duplicates.

Thomas Munro and John Naylor

Discussion:
https://www.postgresql.org/message-id/CA%2BhUKGJhOtjQH%2BwjCodtJzHAFCAPYyt6Qm9E1mUoJph42qo1hg%40mail.gmail.com
Discussion:
https://www.postgresql.org/message-id/CAFBsxsHr-C1xqjUMjeUN3-FvNzKiAt3gcfBKt8PFN2mov7z2gQ%40mail.gmail.com
---
 .../expected/pg_stat_statements.out   |   6 +-
 src/include/lib/sort_template.h   |  38 +-
 src/test/regress/expected/create_index.out|   2 +-
 src/test/regress/expected/geometry.out|   2 +-
 src/test/regress/expected/groupingsets.out|   4 +-
 src/test/regress/expected/inet.out|   4 +-
 src/test/regress/expected/join.out|   2 +-
 src/test/regress/expected/sqljson.out |  10 +-
 src/test/regress/expected/tsrf.out|  28 +-
 src/test/regress/expected/tuplesort.out   |  10 +-
 src/test/regress/expected/window.out  | 542 +-
 11 files changed, 331 insertions(+), 317 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index e0abe34bb6..aeb8f04aea 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -169,12 +169,12 @@ SELECT *
 SELECT * FROM test ORDER BY a;
  a |  b   
 ---+--
- 1 | a   
  1 | 111 
- 2 | b   
+ 1 | a   
  2 | 222 
- 3 | c   
+ 2 | b   
  3 | 333 
+ 3 | c   
  4 | 444 
  5 | 555 
  6 | 666 
diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
index 3122a93009..7b92b2c084 100644
--- a/src/include/lib/sort_template.h
+++ b/src/include/lib/sort_template.h
@@ -40,6 +40,11 @@
  *
  *	  - ST_COMPARE_ARG_TYPE - type of extra argument
  *
+ *	  Optional macros for tuning algorithm choices:
+ *
+ *	  - ST_THRESHOLD_INSERTION_SORT - below this size we do insertion sort
+ *	  - ST_THRESHOLD_MED9 - above this size we partition with med9, otherwise with med3
+ *
  *	  The prototype of the generated sort function is:
  *
  *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
@@ -172,6 +177,14 @@
 #define ST_SORT_INVOKE_ARG
 #endif
 
+/* Default input size thresholds to control algorithm behavior. */
+#ifndef ST_THRESHOLD_INSERTION_SORT
+#define ST_THRESHOLD_INSERTION_SORT 20
+#endif
+#ifndef ST_THRESHOLD_MED9
+#define ST_THRESHOLD_MED9 40
+#endif
+
 #ifdef ST_DECLARE
 
 #ifdef ST_COMPARE_RUNTIME_POINTER
@@ -296,7 +309,7 @@ ST_SORT(ST_ELEMENT_TYPE * data, size_t n
 
 loop:
 	DO_CHECK_FOR_INTERRUPTS();
-	if (n < 7)
+	if (n < ST_THRESHOLD_INSERTION_SORT)
 	{
 		for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
 			 pm += ST_POINTER_STEP)
@@ -318,21 +331,20 @@ loop:
 	}
 	if (presorted)
 		return;
+	/* Partition with median of three for medium input. */
+	pl = a;
 	pm = a + (n / 2) * ST_POINTER_STEP;
-	if (n > 7)
+	pn = a + (n - 1) * ST

<    2   3   4   5   6   7   8   9   10   11   >