Re: Some problems of recovery conflict wait events
On Fri, 27 Mar 2020 at 10:32, Fujii Masao wrote: > > > > On 2020/03/26 14:33, Masahiko Sawada wrote: > > On Tue, 24 Mar 2020 at 17:04, Fujii Masao > > wrote: > >> > >> > >> > >> On 2020/03/05 20:16, Fujii Masao wrote: > >>> > >>> > >>> On 2020/03/05 16:58, Masahiko Sawada wrote: > On Wed, 4 Mar 2020 at 15:21, Fujii Masao > wrote: > > > > > > > > On 2020/03/04 14:31, Masahiko Sawada wrote: > >> On Wed, 4 Mar 2020 at 13:48, Fujii Masao > >> wrote: > >>> > >>> > >>> > >>> On 2020/03/04 13:27, Michael Paquier wrote: > On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote: > > Yeah, so 0001 patch sets existing wait events to recovery conflict > > resolution. For instance, it sets (PG_WAIT_LOCK | > > LOCKTAG_TRANSACTION) > > to the recovery conflict on a snapshot. 0003 patch improves these > > wait > > events by adding the new type of wait event such as > > WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) > > patch > > is the fix for existing versions and 0003 patch is an improvement > > for > > only PG13. Did you mean even 0001 patch doesn't fit for > > back-patching? > >>> > >>> Yes, it looks like a improvement rather than bug fix. > >>> > >> > >> Okay, understand. > >> > I got my eyes on this patch set. The full patch set is in my opinion > just a set of improvements, and not bug fixes, so I would refrain > from > back-backpatching. > >>> > >>> I think that the issue (i.e., "waiting" is reported twice needlessly > >>> in PS display) that 0002 patch tries to fix is a bug. So it should be > >>> fixed even in the back branches. > >> > >> So we need only two patches: one fixes process title issue and another > >> improve wait event. I've attached updated patches. > > > > Thanks for updating the patches! I started reading 0001 patch. > > Thank you for reviewing this patch. > > > > > - /* > > -* Report via ps if we have been waiting for > > more than 500 msec > > -* (should that be configurable?) 
> > -*/ > > - if (update_process_title && new_status == NULL > > && > > - TimestampDifferenceExceeds(waitStart, > > GetCurrentTimestamp(), > > - > >500)) > > > > The patch changes ResolveRecoveryConflictWithSnapshot() and > > ResolveRecoveryConflictWithTablespace() so that they always add > > "waiting" into the PS display, whether wait is really necessary or not. > > But isn't it better to display "waiting" in PS basically when wait is > > necessary, like originally ResolveRecoveryConflictWithVirtualXIDs() > > does as the above? > > You're right. Will fix it. > > > > > ResolveRecoveryConflictWithDatabase(Oid dbid) > > { > > + char*new_status = NULL; > > + > > + /* Report via ps we are waiting */ > > + new_status = set_process_title_waiting(); > > > > In ResolveRecoveryConflictWithDatabase(), there seems no need to > > display "waiting" in PS because no wait occurs when recovery conflict > > with database happens. > > Isn't the startup process waiting for other backend to terminate? > >>> > >>> Yeah, you're right. I agree that "waiting" should be reported in this > >>> case. > >> > >> On second thought, in recovery conflict case, "waiting" should be reported > >> while waiting for the specified delay (e.g., by > >> max_standby_streaming_delay) > >> until the conflict is resolved. So IMO reporting "waiting" in the case of > >> recovery conflict with buffer pin, snapshot, lock and tablespace seems > >> valid, > >> because they are user-visible "expected" wait time. > >> > >> However, in the case of recovery conflict with database, the recovery > >> basically doesn't wait at all and just terminates the conflicting sessions > >> immediately. Then the recovery waits for all those sessions to be > >> terminated, > >> but that wait time is basically small and should not be the user-visible. > >> If that wait time becomes very long because of unresponsive backend, ISTM > >> that LOG or WARNING should be logged instead of reporting something in > >> PS display. 
I'm not sure if that logging is really necessary now, though. > >> Therefore, I'm thinking that "waiting" doesn't need to be reported in the > >> case > >> of recovery conflict with database. Thought? > > > > Fair enough. The longer wait time of conflicts with database is not > > user-expected behaviour so logging would be better. I'd like to just > > drop the ch
Re: backup manifests
Hi, On 2020-03-23 12:15:54 -0400, Robert Haas wrote: > + > +MANIFEST > + > + > + When this option is specified with a value of > ye' s/ye'/yes/ > + or force-escape, a backup manifest is created > + and sent along with the backup. The latter value forces all > filenames > + to be hex-encoded; otherwise, this type of encoding is performed > only > + for files whose names are non-UTF8 octet sequences. > + force-escape is intended primarily for testing > + purposes, to be sure that clients which read the backup manifest > + can handle this case. For compatibility with previous releases, > + the default is MANIFEST 'no'. > + > + > + Are you planning to include a specification of the manifest file format anywhere? I looked through the patches and didn't find anything. I think it'd also be good to include more information about what the point of manifest files actually is. > + > + pg_validatebackup reads the manifest file of a > + backup, verifies the manifest against its own internal checksum, and then > + verifies that the same files are present in the target directory as in the > + manifest itself. It then verifies that each file has the expected > checksum, > + unless the backup was taken the checksum algorithm set to > + none, in which case checksum verification is not > + performed. The presence or absence of directories is not checked, except > + indirectly: if a directory is missing, any files it should have contained > + will necessarily also be missing. Certain files and directories are > + excluded from verification: > + Depending on what you want to use the manifest for, we'd also need to check that there are no additional files. That seems to actually be implemented, which imo should be mentioned here. > +/* > + * Finalize the backup manifest, and send it to the client. 
> + */ > +static void > +SendBackupManifest(manifest_info *manifest) > +{ > + StringInfoData protobuf; > + uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH]; > + charchecksumstringbuf[PG_SHA256_DIGEST_STRING_LENGTH]; > + size_t manifest_bytes_done = 0; > + > + /* > + * If there is no buffile, then the user doesn't want a manifest, so > + * don't waste any time generating one. > + */ > + if (manifest->buffile == NULL) > + return; > + > + /* Terminate the list of files. */ > + AppendStringToManifest(manifest, "],\n"); > + > + /* > + * Append manifest checksum, so that the problems with the manifest > itself > + * can be detected. > + * > + * We always use SHA-256 for this, regardless of what algorithm is > chosen > + * for checksumming the files. If we ever want to make the checksum > + * algorithm used for the manifest file variable, the client will need a > + * way to figure out which algorithm to use as close to the beginning of > + * the manifest file as possible, to avoid having to read the whole > thing > + * twice. > + */ > + manifest->still_checksumming = false; > + pg_sha256_final(&manifest->manifest_ctx, checksumbuf); > + AppendStringToManifest(manifest, "\"Manifest-Checksum\": \""); > + hex_encode((char *) checksumbuf, sizeof checksumbuf, checksumstringbuf); > + checksumstringbuf[PG_SHA256_DIGEST_STRING_LENGTH - 1] = '\0'; > + AppendStringToManifest(manifest, checksumstringbuf); > + AppendStringToManifest(manifest, "\"}\n"); Hm. Is it a great choice to include the checksum for the manifest inside the manifest itself? With a cryptographic checksum it seems like it could make a ton of sense to store the checksum somewhere "safe", but keep the manifest itself alongside the base backup itself. While not huge, they won't be tiny either. 
> diff --git a/src/bin/pg_validatebackup/parse_manifest.c > b/src/bin/pg_validatebackup/parse_manifest.c > new file mode 100644 > index 00..e6b42adfda > --- /dev/null > +++ b/src/bin/pg_validatebackup/parse_manifest.c > @@ -0,0 +1,576 @@ > +/*- > + * > + * parse_manifest.c > + * Parse a backup manifest in JSON format. > + * > + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > + * > + * src/bin/pg_validatebackup/parse_manifest.c > + * > + *- > + */ Doesn't have to be in the first version, but could it be useful to move this to common/ or such? > +/* > + * Validate one directory. > + * > + * 'relpath' is NULL if we are to validate the top-level backup directory, > + * and otherwise the relative path to the directory that is to be validated. > + * > + * 'fullpath' is the backup directory with 'relpath'
Re: error context for vacuum to include block number
On Fri, Mar 27, 2020 at 11:46 AM Justin Pryzby wrote: > > On Thu, Mar 26, 2020 at 11:44:24PM -0500, Justin Pryzby wrote: > > On Fri, Mar 27, 2020 at 09:49:29AM +0530, Amit Kapila wrote: > > > On Fri, Mar 27, 2020 at 3:47 AM Justin Pryzby > > > wrote: > > > > > > > > > Hm, I was just wondering what happens if an error happens *during* > > > > > update_vacuum_error_cbarg(). It seems like if we set > > > > > errcbarg->phase=VACUUM_INDEX before setting > > > > > errcbarg->indname=indname, then an > > > > > error would cause a crash. > > > > > > > > > > > Can't that be avoided if you check if cbarg->indname is non-null in > > > vacuum_error_callback as we are already doing for > > > VACUUM_ERRCB_PHASE_TRUNCATE? > > > > > > > > And if we pfree and set indname before phase, it'd > > > > > be a problem when going from an index phase to non-index phase. > > > > > > How is it possible that we move to the non-index phase without > > > clearing indname as we always revert back the old phase information? > > > > The crash scenario I'm trying to avoid would be like statement_timeout or > > other > > asynchronous event occurring between two non-atomic operations. > > > > I said that there's an issue no matter what order we set indname/phase; > > If we wrote: > > |cbarg->indname = indname; > > |cbarg->phase = phase; > > ..and hit a timeout (or similar) between setting indname=NULL but before > > setting phase=VACUUM_INDEX, then we can crash due to null pointer. > > > > But if we write: > > |cbarg->phase = phase; > > |if (cbarg->indname) {pfree(cbarg->indname);} > > |cbarg->indname = indname ? pstrdup(indname) : NULL; > > ..then we can still crash if we timeout between freeing cbarg->indname and > > setting it to null, due to acccessing a pfreed allocation. 
> If "phase" is updated before "indname", I'm able to induce a synthetic crash
> like this:
>
> +if (errinfo->phase==VACUUM_ERRCB_PHASE_VACUUM_INDEX && errinfo->indname==NULL)
> +{
> +kill(getpid(), SIGINT);
> +pg_sleep(1); // that's needed since signals are delivered asynchronously
> +}
>
> And another crash if we do this after pfree but before setting indname.
>
> +if (errinfo->phase==VACUUM_ERRCB_PHASE_VACUUM_INDEX && errinfo->indname!=NULL)
> +{
> +kill(getpid(), SIGINT);
> +pg_sleep(1);
> +}
>
> I'm not sure if those are possible outside of "induced" errors. Maybe the
> function is essentially atomic due to no CHECK_FOR_INTERRUPTS or similar?

Yes, this is exactly the point. I think unless you have CHECK_FOR_INTERRUPTS
in that function, the problems you are thinking of won't happen.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: error context for vacuum to include block number
On Thu, Mar 26, 2020 at 11:44:24PM -0500, Justin Pryzby wrote: > On Fri, Mar 27, 2020 at 09:49:29AM +0530, Amit Kapila wrote: > > On Fri, Mar 27, 2020 at 3:47 AM Justin Pryzby wrote: > > > > > > > Hm, I was just wondering what happens if an error happens *during* > > > > update_vacuum_error_cbarg(). It seems like if we set > > > > errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname, > > > > then an > > > > error would cause a crash. > > > > > > > > Can't that be avoided if you check if cbarg->indname is non-null in > > vacuum_error_callback as we are already doing for > > VACUUM_ERRCB_PHASE_TRUNCATE? > > > > > > And if we pfree and set indname before phase, it'd > > > > be a problem when going from an index phase to non-index phase. > > > > How is it possible that we move to the non-index phase without > > clearing indname as we always revert back the old phase information? > > The crash scenario I'm trying to avoid would be like statement_timeout or > other > asynchronous event occurring between two non-atomic operations. > > I said that there's an issue no matter what order we set indname/phase; > If we wrote: > |cbarg->indname = indname; > |cbarg->phase = phase; > ..and hit a timeout (or similar) between setting indname=NULL but before > setting phase=VACUUM_INDEX, then we can crash due to null pointer. > > But if we write: > |cbarg->phase = phase; > |if (cbarg->indname) {pfree(cbarg->indname);} > |cbarg->indname = indname ? pstrdup(indname) : NULL; > ..then we can still crash if we timeout between freeing cbarg->indname and > setting it to null, due to acccessing a pfreed allocation. If "phase" is updated before "indname", I'm able to induce a synthetic crash like this: +if (errinfo->phase==VACUUM_ERRCB_PHASE_VACUUM_INDEX && errinfo->indname==NULL) +{ +kill(getpid(), SIGINT); +pg_sleep(1); // that's needed since signals are delivered asynchronously +} And another crash if we do this after pfree but before setting indname. 
+if (errinfo->phase==VACUUM_ERRCB_PHASE_VACUUM_INDEX && errinfo->indname!=NULL)
+{
+kill(getpid(), SIGINT);
+pg_sleep(1);
+}

I'm not sure if those are possible outside of "induced" errors. Maybe the
function is essentially atomic due to no CHECK_FOR_INTERRUPTS or similar?

-- Justin
Re: allow online change primary_conninfo
On Thu, Mar 26, 2020 at 09:39:17PM -0300, Alvaro Herrera wrote:
> Now, would anyone be too upset if I push these in this other order? I
> realized that the reason the tests broke after Sergei's patch is that
> recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
> walreceiver slots, since it's using the non-temp name it tries to give
> to the slot, rather than the temp name under which it is actually
> created. The workaround proposed by 0002 is to edit standby_1's config
> to set walreceiver's slot to be non-temp.

FWIW, I find it a bit strange that the change introduced in 001_stream_rep.pl
as of patch 0002 is basically undone in 0003 by changing the default value of
wal_receiver_create_temp_slot.

> The reason is that I think I would like to get Sergei's patch pushed
> right away (0001+0002, as a single commit); but that I think there's
> more to attack in the walreceiver temp slot code than 0003 here, and I
> don't want to delay the new feature any longer while I figure out the
> fix for that.

Not sure I agree with this approach. I'd rather fix all the existing known
problems first, and then introduce the new features on top of what we consider
to be a clean state. If we run short of time between the first and second
patches, we risk keeping the code with the new feature but without the fixes
discussed for wal_receiver_create_temp_slot.

> (The thing is: if I specify primary_slot_name in the config, why is the
> temp walreceiver slot code not obeying that name? I think walreceiver
> should create a temp slot, sure, but using the given name rather than
> coming up with a random name.)

Good point. I am not sure either why the current logic has been chosen.
The discussion related to the original patch is in this area:
https://www.postgresql.org/message-id/4792e456-d75f-8e6a-2d47-34b8f78c2...@2ndquadrant.com

> (The other reason is that I would like to push one patch to fix
> walreceiver tmp slot rather than two, setting the thing first this way
> and then the opposite way.)

So your problem here is that by applying first 0003 and then 0001-0002 you
would finish by switching wal_receiver_create_temp_slot to PGC_POSTMASTER, and
then back to PGC_SIGHUP again?

-- Michael
Re: backup manifests
Hi,

On 2020-03-26 15:37:11 -0400, Stephen Frost wrote:
> The argument is that adding checksums takes more time. I can understand
> that argument, though I don't really agree with it. Certainly a few
> percent really shouldn't be that big of an issue, and in many cases even
> a sha256 hash isn't going to have that dramatic of an impact on the
> actual overall time.

I don't understand how you can come to that conclusion? It doesn't take very
long to measure openssl's sha256 performance (which is pretty well optimized).
Note that we do use openssl's sha256, when compiled with openssl support.

On my workstation, with a pretty new (but not fastest single core perf model)
intel Xeon Gold 5215, I get:

$ openssl speed sha256
...
type        16 bytes    64 bytes    256 bytes   1024 bytes  8192 bytes  16384 bytes
sha256      76711.75k   172036.78k  321566.89k  399008.09k  431423.49k  433689.94k

IOW, ~430MB/s.

On my laptop, with pretty fast cores:

type        16 bytes    64 bytes    256 bytes   1024 bytes  8192 bytes  16384 bytes
sha256      97054.91k   217188.63k  394864.13k  493441.02k  532100.44k  533441.19k

IOW, ~530MB/s.

530 MB/s is well within the realm of medium sized VMs. And, as mentioned
before, even if you do only half of that, you're still going to be spending
roughly half of the CPU time of sending a base backup.

What makes you think that a few hundred MB/s is out of reach for a large
fraction of PG installations that actually keep backups?

Greetings,

Andres Freund
Re: backup manifests
Hi,

On 2020-03-26 14:02:29 -0400, Robert Haas wrote:
> On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost wrote:
> > Why was crc32c picked for that purpose?
>
> Because it was discovered that 64-bit CRC was too slow, per commit
> 21fda22ec46deb7734f793ef4d7fa6c226b4c78e.

Well, a 32bit crc, not crc32c. IIRC it was the ethernet polynomial (+ bug). We
switched to crc32c at some point because there are hardware implementations:

commit 5028f22f6eb0579890689655285a4778b4ffc460
Author: Heikki Linnakangas
Date:   2014-11-04 11:35:15 +0200

    Switch to CRC-32C in WAL and other places.

> Like, suppose we change the default from CRC-32C to SHA-something. On
> the upside, the error detection rate will increase from 99.999+%
> to something much closer to 100%.

FWIW, I don't buy the relevancy of 99.999+% at all. That's assuming a single
bit error (at relevant lengths, before that it's single burst errors of a
greater length), which isn't that relevant for our purposes.

That's not to say that I don't think a CRC check can provide value. It does
provide a high likelihood of detecting enough errors, including coding errors
in how data is restored (not unimportant), that you're likely to find out
about a problem soon.

> On the downside,
> backups will get as much as 40-50% slower for some users. I hope we
> can agree that both detecting errors and taking backups quickly are
> important. However, it is hard for me to imagine that the typical user
> would want to pay even a 5-10% performance penalty when taking a
> backup in order to improve an error detection feature which they may
> not even use and which already has less than a one-in-a-billion chance
> of going wrong.

FWIW, that seems far too large a slowdown to default to for me. Most people
aren't going to be able to figure out that it's the checksum parameter that
causes this slowdown; they're just going to feel the pain of the backup being
much slower than their hardware.
A few hundred megabytes per second of streaming reads/writes really doesn't
take a beefy server these days. Medium sized VMs + a bit larger network block
devices at all the common cloud providers have considerably higher bandwidth.
Even a raid5x of 4 spinning disks can deliver > 500MB/s. And plenty of even
the smaller instances at many providers have > 5gbit/s network. At the upper
end it's way more than that.

Greetings,

Andres Freund
Re: Race condition in SyncRepGetSyncStandbysPriority
On 2020/03/27 10:26, Tom Lane wrote: Twice in the past month [1][2], buildfarm member hoverfly has managed to reach the "unreachable" Assert(false) at the end of SyncRepGetSyncStandbysPriority. When I search the past discussions, I found that Noah Misch reported the same issue. https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com What seems likely to me, after quickly eyeballing the code, is that hoverfly is hitting the blatantly-obvious race condition in that function. Namely, that the second loop supposes that the state of the walsender array hasn't changed since the first loop. The minimum fix for this, I suppose, would have the first loop capture the sync_standby_priority value for each walsender along with what it's already capturing. But I wonder if the whole function shouldn't be rewritten from scratch, because it seems like the algorithm is both expensively brute-force and unintelligible, which is a sad combination. It's likely that the number of walsenders would never be high enough that efficiency could matter, but then couldn't we use an algorithm that is less complicated and more obviously correct? +1 to rewrite the function with better algorithm. (Because the alternative conclusion, if you reject the theory that a race is happening, is that the algorithm is just flat out buggy; something that's not too easy to disprove either.) Another fairly dubious thing here is that whether or not *am_sync gets set depends not only on whether MyWalSnd is claiming to be synchronous but on how many lower-numbered walsenders are too. Is that really the right thing? But worse than any of that is that the return value seems to be a list of walsender array indexes, meaning that the callers cannot use it without making even stronger assumptions about the array contents not having changed since the start of this function. It sort of looks like the design is based on the assumption that the array contents can't change while SyncRepLock is held ... 
but if that's the plan then why bother with the per-walsender spinlocks? In
any case this assumption seems to be failing, suggesting either that there's a
caller that's not holding SyncRepLock when it calls this function, or that
somebody is failing to take that lock while modifying the array.

As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at its own
convenient timing, when SIGHUP is signaled. That is, at a certain moment, some
walsenders (also their WalSnd entries in shmem) work based on the latest
configuration but the others (also their WalSnd entries) work based on the old
one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among all
running walsenders as the above, by using the configuration info that this
walsender is based on. But this calculated lowest priority would be invalid if
another walsender is based on a different (e.g., old) configuration. This can
cause the (other) walsender to have lower priority than the calculated lowest
priority and the second loop in SyncRepGetSyncStandbysPriority() to
unexpectedly end. Therefore, the band-aid fix seems to be to set the lowest
priority to a very large number at the beginning of
SyncRepGetSyncStandbysPriority().

Regards,

-- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Re: error context for vacuum to include block number
On Fri, Mar 27, 2020 at 09:49:29AM +0530, Amit Kapila wrote:
> On Fri, Mar 27, 2020 at 3:47 AM Justin Pryzby wrote:
> > > Hm, I was just wondering what happens if an error happens *during*
> > > update_vacuum_error_cbarg(). It seems like if we set
> > > errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname,
> > > then an error would cause a crash.
>
> Can't that be avoided if you check if cbarg->indname is non-null in
> vacuum_error_callback as we are already doing for
> VACUUM_ERRCB_PHASE_TRUNCATE?
>
> > > And if we pfree and set indname before phase, it'd
> > > be a problem when going from an index phase to non-index phase.
>
> How is it possible that we move to the non-index phase without
> clearing indname as we always revert back the old phase information?

The crash scenario I'm trying to avoid would be like statement_timeout or
other asynchronous event occurring between two non-atomic operations.

I said that there's an issue no matter what order we set indname/phase;
If we wrote:

|cbarg->indname = indname;
|cbarg->phase = phase;

..and hit a timeout (or similar) between setting indname=NULL but before
setting phase=VACUUM_INDEX, then we can crash due to null pointer.

But if we write:

|cbarg->phase = phase;
|if (cbarg->indname) {pfree(cbarg->indname);}
|cbarg->indname = indname ? pstrdup(indname) : NULL;

..then we can still crash if we timeout between freeing cbarg->indname and
setting it to null, due to accessing a pfreed allocation.

> > > So maybe we
> > > have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the
> > > function, and errcbarg->phase=phase last.
>
> I find that a bit ad-hoc, if possible, let's try to avoid it.

I think we can do what you're suggesting, if the callback checks if
(cbarg->indname!=NULL).
We'd have to write:

// Must set indname *before* updating phase, in case an error occurs before
// phase is set, to avoid crashing if we're going from an index phase to a
// non-index phase (which should not read indname). Must not free indname
// until it's set to null.
char *tmp = cbarg->indname;
cbarg->indname = indname ? pstrdup(indname) : NULL;
cbarg->phase = phase;
if (tmp)
    pfree(tmp);

Do you think that's better?

-- Justin
Re: backup manifests
Hi, On 2020-03-26 11:37:48 -0400, Robert Haas wrote: > I mean, you're just repeating the same argument here, and it's just > not valid. Regardless of the file size, the chances of a false > checksum match are literally less than one in a billion. There is > every reason to believe that users will be happy with a low-overhead > method that has a 99.999+% chance of detecting corrupt files. I do > agree that a 64-bit CRC would probably be not much more expensive and > improve the probability of detecting errors even further I *seriously* doubt that it's true that 64bit CRCs wouldn't be slower. The only reason CRC32C is semi-fast is that we're accelerating it using hardware instructions (on x86-64 and ARM at least). Before that it was very regularly the bottleneck for processing WAL - and it still sometimes is. Most CRCs aren't actually very fast to compute, because they don't lend themselves to benefit from ILP or SIMD. We spent a fair bit of time optimizing our crc implementation before the hardware support was widespread. > but I wanted to restrict this patch to using infrastructure we already > have. The choices there are the various SHA functions (so I supported > those), MD5 (which I deliberately omitted, for reasons I hope you'll > be the first to agree with), CRC-32C (which is fast), a couple of > other CRC-32 variants (which I omitted because they seemed redundant > and one of them only ever existed in PostgreSQL because of a coding > mistake), and the hacked-up version of FNV that we use for page-level > checksums (which is only 16 bits and seems to have no advantages for > this purpose). FWIW, FNV is only 16bit because we reduce its size to 16 bit. See the tail of pg_checksum_page. I'm not sure the error detection guarantees of various CRC algorithms are that relevant here, btw. IMO, for something like checksums in a backup, just having a single one-bit error isn't as common as having larger errors (e.g. entire blocks beeing zeroed). 
And to detect that 32bit checksums aren't that good. > > As for folks who are that close to the edge on their backup timing that > > they can't have it slow down- chances are pretty darn good that they're > > not far from ending up needing to find a better solution than > > pg_basebackup anyway. Or they don't need to generate a manifest (or, I > > suppose, they could have one but not have checksums..). > > 40-50% is a lot more than "if you were on the edge." sha256 does about approx 400MB/s per core on modern intel CPUs. That's way below commonly accessible storage / network capabilities (and even if you're only doing 200MB/s, you're still going to spend roughly half of the CPU time just doing hashing. It's unlikely that you're going to see much speedups for sha256 just by upgrading a CPU. While there are hardware instructions available, they don't result in all that large improvements. Of course, we could also start using the GPU (err, really no). Defaulting to that makes very little sense to me. You're not just going to spend that time while backing up, but also when validating backups (i.e. network limits suddenly aren't a relevant bottleneck anymore). > > I fail to see the usefulness of a tool that doesn't actually verify that > > the backup is able to be restored from. > > > > Even pg_basebackup (in both fetch and stream modes...) checks that we at > > least got all the WAL that's needed for the backup from the server > > before considering the backup to be valid and telling the user that > > there was a successful backup. With what you're proposing here, we > > could have someone do a pg_basebackup, get back an ERROR saying the > > backup wasn't valid, and then run pg_validatebackup and be told that the > > backup is valid. I don't get how that's sensible. > > I'm sorry that you can't see how that's sensible, but it doesn't mean > that it isn't sensible. 
It is totally unrealistic to expect that any > backup verification tool can verify that you won't get an error when > trying to use the backup. That would require that everything that the > validation tool try to do everything that PostgreSQL will try to do > when the backup is used, including running recovery and updating the > data files. Anything less than that creates a real possibility that > the backup will verify good but fail when used. This tool has a much > narrower purpose, which is to try to verify that we (still) have the > files the server sent as part of the backup and that, to the best of > our ability to detect such things, they have not been modified. As you > know, or should know, the WAL files are not sent as part of the > backup, and so are not verified. Other things that would also be > useful to check are also not verified. It would be fantastic to have > more verification tools in the future, but it is difficult to see why > anyone would bother trying if an attempt to get the first one > committed gets blocked because it does not yet do everything. Very few > patches
Re: error context for vacuum to include block number
On Fri, Mar 27, 2020 at 3:47 AM Justin Pryzby wrote: > > > > Hm, I was just wondering what happens if an error happens *during* > > update_vacuum_error_cbarg(). It seems like if we set > > errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname, then > > an > > error would cause a crash. > > Can't that be avoided if you check if cbarg->indname is non-null in vacuum_error_callback as we are already doing for VACUUM_ERRCB_PHASE_TRUNCATE? > > And if we pfree and set indname before phase, it'd > > be a problem when going from an index phase to non-index phase. How is it possible that we move to the non-index phase without clearing indname as we always revert back the old phase information? I think it is possible only if we don't clear indname after the phase is over. > > So maybe we > > have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the > > function, > > and errcbarg->phase=phase last. I find that a bit ad-hoc, if possible, let's try to avoid it. > > And addressed that. > > Also, I realized that lazy_cleanup_index has an early "return", so the "Revert > back" was ineffective. > We can call it immediately after index_vacuum_cleanup to avoid that. > We talked about how that wasn't needed, since we never > go back to a previous phase. Amit wanted to keep it there for consistency, > but > I'd prefer to put any extra effort into calling out the special treatment > needed/given to lazy_vacuum_heap/index, rather than making everything > "consistent". > Apart from being consistent, the point was it doesn't seem good that API being called to assume that there is nothing more the caller can do. It might be problematic if we later want to enhance or add something to the caller. > Amit: I also moved the TRUNCATE_HEAP bit back to truncate_heap(), There is no problem with it. 
We can do it either way, and I had also considered doing it the way you have done, but decided to keep it in the caller because of the previous point I mentioned (I'm not sure it is a good idea for the API being called to assume that there is nothing more the caller can do after this).

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
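The ordering hazard discussed above can be made concrete with a small sketch. This is a hypothetical Python model of the callback-argument update, not PostgreSQL source (field and phase names are simplified): if the error callback fires between setting the phase and setting the index name, it must tolerate a still-null index name, which is exactly the null check suggested in the thread.

```python
# Hypothetical model: an error callback may fire between the two field
# assignments, so the callback must cope with a half-updated state.

VACUUM_ERRCB_PHASE_SCAN_HEAP = "scan heap"
VACUUM_ERRCB_PHASE_VACUUM_INDEX = "vacuum index"

class ErrCbArg:
    def __init__(self):
        self.phase = VACUUM_ERRCB_PHASE_SCAN_HEAP
        self.indname = None
        self.blkno = 7

def error_callback(arg):
    # Mirrors the suggested check in vacuum_error_callback: only mention
    # the index name if it has actually been set.
    if arg.phase == VACUUM_ERRCB_PHASE_VACUUM_INDEX:
        if arg.indname is not None:
            return "while vacuuming index %s" % arg.indname
        return "while vacuuming index"   # half-updated state, still safe
    return "while scanning heap, block %d" % arg.blkno

arg = ErrCbArg()
arg.phase = VACUUM_ERRCB_PHASE_VACUUM_INDEX   # phase is set first...
msg_mid_update = error_callback(arg)          # ...an error fires here...
arg.indname = "idx_foo"                       # ...then indname is filled in
msg_after = error_callback(arg)
```

With the null check in place, a callback firing mid-update degrades gracefully instead of dereferencing a null pointer.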
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
> Another issue is this:
>
> +VACUUM ( FULL [, ...] ) [ TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
>
> As you mentioned in your v1 patch, in the other cases, "tablespace [tablespace]" is added at the end of the command rather than in the middle. I wasn't able to make that work, maybe because "tablespace" isn't a fully reserved word (?). I didn't try with "SET TABLESPACE", although I understand it'd be better without "SET".

I think we should use the parenthesized syntax for vacuum - it seems clear in hindsight. Possibly REINDEX should use that, too, instead of adding OptTablespace at the end. I'm not sure. CLUSTER doesn't support parenthesized syntax, but .. maybe it should? Also, perhaps VAC FULL (and CLUSTER, if it grows parenthesized syntax) should support something like this: USING INDEX TABLESPACE name. I guess I would prefer just "index tablespace", without "using":

|VACUUM (FULL, TABLESPACE ts, INDEX TABLESPACE its) t;
|CLUSTER (VERBOSE, TABLESPACE ts, INDEX TABLESPACE its) t;

-- Justin
Re: Some problems of recovery conflict wait events
On 2020/03/26 14:33, Masahiko Sawada wrote: On Tue, 24 Mar 2020 at 17:04, Fujii Masao wrote: On 2020/03/05 20:16, Fujii Masao wrote: On 2020/03/05 16:58, Masahiko Sawada wrote: On Wed, 4 Mar 2020 at 15:21, Fujii Masao wrote: On 2020/03/04 14:31, Masahiko Sawada wrote: On Wed, 4 Mar 2020 at 13:48, Fujii Masao wrote: On 2020/03/04 13:27, Michael Paquier wrote: On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:

Yeah, so the 0001 patch sets existing wait events for recovery conflict resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION) for the recovery conflict on a snapshot. The 0003 patch improves these wait events by adding new types of wait event such as WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore the 0001 (and 0002) patches are the fix for existing versions and the 0003 patch is an improvement for PG13 only. Did you mean that even the 0001 patch doesn't fit for back-patching?

Yes, it looks like an improvement rather than a bug fix.

Okay, understood.

I got my eyes on this patch set. The full patch set is in my opinion just a set of improvements, and not bug fixes, so I would refrain from back-patching.

I think that the issue (i.e., "waiting" is reported twice needlessly in the PS display) that the 0002 patch tries to fix is a bug. So it should be fixed even in the back branches.

So we need only two patches: one fixes the process title issue and the other improves the wait events. I've attached updated patches.

Thanks for updating the patches! I started reading the 0001 patch.

Thank you for reviewing this patch.

-   /*
-    * Report via ps if we have been waiting for more than 500 msec
-    * (should that be configurable?)
-    */
-   if (update_process_title && new_status == NULL &&
-       TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(), 500))

The patch changes ResolveRecoveryConflictWithSnapshot() and ResolveRecoveryConflictWithTablespace() so that they always add "waiting" to the PS display, whether a wait is really necessary or not.
But isn't it better to display "waiting" in PS basically when a wait is necessary, like ResolveRecoveryConflictWithVirtualXIDs() originally does, as above?

You're right. Will fix it.

ResolveRecoveryConflictWithDatabase(Oid dbid)
{
+   char    *new_status = NULL;
+
+   /* Report via ps we are waiting */
+   new_status = set_process_title_waiting();

In ResolveRecoveryConflictWithDatabase(), there seems to be no need to display "waiting" in PS because no wait occurs when a recovery conflict with a database happens.

Isn't the startup process waiting for the other backends to terminate?

Yeah, you're right. I agree that "waiting" should be reported in this case.

On second thought, in the recovery conflict case, "waiting" should be reported while waiting for the specified delay (e.g., by max_standby_streaming_delay) until the conflict is resolved. So IMO reporting "waiting" in the case of a recovery conflict with a buffer pin, snapshot, lock or tablespace seems valid, because that is user-visible, "expected" wait time.

However, in the case of a recovery conflict with a database, the recovery basically doesn't wait at all and just terminates the conflicting sessions immediately. Then the recovery waits for all those sessions to be terminated, but that wait time is basically small and should not be user-visible. If that wait time becomes very long because of an unresponsive backend, ISTM that a LOG or WARNING should be logged instead of reporting something in the PS display. I'm not sure if that logging is really necessary now, though. Therefore, I'm thinking that "waiting" doesn't need to be reported in the case of a recovery conflict with a database. Thoughts?

Fair enough. The longer wait time of conflicts with a database is not user-expected behaviour, so logging would be better. I'd like to just drop the change around ResolveRecoveryConflictWithDatabase() from the patch.

Makes sense.
+   if (update_process_title)
+       waitStart = GetCurrentTimestamp();

Since LockBufferForCleanup() can be called very frequently, I don't think that it's a good thing to call GetCurrentTimestamp() every time LockBufferForCleanup() is called.

+   /* Report via ps if we have been waiting for more than 500 msec */
+   if (update_process_title && new_status == NULL &&
+       TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(), 500))

Do we really want to see "waiting" in PS even in non-hot-standby mode? If max_standby_streaming_delay and deadlock_timeout are set to large values, ResolveRecoveryConflictWithBufferPin() can wait for a long time, e.g., more than 500ms. In that case, I'm afraid that "report
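The reporting pattern in the snippet above is: remember when the wait started, and add "waiting" to the process title only once, after more than 500 ms have elapsed. A small standalone sketch with a simulated clock (Python, not the PostgreSQL APIs — the tick-based loop and millisecond clock are modeling assumptions):

```python
# Model of the "report 'waiting' after 500 ms, at most once" pattern.

REPORT_DELAY_MS = 500

def simulate_wait(tick_ms, ticks):
    now = 0                  # stands in for GetCurrentTimestamp(), in ms
    wait_start = now         # captured once, when the wait begins
    new_status = None        # non-None once "waiting" has been reported
    reports = 0
    for _ in range(ticks):
        now += tick_ms
        # The new_status-is-None guard ensures the title is updated once.
        if new_status is None and (now - wait_start) > REPORT_DELAY_MS:
            new_status = "waiting"
            reports += 1     # a set_ps_display() call would go here
    return reports

reports_long = simulate_wait(100, 20)    # 2 s wait: reported exactly once
reports_short = simulate_wait(100, 4)    # 400 ms wait: never reported
```

The cost concern raised above is orthogonal: in this pattern the clock is still read on every iteration; avoiding that would require deferring the initial timestamp capture as well.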
Re: Memory-Bounded Hash Aggregation
On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote:
> Hello, when calculating the disk costs of hash aggregation that spills to disk, there is something wrong with how we determine depth:
>
> depth = ceil( log(nbatches - 1) / log(num_partitions) );
>
> If nbatches is some number between 1.0 and 2.0, we would have a negative depth. As a result, we may have a negative cost for the hash aggregation plan node, as described in [1]. I don't think 'log(nbatches - 1)' is what we want here. Should it be just '(nbatches - 1)'?

I think using log() is correct, but why should we allow fractional nbatches values between 1.0 and 2.0? You either have 1 batch or 2 batches; you can't have 1.5 batches. So I think the issue is here:

nbatches = Max((numGroups * hashentrysize) / mem_limit, numGroups / ngroups_limit);

and we should probably do

nbatches = ceil(nbatches);

right after it.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
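The arithmetic being discussed can be checked standalone. This Python sketch uses assumed numbers (1.1 batches, 4 partitions), not values taken from the planner, just to show how a fractional batch estimate makes the quoted expression go negative and how rounding the estimate up first avoids it:

```python
from math import ceil, log

def spill_depth(nbatches, num_partitions):
    # The cost-model expression quoted above.
    return ceil(log(nbatches - 1) / log(num_partitions))

# A fractional batch estimate between 1.0 and 2.0 makes log(nbatches - 1)
# negative, and a sufficiently small fraction drives the depth below zero,
# which then propagates into a negative plan cost:
bad_depth = spill_depth(1.1, 4)      # log(0.1) / log(4) ~= -1.66 -> depth -1

# Tomas's suggestion: round the estimate up to a whole number of batches
# right after computing it, before it reaches the depth formula.
good_depth = spill_depth(ceil(1.1), 4)   # log(1) / log(4) = 0 -> depth 0
```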
Race condition in SyncRepGetSyncStandbysPriority
Twice in the past month [1][2], buildfarm member hoverfly has managed to reach the "unreachable" Assert(false) at the end of SyncRepGetSyncStandbysPriority. What seems likely to me, after quickly eyeballing the code, is that hoverfly is hitting the blatantly-obvious race condition in that function. Namely, that the second loop supposes that the state of the walsender array hasn't changed since the first loop. The minimum fix for this, I suppose, would have the first loop capture the sync_standby_priority value for each walsender along with what it's already capturing. But I wonder if the whole function shouldn't be rewritten from scratch, because it seems like the algorithm is both expensively brute-force and unintelligible, which is a sad combination. It's likely that the number of walsenders would never be high enough that efficiency could matter, but then couldn't we use an algorithm that is less complicated and more obviously correct? (Because the alternative conclusion, if you reject the theory that a race is happening, is that the algorithm is just flat out buggy; something that's not too easy to disprove either.) Another fairly dubious thing here is that whether or not *am_sync gets set depends not only on whether MyWalSnd is claiming to be synchronous but on how many lower-numbered walsenders are too. Is that really the right thing? But worse than any of that is that the return value seems to be a list of walsender array indexes, meaning that the callers cannot use it without making even stronger assumptions about the array contents not having changed since the start of this function. It sort of looks like the design is based on the assumption that the array contents can't change while SyncRepLock is held ... but if that's the plan then why bother with the per-walsender spinlocks? 
In any case this assumption seems to be failing, suggesting either that there's a caller that's not holding SyncRepLock when it calls this function, or that somebody is failing to take that lock while modifying the array. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-02-29%2001%3A34%3A55 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-26%2013%3A51%3A15
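The "minimum fix" described above — capturing sync_standby_priority during the first pass so that later decisions operate on a consistent snapshot — might look like this in outline. This is a hypothetical Python model with invented field names and a simplified selection rule, not the walsender code:

```python
# Model of the one-snapshot approach: read each walsender's state exactly
# once, then choose sync standbys only from that snapshot, so a concurrent
# change to the shared array cannot make the two passes disagree.

def pick_sync_standbys(walsenders, num_sync):
    # Pass 1: capture index and priority together, under one read.
    snapshot = [(i, ws["priority"])
                for i, ws in enumerate(walsenders)
                if ws["pid"] != 0 and ws["priority"] > 0]
    # Pass 2: operate only on the snapshot (lowest priority value wins).
    snapshot.sort(key=lambda entry: entry[1])
    return [i for i, _ in snapshot[:num_sync]]

walsenders = [
    {"pid": 11, "priority": 2},   # active, lower preference
    {"pid": 0,  "priority": 1},   # slot not in use
    {"pid": 13, "priority": 1},   # active, highest preference
]
chosen = pick_sync_standbys(walsenders, 1)
```

Note that returning array indexes still leaves the caller-side hazard Tom describes; a fuller fix would return the captured per-walsender data itself rather than indexes into mutable shared state.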
Re: [HACKERS] make async slave to wait for lsn to be replayed
Anna, thank you for your review.

On 2020-03-25 21:10, Anna Akenteva wrote:
> On 2020-03-21 14:16, Kartyshov Ivan wrote:
>> and event is: LSN value [options] TIMESTAMP value
>
> I would maybe remove WAIT FOR TIMESTAMP. As Robert Haas has pointed out, it seems a lot like pg_sleep_until(). Besides, it doesn't necessarily need to be connected to transaction start, which makes it different from WAIT FOR LSN - so I wouldn't mix them together.

I don't mind, but I think we should get one more opinion on this point.

> === This is how WaitUtility() is called - note that time_val will always be > 0:
>
> + if (time_val <= 0)
> +     time_val = 1;
> + ...
> + res = WaitUtility(lsn, (int)(time_val * 1000), dest);
>
> (time_val * 1000) is passed to WaitUtility() as the delay argument. And inside WaitUtility() we have this:
>
> + if (delay > 0)
> +     latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
> + else
> +     latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
>
> Since we always pass a delay value greater than 0, we'll never get to the "else" clause here and we'll never be ready to wait for LSN forever. Perhaps due to that, the current test outputs this after a simple WAIT FOR LSN command: psql::1: NOTICE: LSN is not reached.

I've fixed that, and interrupt handling, in the latest patch. Anna, feel free to work on this patch.

-- Ivan Kartyshov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml index 8d91f3529e..8697f9807f 100644 --- a/doc/src/sgml/ref/allfiles.sgml +++ b/doc/src/sgml/ref/allfiles.sgml @@ -187,6 +187,7 @@ Complete list of usable sgml source files in this directory. + diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml index c23bbfb4e7..45289c0173 100644 --- a/doc/src/sgml/ref/begin.sgml +++ b/doc/src/sgml/ref/begin.sgml @@ -21,13 +21,25 @@ PostgreSQL documentation -BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] +BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...]
] wait_for_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +wait_for_event +WAIT FOR [ANY | ALL] event [, event ...] + +where event is: +LSN value [options] +TIMESTAMP value + +and where options is one of: +TIMEOUT delay +UNTIL TIMESTAMP timestamp + diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml index d6cd1d4177..01b402e9cd 100644 --- a/doc/src/sgml/ref/start_transaction.sgml +++ b/doc/src/sgml/ref/start_transaction.sgml @@ -21,13 +21,24 @@ PostgreSQL documentation -START TRANSACTION [ transaction_mode [, ...] ] +START TRANSACTION [ transaction_mode [, ...] ] wait_for_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +wait_for_event +WAIT FOR [ANY | SOME | ALL] event [, event ...] + +where event is: +LSN value [options] +TIMESTAMP value + +and where options is one of: +TIMEOUT delay +UNTIL TIMESTAMP timestamp diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml new file mode 100644 index 00..b824088f6c --- /dev/null +++ b/doc/src/sgml/ref/wait.sgml @@ -0,0 +1,148 @@ + + + + + WAIT FOR + + + + WAIT FOR + 7 + SQL - Language Statements + + + + WAIT FOR + wait for the target LSN to be replayed + + + + +WAIT FOR [ANY | ALL] event [, event ...] 
+ +where event is: +LSN value [options] +TIMESTAMP value + +and where options is one of: +TIMEOUT delay +UNTIL TIMESTAMP timestamp + +WAIT FOR LSN 'lsn_number' +WAIT FOR LSN 'lsn_number' TIMEOUT wait_timeout +WAIT FOR LSN 'lsn_number' UNTIL TIMESTAMP wait_time +WAIT FOR TIMESTAMP wait_time +WAIT FOR ALL LSN 'lsn_number' TIMEOUT wait_timeout, TIMESTAMP wait_time +WAIT FOR ANY LSN 'lsn_number', TIMESTAMP wait_time + + + + + Description + + + WAIT FOR provides a simple + interprocess communication mechanism to wait for timestamp or the target log sequence + number (LSN) on standby in PostgreSQL + databases with master-standby asynchronous replication. When run with the + LSN option, the WAIT FOR command + waits for the specified LSN to be replayed. By default, wait + time is unlimited. Waiting can be interrupted using Ctrl+C, or + by shutting down the postgres server. You can also limit the wait + time using the TIMEOUT option. + + + + + + Parameters + + + +LSN + + + Specify the target log sequence number to wait for. + + + + + +TIMEOUT wait_timeout + + + Limit the time interval to wait for the LSN to be replayed. + The specified wait_timeout must be an integer + and is measured in milliseconds. + + + + +
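The control-flow issue Anna points out above can be sketched standalone. This Python model (invented function name, string return values standing in for the latch-event flag sets) shows why the "wait forever" branch is unreachable: the caller clamps time_val to at least 1 before converting to milliseconds, so the callee's delay argument is always positive.

```python
# Model of the caller-side clamp plus the callee-side branch quoted above.

def choose_wait_mode(time_val):
    # Caller side, as quoted from the patch:
    if time_val <= 0:
        time_val = 1
    delay = int(time_val * 1000)
    # WaitUtility() side:
    if delay > 0:
        return "timeout"    # WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH
    return "forever"        # WL_LATCH_SET | WL_POSTMASTER_DEATH

# No input can ever select the "forever" mode:
modes = {choose_wait_mode(t) for t in (-5, 0, 0.5, 10)}
```

The fix is to let a non-positive (or absent) timeout reach the callee unclamped, so that delay == 0 can actually select the unbounded-wait event mask.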
Re: SLRU statistics
Hi,

Attached is v3 of the patch, with one big change and various small ones. The main change is that it gets rid of the SlruType enum and the new field in SlruCtlData. Instead, the patch now uses the name passed to SimpleLruInit (which is then stored as the LWLock tranche name). The counters are still stored in a fixed-size array, and there's a simple name/index mapping. We don't have any registry of stable SLRU IDs, so I can't think of anything better, and I think this is good enough for now.

The other change is that I got rid of the io_error counter. We don't have that for shared buffers etc. either, anyway.

I've also renamed the columns from "pages" to "blks" to make them consistent with other similar stats (blks_hit, blks_read). I've renamed the fields to "blks_written" and "blks_zeroed". And finally, I've added the view to monitoring.sgml.

Barring objections, I'll get this committed in the next few days, after reviewing the comments a bit.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From e1a187b9b331798c87900f94aa77999f9198f556 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Thu, 26 Mar 2020 20:52:26 +0100 Subject: [PATCH] SLRU stats --- doc/src/sgml/monitoring.sgml | 77 + src/backend/access/transam/slru.c | 23 +++ src/backend/catalog/system_views.sql | 13 ++ src/backend/postmaster/pgstat.c | 238 +++ src/backend/utils/adt/pgstatfuncs.c | 77 + src/include/catalog/pg_proc.dat | 10 ++ src/include/pgstat.h | 53 +- src/test/regress/expected/rules.out | 10 ++ 8 files changed, 500 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 270178d57e..b58ac5acb8 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -575,6 +575,13 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser yet included in pg_stat_user_functions). + + pg_stat_slrupg_stat_slru + One row per SLRU, showing statistics of operations.
See +for details. + + + @@ -3254,6 +3261,76 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + The pg_stat_slru view will contain + one row for each tracked SLRU cache, showing statistics about access + to cached pages. + + + + pg_stat_slru View + + + + Column + Type + Description + + + + + + name + name + name of the SLRU + + + blks_zeroed + bigint + Number of blocks zeroed during initializations + + + blks_hit + biging + Number of times disk blocks were found already in the SLRU, + so that a read was not necessary (this only includes hits in the + SLRU, not the operating system's file system cache) + + + + blks_read + bigint + Number of disk blocks read for this SLRU + + + blks_written + bigint + Number of disk blocks written for this SLRU + + + blks_exists + bigint + Number of blocks checked for existence for this SLRU + + + flushes + bigint + Number of flushes of dirty data for this SLRU + + + truncates + bigint + Number of truncates for this SLRU + + + stats_reset + timestamp with time zone + Time at which these statistics were last reset + + + + + The pg_stat_user_functions view will contain one row for each tracked function, showing statistics about executions of diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index d5b7a08f73..f7160dd574 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -286,6 +286,9 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno) /* Assume this page is now the latest active page */ shared->latest_page_number = pageno; + /* update the stats counter of zeroed pages */ + pgstat_slru_count_page_zeroed(ctl); + return slotno; } @@ -403,6 +406,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok, } /* Otherwise, it's ready to use */ SlruRecentlyUsed(shared, slotno); + + /* update the stats counter of pages found in the SLRU */ + pgstat_slru_count_page_hit(ctl); + return slotno; } @@ -444,6 +451,10 @@ SimpleLruReadPage(SlruCtl 
ctl, int pageno, bool write_ok, SlruReportIOError(ctl, pageno, xid); SlruRecentlyUsed(shared, slotno); + + /* update the stats counter of pages not found in SLRU */ + pgstat_slru_count_page_read(ctl); + return slotno; }
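The bookkeeping Tomas describes — per-SLRU counters in a fixed-size array, with the name passed to SimpleLruInit mapped to an array slot instead of relying on a registry of stable SLRU IDs — can be sketched as follows. This is a hypothetical Python model; the array size, the counter set, and the overflow bucket are assumptions for illustration, not the patch's actual layout:

```python
# Model of a name -> fixed-slot counter mapping: each SLRU name claims the
# next free slot on first use; if the table is full, counts fall into the
# last slot (an assumed "other" bucket).

SLRU_NUM_ELEMENTS = 8
slru_names = []      # position in this list == index into slru_counters
slru_counters = [dict(blks_zeroed=0, blks_hit=0, blks_read=0, blks_written=0)
                 for _ in range(SLRU_NUM_ELEMENTS)]

def slru_index(name):
    if name in slru_names:
        return slru_names.index(name)
    if len(slru_names) < SLRU_NUM_ELEMENTS:
        slru_names.append(name)          # claim the next free slot
        return len(slru_names) - 1
    return SLRU_NUM_ELEMENTS - 1         # table full: overflow bucket

def count(name, counter):
    # Stands in for pgstat_slru_count_page_hit() and friends, which the
    # patch calls from slru.c with the SLRU's tranche name.
    slru_counters[slru_index(name)][counter] += 1

count("clog", "blks_hit")
count("clog", "blks_hit")
count("subtrans", "blks_read")
```

The appeal of this scheme is that it needs no central enum: any code path that creates an SLRU via SimpleLruInit gets counted under whatever name it registered.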
Re: error context for vacuum to include block number
On 2020-Mar-26, Justin Pryzby wrote:

> On Thu, Mar 26, 2020 at 07:49:51PM -0300, Alvaro Herrera wrote:
> > BTW I'm pretty sure this "revert back" phrasing is not good English -- you should just use "revert". Maybe get some native speaker's opinion on it.
>
> I'm a native speaker; "revert back" might be called redundant but I think it's common usage.

Oh, okay.

> > And speaking of language, I find the particle "cbarg" rather very ugly,
>
> I renamed it since it was kind of opaque looking. It's in all the same places, so equally infectious; but I hope you like it better.

I like it much better, thanks :-)

-- Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: allow online change primary_conninfo
Now, would anyone be too upset if I push these in this other order? I realized that the reason the tests broke after Sergei's patch is that recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp walreceiver slots, since it's using the non-temp name it tries to give to the slot, rather than the temp name under which it is actually created. The workaround proposed by 0002 is to edit standby_1's config to set walreceiver's slot to be non-temp. Thanks to Justin Pryzby for offlist typo corrections. The reason is that I think I would like to get Sergei's patch pushed right away (0001+0002, as a single commit); but that I think there's more to attack in the walreceiver temp slot code than 0003 here, and I don't want to delay the new feature any longer while I figure out the fix for that. (The thing is: if I specify primary_slot_name in the config, why is the temp walreceiver slot code not obeying that name? I think walreceiver should create a temp slot, sure, but using the given name rather than coming up with a random name.) (The other reason is that I would like to push one patch to fix walreceiver tmp slot rather than two, setting the thing first this way and then the opposite way.) 
-- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From f7a9717caa9dd73c00fb5e6a1dddb354ad78d09a Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 25 Mar 2020 20:48:56 -0300 Subject: [PATCH v11 1/3] Allow changing primary_conninfo and primary_slot_name in SIGHUP --- doc/src/sgml/config.sgml | 16 ++- src/backend/access/transam/xlog.c | 130 +++--- src/backend/access/transam/xlogreader.c | 6 +- src/backend/postmaster/startup.c | 30 +++- src/backend/replication/walreceiver.c | 12 +- src/backend/utils/misc/guc.c | 4 +- src/backend/utils/misc/postgresql.conf.sample | 2 - src/include/access/xlog.h | 2 + src/test/recovery/t/001_stream_rep.pl | 24 +++- 9 files changed, 165 insertions(+), 61 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 355b408b0a..1cf88e953d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4028,9 +4028,15 @@ ANY num_sync ( @@ -4045,9 +4051,13 @@ ANY num_sync ( ). - This parameter can only be set at server start. + This parameter can only be set in the postgresql.conf + file or on the server command line. This setting has no effect if primary_conninfo is not - set. + set or the server is not in standby mode. + + + The WAL receiver is restarted after an update of primary_slot_name. diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7621fc05e2..bbf8d4eee5 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -815,9 +815,13 @@ static XLogSource readSource = XLOG_FROM_ANY; * currently have a WAL file open. If lastSourceFailed is set, our last * attempt to read from currentSource failed, and we should try another source * next. + * + * pendingWalRcvRestart is set when a config change occurs that requires a + * walreceiver restart. This is only valid in XLOG_FROM_STREAM state. 
*/ static XLogSource currentSource = XLOG_FROM_ANY; static bool lastSourceFailed = false; +static bool pendingWalRcvRestart = false; typedef struct XLogPageReadPrivate { @@ -11904,6 +11908,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, for (;;) { XLogSource oldSource = currentSource; + bool startWalReceiver = false; /* * First check if we failed to read from the current source, and @@ -11938,53 +11943,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, return false; /* - * If primary_conninfo is set, launch walreceiver to try - * to stream the missing WAL. - * - * If fetching_ckpt is true, RecPtr points to the initial - * checkpoint location. In that case, we use RedoStartLSN - * as the streaming start position instead of RecPtr, so - * that when we later jump backwards to start redo at - * RedoStartLSN, we will have the logs streamed already. - */ - if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0) - { - XLogRecPtr ptr; - TimeLineID tli; - - if (fetching_ckpt) - { - ptr = RedoStartLSN; - tli = ControlFile->checkPointCopy.ThisTimeLineID; - } - else - { - ptr = RecPtr; - - /* - * Use the record begin position to determine the - * TLI, rather than the position we're reading. - */ - tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); - - if (curFileTLI > 0 && tli < curFileTLI) -elog(ERROR, "according to history file, WAL lo
RE: Conflict handling for COPY FROM
Hello Surafel,

From: Surafel Temesgen
> An error that can surely be handled without transaction rollback can be included in error handling, but I would like to proceed without binary-file error handling for the time being

Thank you. It also seems that you have applied Alexey's comment, so I'll mark this patch as ready for committer.

Regards,
-- Takanori Asaba
Re: error context for vacuum to include block number
On Thu, Mar 26, 2020 at 07:49:51PM -0300, Alvaro Herrera wrote:
> > ... So once we've "reverted back", 1) the pointer is null; and, 2) the callback function doesn't access it for the previous/reverted phase anyway.
>
> BTW I'm pretty sure this "revert back" phrasing is not good English -- you should just use "revert". Maybe get some native speaker's opinion on it.

I'm a native speaker; "revert back" might be called redundant but I think it's common usage.

> And speaking of language, I find the particle "cbarg" rather very ugly, and it's *everywhere* -- function name, function argument, local variable, enum values, enum name. It even spread to the typedefs.list file! Is this a new virus??? Put some soap in it! Can't we use "info" or "state" or something similar, less infectious, instead?

I renamed it since it was kind of opaque looking. It's in all the same places, so equally infectious; but I hope you like it better.

Cheers,
-- Justin

>From 4d33dc1c125690b48dc0028c63a663db53ac0311 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 12 Dec 2019 20:54:37 -0600 Subject: [PATCH v37 1/3] Introduce vacuum errcontext to display additional information. The additional information displayed will be the block number for errors occurring while processing the heap, and the index name for errors occurring while processing an index. This will help us diagnose problems that occur during a vacuum. For example, if we get some error while vacuuming due to corruption (caused either by bad hardware or by some bug), it can help us identify the block in the heap and/or additional index information. It sets up an error context callback to display additional information with the error. During the different phases of vacuum (heap scan, heap vacuum, index vacuum, index cleanup, heap truncate), we update the error context callback to display the appropriate information.
We can extend it to a bit more granular level like adding the phases for FSM operations or for prefetching the blocks while truncating. However, I felt that it requires adding many more error callback function calls and can make the code a bit complex, so left those for now. Author: Justin Pryzby, with few changes by Amit Kapila Reviewed-by: Alvaro Herrera, Amit Kapila, Andres Freund, Michael Paquier and Sawada Masahiko Discussion: https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com --- src/backend/access/heap/vacuumlazy.c | 240 --- src/tools/pgindent/typedefs.list | 1 + 2 files changed, 216 insertions(+), 25 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 03c43efc32..973f411caf 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -144,6 +144,17 @@ */ #define ParallelVacuumIsActive(lps) PointerIsValid(lps) +/* Phases of vacuum during which we report error context. */ +typedef enum +{ + VACUUM_ERRCB_PHASE_UNKNOWN, + VACUUM_ERRCB_PHASE_SCAN_HEAP, + VACUUM_ERRCB_PHASE_VACUUM_INDEX, + VACUUM_ERRCB_PHASE_VACUUM_HEAP, + VACUUM_ERRCB_PHASE_INDEX_CLEANUP, + VACUUM_ERRCB_PHASE_TRUNCATE +} VacErrPhase; + /* * LVDeadTuples stores the dead tuple TIDs collected during the heap scan. 
* This is allocated in the DSM segment in parallel mode and in local memory @@ -270,6 +281,8 @@ typedef struct LVParallelState typedef struct LVRelStats { + char *relnamespace; + char *relname; /* useindex = true means two-pass strategy; false means one-pass */ bool useindex; /* Overall statistics about rel */ @@ -290,8 +303,12 @@ typedef struct LVRelStats int num_index_scans; TransactionId latestRemovedXid; bool lock_waiter_detected; -} LVRelStats; + /* Used for error callback */ + char *indname; + BlockNumber blkno; /* used only for heap operations */ + VacErrPhase phase; +} LVRelStats; /* A few variables that don't seem worth passing around as parameters */ static int elevel = -1; @@ -314,10 +331,10 @@ static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel, LVRelStats *vacrelstats, LVParallelState *lps, int nindexes); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, - LVDeadTuples *dead_tuples, double reltuples); + LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats); static void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult **stats, - double reltuples, bool estimated_count); + double reltuples, bool estimated_count, LVRelStats *vacrelstats); static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); static bool should_attempt_truncation(VacuumParams *params, @@ -337,13 +354,13 @@ static void lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult * int nindexes); static void parallel_vacuum_index(Relation *Irel, IndexBul
Re: error context for vacuum to include block number
On 2020-Mar-26, Justin Pryzby wrote:

> On Thu, Mar 26, 2020 at 10:04:57AM -0500, Justin Pryzby wrote:
> > And I think you're right: we only save state when the calling function has indname=NULL, so we never "put back" a non-NULL indname. We go from having indname=NULL at lazy_scan_heap to not-NULL at lazy_vacuum_index, and never the other way around.
>
> I removed the free_oldindname argument.

Hah, I was wondering about that free_oldindname business this morning as well.

> > ... So once we've "reverted back", 1) the pointer is null; and, 2) the callback function doesn't access it for the previous/reverted phase anyway.

BTW I'm pretty sure this "revert back" phrasing is not good English -- you should just use "revert". Maybe get some native speaker's opinion on it.

And speaking of language, I find the particle "cbarg" rather very ugly, and it's *everywhere* -- function name, function argument, local variable, enum values, enum name. It even spread to the typedefs.list file! Is this a new virus??? Put some soap in it! Can't we use "info" or "state" or something similar, less infectious, instead?

Thanks

-- Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On Thu, Mar 26, 2020 at 06:56:36PM -0300, Alvaro Herrera wrote:
> Uh, is XLOGDIR the only reason to include xlog_internal.h? Maybe it would be easier to have a routine in xlog.c that returns the path? There's a few functions in xlog.c that could use it, as well.

Yep, that's all. We could also just hardcode the path as we did when we worked on the exclusion filter lists for pg_rewind and basebackup.c, though I'd prefer to avoid that if possible.

> The patch downthread looks like decent cleanup, but I'm not sure how it helps further the objective.

I actually think it does, because it gets out of the way the conflicts caused by RestoreArchivedFile() in the backend, as we are targeting for fe_archive.c to be a frontend-only file, though I agree that it is not the full of it.

> (A really good cleanup could be a situation where frontend files don't need xlog_internal.h -- for example, maybe a new file xlogpage.h could contain struct defs that relate to page and segment headers and the like, as well as useful macros. I don't know if this can be made to work -- but xlog_internal.h contains stuff like xl_parameter_change etc as well as RmgrData which surely are of no interest to readers of wal files ... or, say, RequestXLogSwitch.)

A second header that could be created is xlogpaths.h (or xlogfiles.h?) that includes all the routines and variables we use to build WAL segment names and such, with XLogFileNameById, IsTLHistoryFileName, etc. I agree that splitting the record-related parts may make sense, say as xlogrecovery.h?

> I don't think any such cleanup should hamper the patch at hand anyway.

I don't think so either, so I would do the split for the archive routines at least. It still feels strange to me to have a different routine name for the frontend side of RestoreArchivedFile(). That's not really consistent with the current practice we have with palloc() & co, as well as the sync routines.

-- Michael
Re: Include sequence relation support in logical replication
Hi Andres, thanks for your reply and your patch review. Please see my comments below. >On 2020-03-24 16:19:21 -0700, Cary Huang wrote: >> I have shared a patch that allows sequence relation to be supported in >> logical replication via the decoding plugin ( test_decoding for >> example ); it does not support sequence relation in logical >> replication between a PG publisher and a PG subscriber via pgoutput >> plugin as it will require much more work. > > Could you expand on "much more work"? Once decoding support is there, > that shouldn't be that much? By much more work, I meant more source files will need to be changed to have sequence replication supported between a PG subscriber and publisher using the pgoutput plugin. About 10 more source file changes. The idea is similar though. >> Sequence changes caused by other sequence-related SQL functions like >> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so >> replicating changes caused by these should not be a problem. > > I think this really would need to handle at the very least setval to > make sense. yes, sure >> For the replication to make sense, the patch actually disables the WAL >> update at every 32 nextval() calls, so every call to nextval() will >> emit a WAL update for proper replication. This is done by setting >> SEQ_LOG_VALS to 0 in sequence.c > > Why is that needed? ISTM updating in increments of 32 is fine for > replication purposes? It's good imo, because sending out more granular > increments would increase the size of the WAL stream? Yes, updating the WAL at every 32 increments is good and has huge performance benefits according to Michael, but when it is replicated logically to subscribers, the sequence value they receive would not make much sense to them. For example, if I have a sequence called "seq" with current value = 100 and increment = 5.
The nextval('seq') call will return 105 to the client, but it will write 260 to the WAL record ( 100 + (5*32) ), because that is the value after 32 increments. Internally it also maintains a "log_cnt" counter that tracks how many nextval() calls have been invoked since the last WAL write, so it can derive backwards to find the proper next value to return to the client. But the subscriber for this sequence will receive a change of 260 instead of 105, which does not represent the current sequence status. The subscriber is not able to derive backwards because it does not know the increment size in its schema. Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this 32-increment behavior and makes the WAL update happen at every nextval() call, and therefore the subscriber to this sequence will receive the same value (105) as the client, at the cost of more WAL writes. I would like to ask if you have some suggestions or ideas that would let the subscriber receive the current value without disabling the 32-increment behavior? >> diff --git a/contrib/test_decoding/test_decoding.c >> b/contrib/test_decoding/test_decoding.c >> index 93c948856e..7a7e572d6c 100644 >> --- a/contrib/test_decoding/test_decoding.c >> +++ b/contrib/test_decoding/test_decoding.c >> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, >> ReorderBufferTXN *txn, >> &change->data.tp.oldtuple->tuple, >> true); >> break; >> +case REORDER_BUFFER_CHANGE_SEQUENCE: >> +appendStringInfoString(ctx->out, " SEQUENCE:"); >> +if (change->data.sequence.newtuple == NULL) >> +appendStringInfoString(ctx->out, " >> (no-tuple-data)"); >> +else >> +tuple_to_stringinfo(ctx->out, tupdesc, >> + >> &change->data.sequence.newtuple->tuple, >> +false); >> +break; >> default: >> Assert(false); >> } > > You should also add tests - the main purpose of contrib/test_decoding is > to facilitate writing those... 
Thanks, I will add. >> +ReorderBufferXidSetCatalogChanges(ctx->reorder, >> XLogRecGetXid(buf->record), buf->origptr); > > Huh, why are you doing this? That's going to increase overhead of logical > decoding by many times? This is needed to allow a snapshot to be built inside the DecodeCommit() function. Regular changes caused by INSERT also have this function called, so I assume it is needed to ensure proper decoding. Without this, a snapshot will not be built and the change transaction will not be logged. >> +case REORDER_BUFFER_CHANGE_SEQUENCE: >> +Assert(snapshot_now); >> + >> +reloid = >> RelidByRelfilenode(change->data.sequence.relnode.spcNode, >> +
Re: error context for vacuum to include block number
On Thu, Mar 26, 2020 at 10:04:57AM -0500, Justin Pryzby wrote: > Does that address your comment ? I hope so. > > I'm not sure why "free_oldindname" is necessary. Since we initialize > > vacrelstats->indname with NULL and revert the callback arguments at > > the end of functions that need to update them, vacrelstats->indname is > > NULL at the beginning of lazy_vacuum_index() and lazy_cleanup_index(). > > And we make a copy of index name in update_vacuum_error_cbarg(). So I > > think we can pfree the old index name if errcbarg->indname is not NULL. > > We want to avoid doing this: > olderrcbarg = *vacrelstats // saves a pointer > update_vacuum_error_cbarg(... NULL); // frees the pointer and sets indname > to NULL > update_vacuum_error_cbarg(... olderrcbarg.oldindnam) // puts back the > pointer, which has been freed > // hit an error, and the callback accesses the pfreed pointer > > I think that's only an issue for lazy_vacuum_index(). > > And I think you're right: we only save state when the calling function has an > indname=NULL, so we never "put back" a non-NULL indname. We go from having an > indname=NULL at lazy_scan_heap to a not-NULL one at lazy_vacuum_index, and never > the other way around. So once we've "reverted back", 1) the pointer is null; > and, 2) the callback function doesn't access it for the previous/reverted > phase > anyway. I removed the free_oldindname argument. > Hm, I was just wondering what happens if an error happens *during* > update_vacuum_error_cbarg(). It seems like if we set > errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname, then an > error would cause a crash. And if we pfree and set indname before phase, it'd > be a problem when going from an index phase to non-index phase. So maybe we > have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the function, > and errcbarg->phase=phase last. And addressed that. Also, I realized that lazy_cleanup_index has an early "return", so the "Revert back" was ineffective. 
We talked about how that wasn't needed, since we never go back to a previous phase. Amit wanted to keep it there for consistency, but I'd prefer to put any extra effort into calling out the special treatment needed/given to lazy_vacuum_heap/index, rather than making everything "consistent". Amit: I also moved the TRUNCATE_HEAP bit back to truncate_heap(), since 1) it's odd if we don't have anything in truncate_heap() about error reporting except for "vacrelstats->blkno = blkno"; and, 2) it's nice to set the err callback arg right after pgstat_progress, and outside of any loop. In previous versions, it was within the loop, because it closely wrapped RelationTruncate() and count_nondeletable_pages() - a previous version used separate phases. -- Justin >From bfc574979f85c0f0722d182ae8ae03097fb5f9c4 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 12 Dec 2019 20:54:37 -0600 Subject: [PATCH v36 1/3] Introduce vacuum errcontext to display additional information. The additional information displayed will be the block number for an error occurring while processing the heap, and the index name for an error occurring while processing an index. This will help us in diagnosing the problems that occur during a vacuum. For example, due to corruption (caused either by bad hardware or by some bug), if we get some error while vacuuming, it can help us identify the block in the heap and/or the index involved. It sets up an error context callback to display additional information with the error. During different phases of vacuum (heap scan, heap vacuum, index vacuum, index cleanup, heap truncate), we update the error context callback to display appropriate information. We could extend it to a more granular level, like adding phases for FSM operations or for prefetching the blocks while truncating. However, I felt that it would require adding many more error callback function calls and could make the code a bit complex, so I left those for now. 
Author: Justin Pryzby, with few changes by Amit Kapila Reviewed-by: Alvaro Herrera, Amit Kapila, Andres Freund, Michael Paquier and Sawada Masahiko Discussion: https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com --- src/backend/access/heap/vacuumlazy.c | 240 --- src/tools/pgindent/typedefs.list | 1 + 2 files changed, 216 insertions(+), 25 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 03c43efc32..e98e6b45d3 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -144,6 +144,17 @@ */ #define ParallelVacuumIsActive(lps) PointerIsValid(lps) +/* Phases of vacuum during which we report error context. */ +typedef enum +{ + VACUUM_ERRCB_PHASE_UNKNOWN, + VACUUM_ERRCB_PHASE_SCAN_HEAP, + VACUUM_ERRCB_PHASE_VACUUM_INDEX, + VACUUM_ERRCB_PHASE_VACUUM_HEAP, + VACUUM_ERRCB_PHASE_INDEX_CLEANUP, + VACUUM_ERRCB_PHASE_TRUNCATE +} VacErrCbPhase; + /* * LVDeadTup
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-Mar-26, Michael Paquier wrote: > I was looking at this patch again today and I am rather fine with the > existing semantics. Still I don't like much to name the frontend-side > routine FrontendRestoreArchivedFile() and use a different name than > the backend counterpart because we have to include xlog_internal.h in > fe_archive.c to be able to grab XLOGDIR. Uh, is XLOGDIR the only reason to include xlog_internal.h? Maybe it would be easier to have a routine in xlog.c that returns the path? There's a few functions in xlog.c that could use it, as well. Although ... looking at xlog.h, that one is even worse, so I'm not sure *where* you'd put the prototype for the new function I'm proposing. > So here is an idea: let's move the declaration of the routines part of > xlogarchive.c to a new header, called xlogarchive.h, and then let's > use the same routine name for the frontend and the backend in this > second patch. We include xlog_internal.h already in many frontend > tools, so that would clean up things a bit. The patch downthread looks decent cleanup, but I'm not sure how it helps further the objective. (A really good cleanup could be a situation where frontend files don't need xlog_internal.h -- for example, maybe a new file xlogpage.h could contain struct defs that relate to page and segment headers and the like, as well as useful macros. I don't know if this can be made to work -- but xlog_internal.h contains stuff like xl_parameter_change etc as well as RmgrData which surely are of no interest to readers of wal files ... or, say, RequestXLogSwitch.) I don't think any such cleanup should hamper the patch at hand anyway. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgbench - add \aset to store results of a combined query
Bonjour Michaël, [...] Still sounds strange to me to invent a new variable to this structure if it is possible to track the exact same thing with an existing part of a Command, or it would make sense to split Command into two different structures with an extra structure used after the parsing for clarity? Hmmm. Your point is to store the gset/aset status into the meta field, even if the command type is SQL. This is not done for gset, which relies on the non-null prefix, and breaks the assumption that meta is set to something only when the command is a meta command. Why not. I updated the comment, so now meta is none/gset/aset when command type is sql, and I removed the aset field. Well, it still looks cleaner to me to just assign the meta field properly within ParseScript(), and you get the same result. And it is also possible to use "meta" to do more sanity checks around META_GSET for some code paths. So I'd actually find the addition of a new argument using a meta command within readCommandResponse() cleaner. I tried to do that. - * varprefix SQL commands terminated with \gset have this set + * varprefix SQL commands terminated with \gset or \aset have this set Nit from v4: varprefix can be used for \gset and \aset, and the comment was not updated. It is now updated. + /* coldly skip empty result under \aset */ + if (ntuples <= 0) + break; Shouldn't this check after \aset? And it seems to me that this code path is not taken, so a test would be nice. Added (I think, if I understood what you suggested.). - } while (res); + } while (res != NULL); Useless diff. Yep. Attached an updated v5. 
-- Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 41b3880c91..58a2aa3bf2 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -1057,18 +1057,29 @@ pgbench options d \gset [prefix] + \aset [prefix] - This command may be used to end SQL queries, taking the place of the + These commands may be used to end SQL queries, taking the place of the terminating semicolon (;). - When this command is used, the preceding SQL query is expected to - return one row, the columns of which are stored into variables named after - column names, and prefixed with prefix if provided. + When the \gset command is used, the preceding SQL query is + expected to return one row, the columns of which are stored into variables + named after column names, and prefixed with prefix + if provided. + + + + When the \aset command is used, all combined SQL queries + (separated by \;) have their columns stored into variables + named after column names, and prefixed with prefix + if provided. If a query returns no row, no assignment is made and the variable + can be tested for existence to detect this. If a query returns more than one + row, the last value is kept. @@ -1077,6 +1088,8 @@ pgbench options d p_two and p_three with integers from the third query. The result of the second query is discarded. + The result of the two last combined queries are stored in variables + four and five. 
UPDATE pgbench_accounts SET abalance = abalance + :delta @@ -1085,6 +1098,7 @@ UPDATE pgbench_accounts -- compound of two queries SELECT 1 \; SELECT 2 AS two, 3 AS three \gset p_ +SELECT 4 AS four \; SELECT 5 AS five \aset diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index b8864c6ae5..74aadeade0 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -480,6 +480,7 @@ typedef enum MetaCommand META_SHELL, /* \shell */ META_SLEEP, /* \sleep */ META_GSET, /* \gset */ + META_ASET, /* \aset */ META_IF, /* \if */ META_ELIF, /* \elif */ META_ELSE, /* \else */ @@ -504,14 +505,17 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"}; *not applied. * first_line A short, single-line extract of 'lines', for error reporting. * type SQL_COMMAND or META_COMMAND - * meta The type of meta-command, or META_NONE if command is SQL + * meta The type of meta-command, if command is SQL META_NONE, + *META_GSET or META_ASET which dictate what to do with the + * SQL query result. * argc Number of arguments of the command, 0 if not yet processed. * argv Command arguments, the first of which is the command or SQL *string itself. For SQL commands, after post-processing *argv[0] is the same as 'lines' with variables substituted. - * varprefix SQL commands terminated with \gset have this set + * varprefix SQL commands terminated with \gset or \aset have this set *to a non NULL value. If nonempty, it's used to prefix the *variable name that receives the value. + * a
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-03-26 10:34, Michael Paquier wrote: On Tue, Mar 24, 2020 at 12:22:16PM +0900, Michael Paquier wrote: Thanks Alvaro and Alexander. 0001 has been applied as of e09ad07. Now for 0002, let's see about it later. Attached is a rebased version, with no actual changes. I was looking at this patch again today and I am rather fine with the existing semantics. Still I don't like much to name the frontend-side routine FrontendRestoreArchivedFile() and use a different name than the backend counterpart because we have to include xlog_internal.h in fe_archive.c to be able to grab XLOGDIR. So here is an idea: let's move the declaration of the routines part of xlogarchive.c to a new header, called xlogarchive.h, and then let's use the same routine name for the frontend and the backend in this second patch. We include xlog_internal.h already in many frontend tools, so that would clean up things a bit. Two extra things are the routines for the checkpointer as well as the variables like ArchiveRecoveryRequested. It may be nice to move those while on it, but I am not sure where and that's not actually required for this patch set so that could be addressed later if need be. Any thoughts? The block of function declarations for xlogarchive.c inside xlog_internal.h looks a bit dangling already, since all the other functions and variables are defined/initialized in xlog.c. So it looks good to me to move it out. The only concern I have about using the same name is that later someone may introduce a new variable or typedef inside xlogarchive.h. Then someone else would be required to include both fe_archive.h and xlogarchive.h at once. But probably there should be a warning in the header comment section against doing so. Anyway, I have tried to do what you proposed and attached is a small patch that introduces xlogarchive.h. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 860a996414..d683af377f 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -36,6 +36,7 @@ #include "access/timeline.h" #include "access/xlog.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "access/xlogdefs.h" #include "pgstat.h" diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7621fc05e2..242427f815 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -32,6 +32,7 @@ #include "access/transam.h" #include "access/twophase.h" #include "access/xact.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "access/xloginsert.h" #include "access/xlogreader.h" diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 914ad340ea..04104b55ea 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -20,6 +20,7 @@ #include #include "access/xlog.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "common/archive.h" #include "miscadmin.h" diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 25e0333c9e..cc91954ac1 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -48,6 +48,7 @@ #include "access/htup_details.h" #include "access/timeline.h" #include "access/transam.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 27ded593ab..8e3cfcf83e 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -320,22 
+320,4 @@ extern bool InArchiveRecovery; extern bool StandbyMode; extern char *recoveryRestoreCommand; -/* - * Prototypes for functions in xlogarchive.c - */ -extern bool RestoreArchivedFile(char *path, const char *xlogfname, -const char *recovername, off_t expectedSize, -bool cleanupEnabled); -extern void ExecuteRecoveryCommand(const char *command, const char *commandName, - bool failOnSignal); -extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname); -extern void XLogArchiveNotify(const char *xlog); -extern void XLogArchiveNotifySeg(XLogSegNo segno); -extern void XLogArchiveForceDone(const char *xlog); -extern bool XLogArchiveCheckDone(const char *xlog); -extern bool XLogArchiveIsBusy(const char *xlog); -extern bool XLogArchiveIsReady(const char *xlog); -extern bool XLogArchiveIsReadyOrDone(const char *xlog); -extern void XLogArchiveCleanup(const char *xlog); - #endif /* XLOG_INTERNAL_H */ diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h new file mode 100644 index 00..3406fb2800 --- /d
Re: Berserk Autovacuum (let's save next Mandrill)
On Fri, 27 Mar 2020 at 07:51, Andres Freund wrote: > > Hi, > > On 2020-03-26 10:12:39 +0100, Laurenz Albe wrote: > > On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote: > > I am reluctant to introduce new semantics like a reloption value of -2 > > to disable a feature in this patch right before feature freeze. > > > > I believe there are enough options to disable insert-only vacuuming for > > an individual table: > > > - Set the threshold to 2147483647. True, that will not work for very > > large tables, but I think that there are few tables that insert that > > many rows before they hit autovacuum_freeze_max_age anyway. > > > > - Set the scale factor to some astronomical value. > > Meh. You *are* adding new semantics with these. And they're terrible. I've modified this to provide a proper way to disable the entire feature: the setting can now be set to -1 to disable it. I feel people are fairly used to using -1 to disable various features (e.g. log_autovacuum_min_duration). I've used the special value of -2 for the reloption to have that cascade to using the GUC instead. The autovacuum_vacuum_insert_threshold reloption may be explicitly set to -1 to disable autovacuums for inserts for the relation. I've also knocked the default threshold down to 1000. I feel this is a better value given that the scale factor is now 0.2. There seemed to be no need to exclude smaller tables from seeing gains such as index-only scans. If nobody objects, I plan to push this one shortly. David insert_autovac_v12.patch
Re: Ltree syntax improvement
On 25.03.2020 2:08, Tom Lane wrote: Nikita Glukhov writes: Attached new version of the patch. I spent a little bit of time looking through this, and have a few comments: * You have a lot of places where tests for particular ASCII characters are done like this: if ((charlen == 1) && t_iseq(src, '\\')) This is a tedious and expensive way to spell if (*src == '\\') because charlen is necessarily 1 if you are looking at an ASCII character; there is no allowed backend encoding in which an ASCII character can be the first byte of a multibyte character. Aside from the direct simplifications of the tests that this makes possible, I see some places where you'd not have to pass charlen around, either. All unnecessary checks of charlen were removed, but the t_iseq() calls were left for consistency. * I spent a fair amount of time thinking that a lot of the added code was wrong because it was only considering escaping and not double-quoting. I eventually concluded that the idea is to convert double-quoting to a pure escape-based representation during input and store it that way. However, I don't really see why that is either necessary or a good idea --- the internal storage already has a length counter for each label. So I think what you really ought to be doing here is simplifying out both quotes and escapes during ltree_in and just storing the notionally-represented string internally. (If I've misunderstood what the plan is, well the utter lack of documentation in the patch isn't helping.) ltree_in() removes quotes and escapes before storing strings (see copy_unescaped()), just as you suggest. ltree_out() adds escapes and quotes if necessary (see copy_escaped(), extra_bytes_for_escaping()). I have refactored the code a bit, removed duplicated code, fixed several bugs in the reallocation of output strings, and added some comments. * The added test cases seem a bit excessive and repetitive. I have removed some tests that had become redundant after the changes in parsing. 
-- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From 8ec6f93203684d63bf3c5d006c2499a71a1a9dad Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Thu, 5 Mar 2020 17:34:59 +0300 Subject: [PATCH 1/2] Replace 'if' with 'switch' in ltree code --- contrib/ltree/ltree_io.c | 402 --- 1 file changed, 204 insertions(+), 198 deletions(-) diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c index 900a46a..e97f035 100644 --- a/contrib/ltree/ltree_io.c +++ b/contrib/ltree/ltree_io.c @@ -69,40 +69,43 @@ ltree_in(PG_FUNCTION_ARGS) { charlen = pg_mblen(ptr); - if (state == LTPRS_WAITNAME) + switch (state) { - if (ISALNUM(ptr)) - { -lptr->start = ptr; -lptr->wlen = 0; -state = LTPRS_WAITDELIM; - } - else -UNCHAR; - } - else if (state == LTPRS_WAITDELIM) - { - if (charlen == 1 && t_iseq(ptr, '.')) - { -lptr->len = ptr - lptr->start; -if (lptr->wlen > 255) - ereport(ERROR, - (errcode(ERRCODE_NAME_TOO_LONG), - errmsg("name of level is too long"), - errdetail("Name length is %d, must " - "be < 256, in position %d.", - lptr->wlen, pos))); - -totallen += MAXALIGN(lptr->len + LEVEL_HDRSIZE); -lptr++; -state = LTPRS_WAITNAME; - } - else if (!ISALNUM(ptr)) -UNCHAR; + case LTPRS_WAITNAME: +if (ISALNUM(ptr)) +{ + lptr->start = ptr; + lptr->wlen = 0; + state = LTPRS_WAITDELIM; +} +else + UNCHAR; +break; + + case LTPRS_WAITDELIM: +if (charlen == 1 && t_iseq(ptr, '.')) +{ + lptr->len = ptr - lptr->start; + if (lptr->wlen > 255) + ereport(ERROR, +(errcode(ERRCODE_NAME_TOO_LONG), + errmsg("name of level is too long"), + errdetail("Name length is %d, must " + "be < 256, in position %d.", + lptr->wlen, pos))); + + totallen += MAXALIGN(lptr->len + LEVEL_HDRSIZE); + lptr++; + state = LTPRS_WAITNAME; +} +else if (!ISALNUM(ptr)) + UNCHAR; +break; + + default: +/* internal error */ +elog(ERROR, "internal error in parser"); } - else - /* internal error */ - elog(ERROR, "internal error in parser"); ptr += charlen; lptr->wlen++; @@ -238,178 
+241,181 @@ lquery_in(PG_FUNCTION_ARGS) { charlen = pg_mblen(ptr); - if (state == LQPRS_WAITLEVEL) - { - if (ISALNUM(ptr)) - { -GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * (numOR + 1)); -lptr->start = ptr; -state = LQPRS_WAITDELIM; -curqlevel->numvar = 1; - } - else if (charlen == 1 && t_iseq(ptr, '!')) - { -GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * (numOR + 1)); -lptr->start = ptr + 1; -state = LQPRS_WAITDELIM; -curqlevel->numvar = 1; -curqlevel->flag |= LQL_NOT; -hasnot = true; - } - else if (charlen == 1 && t_iseq(ptr, '*')) -state =
Re: plan cache overhead on plpgsql expression
Andres Freund writes: > On 2020-03-26 14:37:59 -0400, Tom Lane wrote: >> Testing that reminded me of the other regression test failure I'd seen >> when I first tried to do it: select_parallel.sql shows a WARNING about >> a plancache leak in a parallel worker process. When I looked into the >> reason for that, it turned out that some cowboy has split >> XACT_EVENT_COMMIT into XACT_EVENT_COMMIT and >> XACT_EVENT_PARALLEL_COMMIT (where the latter is used in parallel >> workers) without bothering to fix the collateral damage to plpgsql. >> So plpgsql_xact_cb isn't doing any cleanup in parallel workers, and >> hasn't been for a couple of releases. >> The bad effects of that are probably limited given that the worker >> process will exit after committing, but I still think that that part >> of this patch is a bug fix that needs to be back-patched. > Ugh. Lucky that we don't register many things inside those resowners. Yeah. I spent some time trying to produce a failure this way, and concluded that it's pretty hard because most of the relevant callbacks will be run during ExprContext shutdown, which is done during plpgsql function exit. In a non-transaction-abort situation, the simple EState shouldn't have any live ExprContexts left at commit. I did find a case where a memory context callback attached to the EState's query context doesn't get run when expected ... but it still gets run later, when the whole memory context tree is destroyed. So I can't demonstrate any user-visible misbehavior in the core code. But it still seems like a prudent idea to back-patch a fix, in case I missed something or there is some extension that's pushing the boundaries further. It's definitely not very cool that we're leaving behind a dangling static pointer to an EState that won't exist once TopTransactionMemoryContext is gone. I'll back-patch relevant parts of those comments about DO block management, too. regards, tom lane
Re: backup manifests
Greetings, * Mark Dilger (mark.dil...@enterprisedb.com) wrote: > > On Mar 26, 2020, at 12:37 PM, Stephen Frost wrote: > > * Mark Dilger (mark.dil...@enterprisedb.com) wrote: > >>> On Mar 26, 2020, at 9:34 AM, Stephen Frost wrote: > >>> I'm not actually arguing about which hash functions we should support, > >>> but rather what the default is and if crc32c, specifically, is actually > >>> a reasonable choice. Just because it's fast and we already had an > >>> implementation of it doesn't justify its use as the default. Given that > >>> it doesn't actually provide the check that is generally expected of > >>> CRC checksums (100% detection of single-bit errors) when the file size > >>> gets over 512MB makes me wonder if we should have it at all, yes, but it > >>> definitely makes me think it shouldn't be our default. > >> > >> I don't understand your focus on the single-bit error issue. > > > > Maybe I'm wrong, but my understanding was that detecting single-bit > > errors was one of the primary design goals of CRC and why people talk > > about CRCs of certain sizes having 'limits'- that's the size at which > > single-bit errors will no longer, necessarily, be picked up and > > therefore that's where the CRC of that size starts falling down on that > > goal. > > I think I agree with all that. I'm not sure it is relevant. When people use > CRCs to detect things *other than* transmission errors, they are in some > sense using a hammer to drive a screw. At that point, the analysis of how > good the hammer is, and how big a nail it can drive, is no longer relevant. > The relevant discussion here is how appropriate a CRC is for our purpose. I > don't know the answer to that, but it doesn't seem the single-bit error > analysis is the right analysis. I disagree that it's not relevant- it's, in fact, the one really clear thing we can get a pretty straight-forward answer on, and that seems really useful to me. 
> >> If you are sending your backup across the wire, single bit errors during > >> transmission should already be detected as part of the networking > >> protocol. The real issue has to be detection of the kinds of errors or > >> modifications that are most likely to happen in practice. Which are > >> those? People manually mucking with the files? Bugs in backup scripts? > >> Corruption on the storage device? Truncated files? The more bits in the > >> checksum (assuming a well designed checksum algorithm), the more likely we > >> are to detect accidental modification, so it is no surprise if a 64-bit > >> crc does better than 32-bit crc. But that logic can be taken arbitrarily > >> far. I don't see the connection between, on the one hand, an analysis of > >> single-bit error detection against file size, and on the other hand, the > >> verification of backups. > > > > We'd like something that does a good job at detecting any differences > > between when the file was copied off of the server and when the command > > is run- potentially weeks or months later. I would expect most issues > > to end up being storage-level corruption over time where the backup is > > stored, which could be single bit flips or whole pages getting zeroed or > > various other things. Files changing size probably is one of the less > > common things, but, sure, that too. > > > > That we could take this "arbitrarily far" is actually entirely fine- > > that's a good reason to have alternatives, which this patch does have, > > but that doesn't mean we should have a default that's not suitable for > > the files that we know we're going to be storing. > > > > Consider that we could have used a 16-bit CRC instead, but does that > > actually make sense? Ok, sure, maybe someone really wants something > > super fast- but should that be our default? If not, then what criteria > > should we use for the default? 
> > I'll answer this below > > >> From a support perspective, I think the much more important issue is > >> making certain that checksums are turned on. A one in a billion chance of > >> missing an error seems pretty acceptable compared to the, let's say, one > >> in two chance that your customer didn't use checksums. Why are we even > >> allowing this to be turned off? Is there a usage case compelling that > >> option? > > > > The argument is that adding checksums takes more time. I can understand > > that argument, though I don't really agree with it. Certainly a few > > percent really shouldn't be that big of an issue, and in many cases even > > a sha256 hash isn't going to have that dramatic of an impact on the > > actual overall time. > > I see two dangers here: > > (1) The user enables checksums of some type, and due to checksums not being > perfect, corruption happens but goes undetected, leaving her in a bad place. > > (2) The user makes no checksum selection at all, gets checksums of the > *default* type, determines it is too slow for her purposes, and instead of
Re: NOT IN subquery optimization
>BTW, so far as I can see, the only reason you're bothering with the whole thing is to compare the size of the subquery output with work_mem, because that's all that subplan_is_hashable does. I wonder whether that consideration is even still necessary in the wake of 1f39bce02. If it is, I wonder whether there isn't a cheaper way to figure it out. (Note similar comment in make_subplan.)

The comment in make_subplan says there is no cheaper way to figure it out:

/* At present, however, we can only check hashability after
 * we've made the subplan :-(. (Determining whether it'll fit in work_mem
 * is the really hard part.) */

I don't see why commit 1f39bce02 is related to this problem. Can you expand on this?

>But can't you detect that case directly? It seems like you'd need to figure out the NULL situation anyway to know whether the transformation to antijoin is valid in the first place.

Yes, we do need to figure out the NULL situation, and there is always a valid transformation to antijoin; it's just that in the NULL case we need to add an additional clause to the anti-join condition. In those cases the transformation actually outperforms the non-hashed Subplan but underperforms the hashed Subplan. The unmodified anti hash join has similar performance compared to the hashed Subplan.
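[Editor's note: the NULL hazard behind the "additional clause" above can be made concrete with a standalone sketch of SQL's three-valued logic; this is an illustration, not code from the patch. A single NULL in the subquery output keeps NOT IN from ever returning TRUE for a non-matching row, which is why a plain anti-join is not equivalent.]

```python
# SQL three-valued logic: comparisons with NULL yield UNKNOWN (here: None).
# `x NOT IN (v1, v2, ...)` means NOT (x = v1 OR x = v2 OR ...).
def sql_eq(a, b):
    return None if a is None or b is None else a == b

def sql_not_in(x, values):
    result = True                  # NOT IN over an empty list is TRUE
    for v in values:
        eq = sql_eq(x, v)
        if eq is True:
            return False           # a match makes NOT IN definitely FALSE
        if eq is None:
            result = None          # UNKNOWN propagates unless a match wins
    return result

print(sql_not_in(1, [2, 3]))       # True  -> row qualifies
print(sql_not_in(1, [1, None]))    # False -> match found despite the NULL
print(sql_not_in(1, [2, None]))    # None  -> UNKNOWN, row is filtered out
```

Note the last case: the row is filtered out not because a match was found but because the result is UNKNOWN, the behavior an anti-join must reproduce.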
Re: backup manifests
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost wrote: > > I do agree with excluding things like md5 and others that aren't good > > options. I wasn't saying we should necessarily exclude crc32c either.. > > but rather saying that it shouldn't be the default. > > > > Here's another way to look at it- where do we use crc32c today, and how > > much data might we possibly be covering with that crc? > > WAL record size is a 32-bit unsigned integer, so in theory, up to 4GB > minus 1 byte. In practice, most of them are not more than a few > hundred bytes, the amount we might possibly be covering is a lot more. Is it actually possible, today, in PG, to have a 4GB WAL record? Judging this based on the WAL record size doesn't seem quite right. > > Why was crc32c > > picked for that purpose? > > Because it was discovered that 64-bit CRC was too slow, per commit > 21fda22ec46deb7734f793ef4d7fa6c226b4c78e. ... 15 years ago. I actually find it pretty interesting that we started out with a 64bit CRC there, I didn't know that was the case. Also interesting is that we had 64bit CRC code already. > > If the individual who decided to pick crc32c > > for that case was contemplating a checksum for up-to-1GB files, would > > they have picked crc32c? Seems unlikely to me. > > It's hard to be sure what someone who isn't us would have done in some > situation that they didn't face, but we do have the discussion thread: > > https://www.postgresql.org/message-id/flat/9291.1117593389%40sss.pgh.pa.us#c4e413bbf3d7fbeced7786da1c3aca9c > > The question of how much data is protected by the CRC was discussed, > mostly in the first few messages, in general terms, but it doesn't > seem to have covered the question very thoroughly. I'm sure we could > each draw things from that discussion that support our view of the > situation, but I'm not sure it would be very productive. Interesting. 
> What confuses me is that you seem to have a view of the upsides and > downsides of these various algorithms that seems to me to be highly > skewed. Like, suppose we change the default from CRC-32C to > SHA-something. On the upside, the error detection rate will increase > from 99.999+% to something much closer to 100%. On the downside, > backups will get as much as 40-50% slower for some users. I hope we > can agree that both detecting errors and taking backups quickly are > important. However, it is hard for me to imagine that the typical user > would want to pay even a 5-10% performance penalty when taking a > backup in order to improve an error detection feature which they may > not even use and which already has less than a one-in-a-billion chance > of going wrong. We routinely reject features for causing, say, a 2% > regression on general workloads. Base backup speed is probably less > important than how many SELECT or INSERT queries you can pump through > the system in a second, but it's still a pain point for lots of > people. I think if you said to some users "hey, would you like to have > error detection for your backups? it'll cost 10%" many people would > say "yes, please." But I think if you went to the same users and said > "hey, would you like to make the error detection for your backups > better? it currently has a less than 1-in-a-billion chance of failing > to detect random corruption, and you can reduce that by many orders of > magnitude for an extra 10% on your backup time," I think the results > would be much more mixed. Some people would like it, but certainly > not everybody. I think you're right that base backup speed is much less of an issue to slow down than SELECT or INSERT workloads, but I do also understand that it isn't completely unimportant, which is why having options isn't a bad idea here. 
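[Editor's note: readers can get a rough feel for the throughput gap being argued about with a direct timing sketch. This is illustrative only: the numbers are machine-dependent, and Python's zlib.crc32 (software CRC-32) merely stands in for PostgreSQL's CRC-32C, which is often hardware-accelerated and faster still.]

```python
import hashlib
import time
import zlib

# Time a CRC and a SHA-256 over the same dummy "backup" payload.
payload = b"\x5a" * (16 * 1024 * 1024)  # 16 MiB of filler data

start = time.perf_counter()
zlib.crc32(payload)
crc_secs = time.perf_counter() - start

start = time.perf_counter()
hashlib.sha256(payload).hexdigest()
sha_secs = time.perf_counter() - start

print(f"crc32:  {crc_secs:.4f}s")
print(f"sha256: {sha_secs:.4f}s")
```

On typical hardware the SHA-256 pass takes several times longer than the CRC pass, which is the shape of the trade-off in the quoted 40-50% figure, though real backups add I/O and compression costs that dilute it.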
That said, the options presented for users should all be reasonable options, and for the default we should pick something sensible, erring on the "be safer" side, if anything. There's lots of options for speeding up base backups, with this patch, even if the default is to have a manifest with sha256 hashes- it could be changed to some form of CRC, or changed to not have checksums, or changed to not have a manifest. Users will have options. Again, I'm not against having a checksum algorithm as an option. I'm not saying that it must be SHA512 as the default. > > I'm not actually arguing about which hash functions we should support, > > but rather what the default is and if crc32c, specifically, is actually > > a reasonable choice. Just because it's fast and we already had an > > implementation of it doesn't justify its use as the default. Given that > > it doesn't actually provide the check that is generally expected of > > CRC checksums (100% detection of single-bit errors) when the file size > > gets over 512MB makes me wonder if we should have it at all, yes, but it > > definitely makes me think it shouldn't be our default. > > I mean, the pro
Re: backup manifests
> On Mar 26, 2020, at 12:37 PM, Stephen Frost wrote: > > Greetings, > > * Mark Dilger (mark.dil...@enterprisedb.com) wrote: >>> On Mar 26, 2020, at 9:34 AM, Stephen Frost wrote: >>> I'm not actually arguing about which hash functions we should support, >>> but rather what the default is and if crc32c, specifically, is actually >>> a reasonable choice. Just because it's fast and we already had an >>> implementation of it doesn't justify its use as the default. Given that >>> it doesn't actually provide the check that is generally expected of >>> CRC checksums (100% detection of single-bit errors) when the file size >>> gets over 512MB makes me wonder if we should have it at all, yes, but it >>> definitely makes me think it shouldn't be our default. >> >> I don't understand your focus on the single-bit error issue. > > Maybe I'm wrong, but my understanding was that detecting single-bit > errors was one of the primary design goals of CRC and why people talk > about CRCs of certain sizes having 'limits'- that's the size at which > single-bit errors will no longer, necessarily, be picked up and > therefore that's where the CRC of that size starts falling down on that > goal. I think I agree with all that. I'm not sure it is relevant. When people use CRCs to detect things *other than* transmission errors, they are in some sense using a hammer to drive a screw. At that point, the analysis of how good the hammer is, and how big a nail it can drive, is no longer relevant. The relevant discussion here is how appropriate a CRC is for our purpose. I don't know the answer to that, but it doesn't seem the single-bit error analysis is the right analysis. >> If you are sending your backup across the wire, single bit errors during >> transmission should already be detected as part of the networking protocol. >> The real issue has to be detection of the kinds of errors or modifications >> that are most likely to happen in practice. Which are those? 
People >> manually mucking with the files? Bugs in backup scripts? Corruption on the >> storage device? Truncated files? The more bits in the checksum (assuming a >> well designed checksum algorithm), the more likely we are to detect >> accidental modification, so it is no surprise if a 64-bit crc does better >> than 32-bit crc. But that logic can be taken arbitrarily far. I don't see >> the connection between, on the one hand, an analysis of single-bit error >> detection against file size, and on the other hand, the verification of >> backups. > > We'd like something that does a good job at detecting any differences > between when the file was copied off of the server and when the command > is run- potentially weeks or months later. I would expect most issues > to end up being storage-level corruption over time where the backup is > stored, which could be single bit flips or whole pages getting zeroed or > various other things. Files changing size probably is one of the less > common things, but, sure, that too. > > That we could take this "arbitrarily far" is actually entirely fine- > that's a good reason to have alternatives, which this patch does have, > but that doesn't mean we should have a default that's not suitable for > the files that we know we're going to be storing. > > Consider that we could have used a 16-bit CRC instead, but does that > actually make sense? Ok, sure, maybe someone really wants something > super fast- but should that be our default? If not, then what criteria > should we use for the default? I'll answer this below >> From a support perspective, I think the much more important issue is making >> certain that checksums are turned on. A one in a billion chance of missing >> an error seems pretty acceptable compared to the, let's say, one in two >> chance that your customer didn't use checksums. Why are we even allowing >> this to be turned off? Is there a usage case compelling that option? 
> > The argument is that adding checksums takes more time. I can understand > that argument, though I don't really agree with it. Certainly a few > percent really shouldn't be that big of an issue, and in many cases even > a sha256 hash isn't going to have that dramatic of an impact on the > actual overall time. I see two dangers here: (1) The user enables checksums of some type, and due to checksums not being perfect, corruption happens but goes undetected, leaving her in a bad place. (2) The user makes no checksum selection at all, gets checksums of the *default* type, determines it is too slow for her purposes, and instead of adjusting the checksum algorithm to something faster, simply turns checksums off; corruption happens and of course is undetected, leaving her in a bad place. I think the risk of (2) is far worse, which makes me tend towards a default that is fast enough not to encourage anybody to disable checksums altogether. I have no opinion about which algorithm is best suit
Re: backup manifests
On 3/26/20 11:37 AM, Robert Haas wrote: On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost wrote: > This is where I feel like I'm trying to make decisions in a vacuum. If we had a few more people weighing in on the thread on this point, I'd be happy to go with whatever the consensus was. If most people think having both --no-manifest (suppressing the manifest completely) and --manifest-checksums=none (suppressing only the checksums) is useless and confusing, then sure, let's rip the latter one out. If most people like the flexibility, let's keep it: it's already implemented and tested. But I hate to base the decision on what one or two people think. I'm not sure I see a lot of value to being able to build a manifest with no checksums, especially if overhead for the default checksum algorithm is negligible. However, I'd still prefer that the default be something more robust and allow users to tune it down rather than the other way around. But I've made that pretty clear up-thread and I consider that argument lost at this point. As for folks who are that close to the edge on their backup timing that they can't have it slow down- chances are pretty darn good that they're not far from ending up needing to find a better solution than pg_basebackup anyway. Or they don't need to generate a manifest (or, I suppose, they could have one but not have checksums..). 40-50% is a lot more than "if you were on the edge." For the record I think this is a very misleading number. Sure, if you are doing your backup to a local SSD on a powerful development laptop it makes sense. But backups are generally placed on slower storage, remotely, with compression. Even without compression the first two are going to bring this percentage down by a lot. When you get to page-level incremental backups, which is where this all started, I'd still recommend using a stronger checksum algorithm to verify that the file was reconstructed correctly on restore. That much I believe we have agreed on. 
Even pg_basebackup (in both fetch and stream modes...) checks that we at least got all the WAL that's needed for the backup from the server before considering the backup to be valid and telling the user that there was a successful backup. With what you're proposing here, we could have someone do a pg_basebackup, get back an ERROR saying the backup wasn't valid, and then run pg_validatebackup and be told that the backup is valid. I don't get how that's sensible. I'm sorry that you can't see how that's sensible, but it doesn't mean that it isn't sensible. It is totally unrealistic to expect that any backup verification tool can verify that you won't get an error when trying to use the backup. That would require that the validation tool try to do everything that PostgreSQL will try to do when the backup is used, including running recovery and updating the data files. Anything less than that creates a real possibility that the backup will verify good but fail when used. This tool has a much narrower purpose, which is to try to verify that we (still) have the files the server sent as part of the backup and that, to the best of our ability to detect such things, they have not been modified. As you know, or should know, the WAL files are not sent as part of the backup, and so are not verified. Other things that would also be useful to check are also not verified. It would be fantastic to have more verification tools in the future, but it is difficult to see why anyone would bother trying if an attempt to get the first one committed gets blocked because it does not yet do everything. Very few patches try to do everything, and those that do usually get blocked because, by trying to do too much, they get some of it badly wrong. I agree with Stephen that this should be done, but I agree with you that it can wait for a future commit. However, I do think: 1) It should be called out rather plainly in the documentation. 
2) If there are files in pg_wal then pg_validatebackup should inform the user that those files have not been validated. I know you and Stephen have agreed on a number of doc changes, would it be possible to get a new patch with those included? I finally have time to do a review of this tomorrow. I saw some mistakes in the docs in the current patch but I know those patches are not current. Regards, -- -David da...@pgmasters.net
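[Editor's note: the "narrower purpose" described in this exchange, re-hashing the files named in a manifest and comparing digests, can be sketched as follows. This is a simplified stand-in, not pg_validatebackup's actual format or logic; the manifest layout here (a plain dict of relative path to sha256 digest) is hypothetical.]

```python
import hashlib
import os
import tempfile

def build_manifest(backup_dir):
    """Record a sha256 digest for every file under backup_dir."""
    manifest = {}
    for root, _, files in os.walk(backup_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, backup_dir)
            with open(path, "rb") as f:
                manifest[rel] = hashlib.sha256(f.read()).hexdigest()
    return manifest

def verify_backup(backup_dir, manifest):
    """Report files that are missing or whose contents changed."""
    problems = []
    for rel, digest in manifest.items():
        path = os.path.join(backup_dir, rel)
        if not os.path.exists(path):
            problems.append(f"missing: {rel}")
            continue
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                problems.append(f"checksum mismatch: {rel}")
    return problems

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "base.dat"), "wb") as f:
        f.write(b"page contents")
    manifest = build_manifest(d)
    clean = verify_backup(d, manifest)          # [] for an untouched backup
    with open(os.path.join(d, "base.dat"), "wb") as f:
        f.write(b"page contentz")               # simulate bit rot
    problems = verify_backup(d, manifest)
    print(problems)                             # ['checksum mismatch: base.dat']
```

As the discussion notes, a checker of this shape can only report on files the manifest covers; WAL in pg_wal, or anything else outside the manifest, is invisible to it.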
Re: backup manifests
Greetings, * Mark Dilger (mark.dil...@enterprisedb.com) wrote: > > On Mar 26, 2020, at 9:34 AM, Stephen Frost wrote: > > I'm not actually arguing about which hash functions we should support, > > but rather what the default is and if crc32c, specifically, is actually > > a reasonable choice. Just because it's fast and we already had an > > implementation of it doesn't justify its use as the default. Given that > > it doesn't actually provide the check that is generally expected of > > CRC checksums (100% detection of single-bit errors) when the file size > > gets over 512MB makes me wonder if we should have it at all, yes, but it > > definitely makes me think it shouldn't be our default. > > I don't understand your focus on the single-bit error issue. Maybe I'm wrong, but my understanding was that detecting single-bit errors was one of the primary design goals of CRC and why people talk about CRCs of certain sizes having 'limits'- that's the size at which single-bit errors will no longer, necessarily, be picked up and therefore that's where the CRC of that size starts falling down on that goal. > If you are sending your backup across the wire, single bit errors during > transmission should already be detected as part of the networking protocol. > The real issue has to be detection of the kinds of errors or modifications > that are most likely to happen in practice. Which are those? People > manually mucking with the files? Bugs in backup scripts? Corruption on the > storage device? Truncated files? The more bits in the checksum (assuming a > well designed checksum algorithm), the more likely we are to detect > accidental modification, so it is no surprise if a 64-bit crc does better > than 32-bit crc. But that logic can be taken arbitrarily far. I don't see > the connection between, on the one hand, an analysis of single-bit error > detection against file size, and on the other hand, the verification of > backups. 
We'd like something that does a good job at detecting any differences between when the file was copied off of the server and when the command is run- potentially weeks or months later. I would expect most issues to end up being storage-level corruption over time where the backup is stored, which could be single bit flips or whole pages getting zeroed or various other things. Files changing size probably is one of the less common things, but, sure, that too. That we could take this "arbitrarily far" is actually entirely fine- that's a good reason to have alternatives, which this patch does have, but that doesn't mean we should have a default that's not suitable for the files that we know we're going to be storing. Consider that we could have used a 16-bit CRC instead, but does that actually make sense? Ok, sure, maybe someone really wants something super fast- but should that be our default? If not, then what criteria should we use for the default? > From a support perspective, I think the much more important issue is making > certain that checksums are turned on. A one in a billion chance of missing > an error seems pretty acceptable compared to the, let's say, one in two > chance that your customer didn't use checksums. Why are we even allowing > this to be turned off? Is there a usage case compelling that option? The argument is that adding checksums takes more time. I can understand that argument, though I don't really agree with it. Certainly a few percent really shouldn't be that big of an issue, and in many cases even a sha256 hash isn't going to have that dramatic of an impact on the actual overall time. Thanks, Stephen
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-03-26 21:01, Justin Pryzby wrote:

@@ -262,7 +280,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
  * and error messages should refer to the operation as VACUUM not CLUSTER.
  */
 void
-cluster_rel(Oid tableOid, Oid indexOid, int options)
+cluster_rel(Oid tableOid, Oid indexOid, Oid tablespaceOid, int options)

Add a comment here about the tablespaceOid parameter, like the other functions where it's added. The permission checking is kind of duplicative, so I'd suggest to factor it out. Ideally we'd only have one place that checks for pg_global/system/mapped. It needs to check that it's not a system relation, or that system_table_mods are allowed, and in any case that if it's a mapped rel, that it's not being moved. I would pass a boolean indicating if the tablespace is being changed.

Yes, but I wanted to make sure first that all necessary validations are there, so as not to miss something as I did last time. I do not like repetitive code either, so I would like to introduce a more common check after reviewing the code as a whole.

Another issue is this:

+VACUUM ( FULL [, ...] ) [ TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]

As you mentioned in your v1 patch, in the other cases, "tablespace [tablespace]" is added at the end of the command rather than in the middle. I wasn't able to make that work, maybe because "tablespace" isn't a fully reserved word (?). I didn't try with "SET TABLESPACE", although I understand it'd be better without "SET".

Initially I tried "SET TABLESPACE", but also failed to completely get rid of shift/reduce conflicts. I will try to rewrite VACUUM's part again with OptTableSpace. Maybe I will manage it this time. I will take into account all your text edits as well.

Thanks -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: plan cache overhead on plpgsql expression
Andres Freund writes:
> On 2020-03-26 14:37:59 -0400, Tom Lane wrote:
>> + * This function, together with CachedPlanIsSimplyValid, provides a fast path
>> + * for revalidating "simple" generic plans. The core requirement to be simple
>> + * is that the plan must not require taking any locks, which translates to
>> + * not touching any tables; this happens to match up well with an important
>> + * use-case in PL/pgSQL.

> Hm - is there currently sufficient guarantee that we absorb sinval
> messages? Would still matter for types, functions, etc?

There are potentially issues of that sort throughout the backend, not just here, since we don't have any locking on types or functions. I don't think it's this patch's job to address that. In practice I think we've thought about it and concluded that the cost/benefit of introducing such locks just isn't promising:

* Generally speaking you can't do anything very interesting to a type anyway, at least not with supported DDL. The worst-case situation that could materialize AFAIK is possibly evaluating slightly-stale constraints for a domain. (The typcache does have sinval invalidation for those constraints, but I don't recall offhand how much we guarantee about how quickly we'll notice updates.)

* For functions, you might execute a somewhat stale version of the function body. The bad effects there are pretty limited since a function is defined by just one catalog row, unlike tables, so you can't see a self-inconsistent version of it.

The amount of lock overhead that it would take to remove those edge cases seems slightly staggering, so I doubt we'd ever do it.

> While it'd do a small bit unnecessary work, I do wonder if it'd be
> better to use this code in ResourceOwnerReleaseInternal().

When and if we refactor to expose this sort of thing more generally, it might be worth doing it like that. I don't think it helps much right now.

> Perhaps add a reference to the new (appreciated, btw) DO comment above?

Can do. 
Again, thanks for reviewing! regards, tom lane
Re: Berserk Autovacuum (let's save next Mandrill)
Hi, On 2020-03-26 10:12:39 +0100, Laurenz Albe wrote: > On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote: > I am reluctant to introduce new semantics like a reloption value of -2 > to disable a feature in this patch right before feature freeze. > > I believe there are enough options to disable insert-only vacuuming for > an individual table: > - Set the threshold to 2147483647. True, that will not work for very > large tables, but I think that there are few tables that insert that > many rows before they hit autovacuum_freeze_max_age anyway. > > - Set the scale factor to some astronomical value. Meh. You *are* adding new semantics with these. And they're terrible. Greetings, Andres Freund
Re: plan cache overhead on plpgsql expression
Hi, On 2020-03-26 14:37:59 -0400, Tom Lane wrote: > I wrote: > > I had a thought about a possibly-cleaner way to do this. We could invent > > a resowner function, say ResourceOwnerReleaseAllPlanCacheRefs, that > > explicitly releases all plancache pins it knows about. So plpgsql > > would not call the regular ResourceOwnerRelease entry point at all, > > but just call that and then ResourceOwnerDelete, again relying on the > > assertions therein to catch anything not released. > > Here's a version that does it like that. This does seem marginally > nicer than the other way. I have a feeling that at some point we'll > want to expose resowners' contents more generally, but I'm not quite > sure what the use-cases will be, so I don't want to design that now. Yea, agreed with all of what you said in that paragraph. > Testing that reminded me of the other regression test failure I'd seen > when I first tried to do it: select_parallel.sql shows a WARNING about > a plancache leak in a parallel worker process. When I looked into the > reason for that, it turned out that some cowboy has split > XACT_EVENT_COMMIT into XACT_EVENT_COMMIT and > XACT_EVENT_PARALLEL_COMMIT (where the latter is used in parallel > workers) without bothering to fix the collateral damage to plpgsql. > So plpgsql_xact_cb isn't doing any cleanup in parallel workers, and > hasn't been for a couple of releases. Ugh. > The bad effects of that are probably limited given that the worker > process will exit after committing, but I still think that that part > of this patch is a bug fix that needs to be back-patched. Ugh. Lucky that we don't register many things inside those resowners. > (Just > looking at what FreeExecutorState does, I wonder whether > jit_release_context has any side-effects that are visible outside the > process? But I bet I can make a test case that shows issues even > without JIT, based on the failure to call ExprContext shutdown > callbacks.) 
JIT doesn't currently have side-effects outside of the process. I really want to add caching support, which'd presumably have problems due to this, but it's not there yet... This could lead to leaking a fair bit of memory over time otherwise.

> /*
> + * CachedPlanAllowsSimpleValidityCheck: can we use CachedPlanIsSimplyValid?
> + *
> + * This function, together with CachedPlanIsSimplyValid, provides a fast path
> + * for revalidating "simple" generic plans. The core requirement to be simple
> + * is that the plan must not require taking any locks, which translates to
> + * not touching any tables; this happens to match up well with an important
> + * use-case in PL/pgSQL.

Hm - is there currently sufficient guarantee that we absorb sinval messages? Would still matter for types, functions, etc?

> /*
> + * ResourceOwnerReleaseAllPlanCacheRefs
> + *		Release the plancache references (only) held by this owner.
> + *
> + * We might eventually add similar functions for other resource types,
> + * but for now, only this is needed.
> + */
> +void
> +ResourceOwnerReleaseAllPlanCacheRefs(ResourceOwner owner)
> +{
> +	ResourceOwner save;
> +	Datum		foundres;
> +
> +	save = CurrentResourceOwner;
> +	CurrentResourceOwner = owner;
> +	while (ResourceArrayGetAny(&(owner->planrefarr), &foundres))
> +	{
> +		CachedPlan *res = (CachedPlan *) DatumGetPointer(foundres);
> +
> +		ReleaseCachedPlan(res, true);
> +	}
> +	CurrentResourceOwner = save;
> +}

While it'd do a small bit of unnecessary work, I do wonder if it'd be better to use this code in ResourceOwnerReleaseInternal().

> --- a/src/pl/plpgsql/src/pl_exec.c
> +++ b/src/pl/plpgsql/src/pl_exec.c
> @@ -84,6 +84,13 @@ typedef struct
>   * has its own simple-expression EState, which is cleaned up at exit from
>   * plpgsql_inline_handler(). DO blocks still use the simple_econtext_stack,
>   * though, so that subxact abort cleanup does the right thing.
> + *
> + * (However, if a DO block executes COMMIT or ROLLBACK, then exec_stmt_commit
> + * or exec_stmt_rollback will unlink it from the DO's simple-expression EState
> + * and create a new shared EState that will be used thenceforth. The original
> + * EState will be cleaned up when we get back to plpgsql_inline_handler. This
> + * is a bit ugly, but it isn't worth doing better, since scenarios like this
> + * can't result in indefinite accumulation of state trees.)
> + */
>  typedef struct SimpleEcontextStackEntry
>  {
> @@ -96,6 +103,15 @@ static EState *shared_simple_eval_estate = NULL;
>  static SimpleEcontextStackEntry *simple_econtext_stack = NULL;
>
>  /*
> + * In addition to the shared simple-eval EState, we have a shared resource
> + * owner that holds refcounts on the CachedPlans for any "simple" expressions
> + * we have evaluated in the current transaction. This allows us to avoid
> + * continually grabbing and releasing a plan refcount when a simple expression
> + * is used over
Re: proposal \gcsv
On 2020-03-26 18:49, Pavel Stehule wrote:

Hi

[psql-gfmt.patch]

This seems useful and works well; I haven't found any errors. Well done. However, I have a suggestion that is perhaps slightly outside of this patch but functionally so close that maybe we can discuss it here. When you try to get a tab-separated output via this new \gfmt in a one-liner you're still forced to use

\pset csv_fieldsep '\t'

Would it be possible to do one of the following to enable a more compact one-liner syntax:

1. add an option: \gfmt tsv --> use a TAB instead of a comma in the csv

or

2. let the psql command-line option '--csv' honour the value given by psql -F/--field-separator (it does not do so now)

or

3. add a psql command-line option: --csv-field-separator

Any of these three (I'd prefer the first) would make producing a tsv in shell one-liners with psql easier/more compact.

Thanks, Erik Rijkers
Re: plan cache overhead on plpgsql expression
I wrote: > I had a thought about a possibly-cleaner way to do this. We could invent > a resowner function, say ResourceOwnerReleaseAllPlanCacheRefs, that > explicitly releases all plancache pins it knows about. So plpgsql > would not call the regular ResourceOwnerRelease entry point at all, > but just call that and then ResourceOwnerDelete, again relying on the > assertions therein to catch anything not released. Here's a version that does it like that. This does seem marginally nicer than the other way. I have a feeling that at some point we'll want to expose resowners' contents more generally, but I'm not quite sure what the use-cases will be, so I don't want to design that now. Also, I studied the question of DO blocks' eval_estate + resowner more carefully, and eventually concluded that the way it's being done is okay --- it doesn't leak memory, as I'd first suspected. But it's surely underdocumented, so I added some comments about it. I also concluded as part of that study that it's probably best if we *do* make the resowner parentage different in the two cases after all. So this has the "shared" resowner as a child of TopTransactionResourceOwner after all (which means we don't need to delete it explicitly), while a DO block's private resowner is standalone (so it needs an explicit deletion). Testing that reminded me of the other regression test failure I'd seen when I first tried to do it: select_parallel.sql shows a WARNING about a plancache leak in a parallel worker process. When I looked into the reason for that, it turned out that some cowboy has split XACT_EVENT_COMMIT into XACT_EVENT_COMMIT and XACT_EVENT_PARALLEL_COMMIT (where the latter is used in parallel workers) without bothering to fix the collateral damage to plpgsql. So plpgsql_xact_cb isn't doing any cleanup in parallel workers, and hasn't been for a couple of releases. 
The bad effects of that are probably limited given that the worker process will exit after committing, but I still think that that part of this patch is a bug fix that needs to be back-patched. (Just looking at what FreeExecutorState does, I wonder whether jit_release_context has any side-effects that are visible outside the process? But I bet I can make a test case that shows issues even without JIT, based on the failure to call ExprContext shutdown callbacks.) Anyway, I think this is committable now. regards, tom lane diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile index 70a9c34..193df8a 100644 --- a/src/pl/plpgsql/src/Makefile +++ b/src/pl/plpgsql/src/Makefile @@ -33,8 +33,8 @@ DATA = plpgsql.control plpgsql--1.0.sql REGRESS_OPTS = --dbname=$(PL_TESTDB) REGRESS = plpgsql_call plpgsql_control plpgsql_copy plpgsql_domain \ - plpgsql_record plpgsql_cache plpgsql_transaction plpgsql_trap \ - plpgsql_trigger plpgsql_varprops + plpgsql_record plpgsql_cache plpgsql_simple plpgsql_transaction \ + plpgsql_trap plpgsql_trigger plpgsql_varprops # where to find gen_keywordlist.pl and subsidiary files TOOLSDIR = $(top_srcdir)/src/tools diff --git a/src/pl/plpgsql/src/expected/plpgsql_simple.out b/src/pl/plpgsql/src/expected/plpgsql_simple.out new file mode 100644 index 000..5a2fefa --- /dev/null +++ b/src/pl/plpgsql/src/expected/plpgsql_simple.out @@ -0,0 +1,68 @@ +-- +-- Tests for plpgsql's handling of "simple" expressions +-- +-- Check that changes to an inline-able function are handled correctly +create function simplesql(int) returns int language sql +as 'select $1'; +create function simplecaller() returns int language plpgsql +as $$ +declare + sum int := 0; +begin + for n in 1..10 loop +sum := sum + simplesql(n); +if n = 5 then + create or replace function simplesql(int) returns int language sql + as 'select $1 + 100'; +end if; + end loop; + return sum; +end$$; +select simplecaller(); + simplecaller +-- + 555 +(1 row) + +-- Check that changes in 
search path are dealt with correctly +create schema simple1; +create function simple1.simpletarget(int) returns int language plpgsql +as $$begin return $1; end$$; +create function simpletarget(int) returns int language plpgsql +as $$begin return $1 + 100; end$$; +create or replace function simplecaller() returns int language plpgsql +as $$ +declare + sum int := 0; +begin + for n in 1..10 loop +sum := sum + simpletarget(n); +if n = 5 then + set local search_path = 'simple1'; +end if; + end loop; + return sum; +end$$; +select simplecaller(); + simplecaller +-- + 555 +(1 row) + +-- try it with non-volatile functions, too +alter function simple1.simpletarget(int) immutable; +alter function simpletarget(int) immutable; +select simplecaller(); + simplecaller +-- + 555 +(1 row) + +-- make sure flushing local caches changes nothing +\c - +select simplecaller(); + simplecaller +-- + 555 +(1 row) + diff --git a/src/pl/plpgsql/src/sql/plpg
Re: backup manifests
On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost wrote: > I do agree with excluding things like md5 and others that aren't good > options. I wasn't saying we should necessarily exclude crc32c either.. > but rather saying that it shouldn't be the default. > > Here's another way to look at it- where do we use crc32c today, and how > much data might we possibly be covering with that crc? WAL record size is a 32-bit unsigned integer, so in theory, up to 4GB minus 1 byte. In practice, most of them are not more than a few hundred bytes, but the amount we might possibly be covering is a lot more. > Why was crc32c > picked for that purpose? Because it was discovered that 64-bit CRC was too slow, per commit 21fda22ec46deb7734f793ef4d7fa6c226b4c78e. > If the individual who decided to pick crc32c > for that case was contemplating a checksum for up-to-1GB files, would > they have picked crc32c? Seems unlikely to me. It's hard to be sure what someone who isn't us would have done in some situation that they didn't face, but we do have the discussion thread: https://www.postgresql.org/message-id/flat/9291.1117593389%40sss.pgh.pa.us#c4e413bbf3d7fbeced7786da1c3aca9c The question of how much data is protected by the CRC was discussed, mostly in the first few messages, in general terms, but it doesn't seem to have covered the question very thoroughly. I'm sure we could each draw things from that discussion that support our view of the situation, but I'm not sure it would be very productive. What confuses me is that you seem to have a view of the upsides and downsides of these various algorithms that seems to me to be highly skewed. Like, suppose we change the default from CRC-32C to SHA-something. On the upside, the error detection rate will increase from 99.999+% to something much closer to 100%. On the downside, backups will get as much as 40-50% slower for some users. I hope we can agree that both detecting errors and taking backups quickly are important.
However, it is hard for me to imagine that the typical user would want to pay even a 5-10% performance penalty when taking a backup in order to improve an error detection feature which they may not even use and which already has less than a one-in-a-billion chance of going wrong. We routinely reject features for causing, say, a 2% regression on general workloads. Base backup speed is probably less important than how many SELECT or INSERT queries you can pump through the system in a second, but it's still a pain point for lots of people. I think if you said to some users "hey, would you like to have error detection for your backups? it'll cost 10%" many people would say "yes, please." But I think if you went to the same users and said "hey, would you like to make the error detection for your backups better? it currently has a less than 1-in-a-billion chance of failing to detect random corruption, and you can reduce that by many orders of magnitude for an extra 10% on your backup time," I think the results would be much more mixed. Some people would like it, but certainly not everybody. > I'm not actually arguing about which hash functions we should support, > but rather what the default is and if crc32c, specifically, is actually > a reasonable choice. Just because it's fast and we already had an > implementation of it doesn't justify its use as the default. Given that > it doesn't actually provide the check that is generally expected of > CRC checksums (100% detection of single-bit errors) when the file size > gets over 512MB makes me wonder if we should have it at all, yes, but it > definitely makes me think it shouldn't be our default. I mean, the property that I care about is the one where it detects better than 999,999,999 errors out of every 1,000,000,000, regardless of input length. > I don't agree with limiting our view to only those algorithms that we've > already got implemented in PG.
I mean, opening that giant can of worms ~2 weeks before feature freeze is not very nice. This patch has been around for months, and the algorithms were openly discussed a long time ago. I checked and found out that the CRC-64 code was nuked in commit 404bc51cde9dce1c674abe4695635612f08fe27e, so in theory we could revert that, but how much confidence do we have that the code in question actually did the right thing, or that it's actually fast? An awful lot of work has been done on the CRC-32C code over the years, including several rounds of speeding it up (f044d71e331d77a0039cec0a11859b5a3c72bc95, 3dc2d62d0486325bf263655c2d9a96aee0b02abe) and one round of fixing it because it was producing completely wrong answers (5028f22f6eb0579890689655285a4778b4ffc460), so I don't have a lot of confidence about that CRC-64 code being totally without problems. The commit message for that last commit, 5028f22f6eb0579890689655285a4778b4ffc460, seems pretty relevant in this context, too. It observes that, because it "does not correspond to any bit-wise CRC calculation" it is
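For readers following the performance side of this argument, the raw cost gap between a CRC and a cryptographic hash is easy to reproduce. A rough sketch using Python's standard library (note that zlib.crc32 is the plain CRC-32 polynomial, not CRC-32C, and neither function here is pg_basebackup's actual implementation, so only the rough ratio is meaningful):

```python
import hashlib
import time
import zlib

buf = b"\x5a" * (16 * 1024 * 1024)  # 16 MB of dummy "backup" data

t0 = time.perf_counter()
crc = zlib.crc32(buf)                      # CRC pass over the whole buffer
t_crc = time.perf_counter() - t0

t0 = time.perf_counter()
digest = hashlib.sha256(buf).hexdigest()   # SHA-256 pass over the same data
t_sha = time.perf_counter() - t0

print(f"CRC-32:  {crc:#010x}  ({t_crc * 1000:.1f} ms)")
print(f"SHA-256: {digest[:16]}...  ({t_sha * 1000:.1f} ms)")
```

On typical hardware the SHA-256 pass is noticeably slower per byte than the CRC pass (the exact ratio depends on the CPU and the builds of zlib/OpenSSL), which is the origin of the "backups get measurably slower" concern above.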
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
> I included your new solution regarding this part from 0004 into 0001. It > seems that at least a tip of the problem was in that we tried to change > tablespace to pg_default being already there. Right, causing it to try to drop that filenode twice. > +++ b/doc/src/sgml/ref/cluster.sgml > + The name of a specific tablespace to store clustered relations. Could you phrase these like you did in the comments: "the name of the tablespace where the clustered relation is to be rebuilt." > +++ b/doc/src/sgml/ref/reindex.sgml > + The name of a specific tablespace to store rebuilt indexes. "The name of a tablespace where indexes will be rebuilt" > +++ b/doc/src/sgml/ref/vacuum.sgml > + The name of a specific tablespace to write a new copy of the table. > + This specifies a tablespace, where all rebuilt indexes will be created. say "specifies the tablespace where", with no comma. > + else if (!OidIsValid(classtuple->relfilenode)) > + { > + /* > + * Skip all mapped relations. > + * relfilenode == 0 checks after that, > similarly to > + * RelationIsMapped(). I would say "OidIsValid(relfilenode) checks for that, ..." > @@ -262,7 +280,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel) > * and error messages should refer to the operation as VACUUM not CLUSTER. > */ > void > -cluster_rel(Oid tableOid, Oid indexOid, int options) > +cluster_rel(Oid tableOid, Oid indexOid, Oid tablespaceOid, int options) Add a comment here about the tablespaceOid parameter, like the other functions where it's added. The permission checking is kind of duplicative, so I'd suggest factoring it out. Ideally we'd only have one place that checks for pg_global/system/mapped. It needs to check that it's not a system relation, or that system_table_mods are allowed, and in any case that if it's a mapped rel, that it's not being moved. I would pass a boolean indicating if the tablespace is being changed. Another issue is this: > +VACUUM ( FULL [, ...]
) [ TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ] As you mentioned in your v1 patch, in the other cases, "tablespace [tablespace]" is added at the end of the command rather than in the middle. I wasn't able to make that work, maybe because "tablespace" isn't a fully reserved word (?). I didn't try with "SET TABLESPACE", although I understand it'd be better without "SET". -- Justin
Re: proposal \gcsv
čt 26. 3. 2020 v 18:55 odesílatel Vik Fearing napsal: > On 3/26/20 10:49 AM, Pavel Stehule wrote: > > Hi > > > > čt 26. 3. 2020 v 17:45 odesílatel Vik Fearing > > napsal: > > > >> After some opinions on the first issue and fixing the second, I think > >> this is good to be committed. > >> > > > > Thank you for review > > This patch now looks good to me. Marking as Ready for Committer. > Thank you very much Pavel -- > Vik Fearing >
Re: proposal \gcsv
On 3/26/20 10:49 AM, Pavel Stehule wrote: > Hi > > čt 26. 3. 2020 v 17:45 odesílatel Vik Fearing > napsal: > >> After some opinions on the first issue and fixing the second, I think >> this is good to be committed. >> > > Thank you for review This patch now looks good to me. Marking as Ready for Committer. -- Vik Fearing
Re: proposal \gcsv
Hi čt 26. 3. 2020 v 17:45 odesílatel Vik Fearing napsal: > On 3/24/20 3:02 AM, Pavel Stehule wrote: > > Hi > > > > rebase > > Thank you, Pavel. > > I have now had time to review it, and it looks good to me except for two > issues. > > The first is, even though I suggested gf, I think it should actually be > gfmt. There may be something else in the future that starts with f and > we shouldn't close ourselves off to it. > renamed to \gfmt > The second is tab completion doesn't work for the second argument. > Adding the following fixes that: > > diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c > index ed6945a7f12..9d8cf442972 100644 > --- a/src/bin/psql/tab-complete.c > +++ b/src/bin/psql/tab-complete.c > @@ -3786,6 +3786,12 @@ psql_completion(const char *text, int start, int > end) > COMPLETE_WITH_CS("aligned", "asciidoc", "csv", "html", > "latex", > "latex-longtable", > "troff-ms", "unaligned", > "wrapped"); > + else if (TailMatchesCS("\\gf", MatchAny)) > + { > + completion_charp = "\\"; > + completion_force_quote = false; > + matches = rl_completion_matches(text, complete_from_files); > + } > > else if (TailMatchesCS("\\h|\\help")) > COMPLETE_WITH_LIST(sql_commands); > > merged > After some opinions on the first issue and fixing the second, I think > this is good to be committed. > Thank you for review Pavel -- > Vik Fearing > diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 402c4b1b26..ba480c7bb6 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -2199,6 +2199,16 @@ CREATE INDEX + +\gfmt format [ filename ] +\gfmt format [ |command ] + + +\gfmt is equivalent to \g, but +forces specified format. See \pset format. 
+ + + \gset [ prefix ] diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index abb18a19c2..d724a5eb16 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -163,6 +163,8 @@ static void printGSSInfo(void); static bool printPsetInfo(const char *param, struct printQueryOpt *popt); static char *pset_value_string(const char *param, struct printQueryOpt *popt); +static bool format_number(const char *value, int vallen, enum printFormat *numformat); + #ifdef WIN32 static void checkWin32Codepage(void); #endif @@ -333,7 +335,8 @@ exec_command(const char *cmd, status = exec_command_errverbose(scan_state, active_branch); else if (strcmp(cmd, "f") == 0) status = exec_command_f(scan_state, active_branch); - else if (strcmp(cmd, "g") == 0 || strcmp(cmd, "gx") == 0) + else if (strcmp(cmd, "g") == 0 || strcmp(cmd, "gx") == 0 || + strcmp(cmd, "gfmt") == 0) status = exec_command_g(scan_state, active_branch, cmd); else if (strcmp(cmd, "gdesc") == 0) status = exec_command_gdesc(scan_state, active_branch); @@ -1282,6 +1285,8 @@ exec_command_f(PsqlScanState scan_state, bool active_branch) /* * \g [filename] -- send query, optionally with output to file/pipe * \gx [filename] -- same as \g, with expanded mode forced + * \gfmt format [filename] -- send result in specified format to + * file/pipe. 
*/ static backslashResult exec_command_g(PsqlScanState scan_state, bool active_branch, const char *cmd) @@ -1290,7 +1295,42 @@ exec_command_g(PsqlScanState scan_state, bool active_branch, const char *cmd) if (active_branch) { - char *fname = psql_scan_slash_option(scan_state, + char *fname; + + if (strcmp(cmd, "gfmt") == 0) + { + char *fmtname = psql_scan_slash_option(scan_state, + OT_NORMAL, NULL, false); + + if (!fmtname) + { +pg_log_error("no format name"); + +return PSQL_CMD_ERROR; + } + else + { +enum printFormat format; +bool result; + +result = format_number(fmtname, strlen(fmtname), &format); + +free(fmtname); + +if (result) +{ + pset.gfmt_format = format; +} +else +{ + pg_log_error("\\pset: allowed formats are aligned, asciidoc, csv, html, latex, latex-longtable, troff-ms, unaligned, wrapped"); + + return PSQL_CMD_ERROR; +} + } + } + + fname = psql_scan_slash_option(scan_state, OT_FILEPIPE, NULL, false); if (!fname) @@ -3777,6 +3817,68 @@ _unicode_linestyle2string(int linestyle) return "unknown"; } +/* + * Returns true if format name was recognized. + */ +static bool +format_number(const char *value, int vallen, enum printFormat *numformat) +{ + static const struct fmt + { + const char *name; + enum printFormat number; + } formats[] = + { + /* remember to update error message below when adding more */ + {"aligned", PRINT_ALIGNED}, + {"asciidoc", PRINT_ASCIIDOC}, + {"csv", PRINT_CSV}, + {"html", PRINT_HTML}, + {
Re: backup manifests
> On Mar 26, 2020, at 9:34 AM, Stephen Frost wrote: > > I'm not actually arguing about which hash functions we should support, > but rather what the default is and if crc32c, specifically, is actually > a reasonable choice. Just because it's fast and we already had an > implementation of it doesn't justify its use as the default. Given that > it doesn't actually provide the check that is generally expected of > CRC checksums (100% detection of single-bit errors) when the file size > gets over 512MB makes me wonder if we should have it at all, yes, but it > definitely makes me think it shouldn't be our default. I don't understand your focus on the single-bit error issue. If you are sending your backup across the wire, single-bit errors during transmission should already be detected as part of the networking protocol. The real issue has to be detection of the kinds of errors or modifications that are most likely to happen in practice. Which are those? People manually mucking with the files? Bugs in backup scripts? Corruption on the storage device? Truncated files? The more bits in the checksum (assuming a well-designed checksum algorithm), the more likely we are to detect accidental modification, so it is no surprise if a 64-bit crc does better than a 32-bit crc. But that logic can be taken arbitrarily far. I don't see the connection between, on the one hand, an analysis of single-bit error detection against file size, and on the other hand, the verification of backups. From a support perspective, I think the much more important issue is making certain that checksums are turned on. A one-in-a-billion chance of missing an error seems pretty acceptable compared to the, let's say, one-in-two chance that your customer didn't use checksums. Why are we even allowing this to be turned off? Is there a usage case compelling that option? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
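The "one in a billion" figures being traded in this thread follow directly from the checksum width: for a well-mixed n-bit checksum, a random corruption happens to collide with the stored value with probability about 2^-n, independent of file size. A quick sanity check of that arithmetic:

```python
def undetected_probability(bits: int) -> float:
    """Chance that a random corruption leaves an n-bit checksum unchanged,
    assuming the checksum behaves like a uniform hash of its input."""
    return 2.0 ** -bits

for bits in (16, 32, 64):
    odds = 1.0 / undetected_probability(bits)
    print(f"{bits}-bit checksum: ~1 in {odds:,.0f} random corruptions go unnoticed")

# The 32-bit case is the "less than one in a billion" figure: 2**-32 is
# about 2.3e-10.
assert undetected_probability(32) < 1e-9
```

This is the uniform-hash idealization; the CRC-specific guarantees discussed above (e.g. detecting every single-bit error below a length bound) are extra properties on top of it, not reflected in this number.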
Re: plan cache overhead on plpgsql expression
Amit Langote writes: > One thing -- I don't get the division between > CachedPlanAllowsSimpleValidityCheck() and CachedPlanIsSimplyValid(). > Maybe I am missing something, but could there not be just one > function, possibly using whether expr_simple_expr is set or not to > skip or do, resp., the checks that the former does? Well, we don't want to do the initial checks over again every time; we want the is-valid test to be as simple and fast as we can make it. I suppose we could have one function with a boolean flag saying "this is a recheck", but I don't find that idea to be any better than the way it is. Also, although the existing structure in plpgsql always calls CachedPlanIsSimplyValid immediately after a successful call to CachedPlanAllowsSimpleValidityCheck, I don't think that's necessarily going to be true for other potential users of the functions. So merging the functions would reduce flexibility. regards, tom lane
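For readers less familiar with the plancache, the split Tom defends is a common two-phase caching pattern: pay an expensive qualification check once, then guard each use with a recheck that is as close to a single comparison as possible. A toy sketch of the shape of the API (invented names and a fake "complexity" criterion; this is not plpgsql's actual C code):

```python
class CachedPlan:
    """Toy stand-in for a plancache entry; not PostgreSQL's real structure."""
    def __init__(self, source: str):
        self.source = source
        self.generation = 0  # bumped by (simulated) invalidation events

class SimpleExpr:
    """Caller-side state for one 'simple expression'."""
    def __init__(self, plan: CachedPlan):
        self.plan = plan
        self._valid_generation = None  # full check not yet passed

    def allows_simple_check(self) -> bool:
        # Expensive, one-time qualification (loosely analogous to
        # CachedPlanAllowsSimpleValidityCheck): run the heavyweight tests,
        # then remember which plan generation they were run against.
        if "join" in self.plan.source:  # toy stand-in for "too complex"
            return False
        self._valid_generation = self.plan.generation
        return True

    def is_simply_valid(self) -> bool:
        # Cheap, per-execution recheck (loosely analogous to
        # CachedPlanIsSimplyValid): a single comparison, no re-qualification.
        return self._valid_generation == self.plan.generation

expr = SimpleExpr(CachedPlan("select 1 + 1"))
assert expr.allows_simple_check()   # pay the full check once
assert expr.is_simply_valid()       # then recheck cheaply on every use
expr.plan.generation += 1           # simulate a plan invalidation
assert not expr.is_simply_valid()   # forces the expensive path to be redone
```

Merging the two phases into one function would force either the full check on every execution or a boolean "is this a recheck" flag, which is exactly the trade-off discussed above.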
Re: A bug when use get_bit() function for a long bytea string
On Thu, 26 Mar 2020 at 19:39, Tom Lane wrote: > Ashutosh Bapat writes: > > On Wed, 18 Mar 2020 at 08:18, movead...@highgo.ca > > wrote: > >> if we change return type of all those functions to int64, we won't need > >> this cast. > >> I change the 'encode' function, it needs an int64 return type, but keep > >> other > >> functions in 'pg_encoding', because I think it of no necessary reason. > > > Ok, let's leave it for a committer to decide. > > If I'm grasping the purpose of these correctly, wouldn't Size or size_t > be a more appropriate type? Andy had used Size in his earlier patch. But I didn't understand the reason behind it and Andy didn't give any reason. From the patch and the code around the changes some kind of int (so int64) looked better. But if there's a valid reason for using Size, I am fine with it too. Do we have a SQL datatype corresponding to Size? -- Best Wishes, Ashutosh
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-03-26 02:40, Justin Pryzby wrote: On Thu, Mar 12, 2020 at 08:08:46PM +0300, Alexey Kondratov wrote: On 09.03.2020 23:04, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote: tests for that. (I'm including your v8 untouched in hopes of not messing up the cfbot). My fixes avoid an issue if you try to REINDEX onto pg_default, I think due to moving system toast indexes. I was able to avoid this issue by adding a call to GetNewRelFileNode, even though that's already called by RelationSetNewRelfilenode(). Not sure if there's a better way, or if it's worth Alexey's v3 patch which added a tablespace param to RelationSetNewRelfilenode. Do you have any understanding of what exactly causes this error? I have tried to debug it a little bit, but still cannot figure out why we need this extra GetNewRelFileNode() call, or the mechanism by which it helps. The PANIC is from the smgr hashtable, which couldn't find an entry it expected. My very tentative understanding is that smgr is prepared to handle a *relation* which is dropped/recreated multiple times in a transaction, but it's *not* prepared to deal with a given RelFileNode(Backend) being dropped/recreated, since that's used as a hash key. I revisited it and solved it in a somewhat nicer way. I included your new solution regarding this part from 0004 into 0001. It seems that at least a tip of the problem was in that we tried to change tablespace to pg_default being already there. It's still not clear to me if there's an issue with your original way of adding a tablespace parameter to RelationSetNewRelfilenode(). Yes, it is not clear to me either. Many thanks for your review and fixups! There are some inconsistencies like mentions of SET TABLESPACE in error messages and so on.
I am going to refactor and include your fixes 0003-0004 into 0001 and 0002, but keep 0005 separate for now, since this part requires more understanding IMO (and comparison with the v4 implementation). I'd suggest to keep the CLUSTER/VACUUM FULL separate from REINDEX, in case Michael or someone else wants to progress one but cannot commit to both. Yes, sure, I did not have plans to merge everything into a single patch. So, it has taken much longer to understand and rework all these fixes and permission validations. Attached is the updated patch set.

0001:
— It is mostly the same, but refactored
— I also included your most recent fix for REINDEX DATABASE with allow_system_table_mods=1
— With this patch, REINDEX + TABLESPACE simply errors out when an index on a TOAST table is met and allow_system_table_mods=0

0002:
— I reworked it a bit, since REINDEX CONCURRENTLY is not allowed on system catalogs anyway; that is checked at the higher levels of statement processing. So we have to care about TOAST relations
— Also added the same check into the plain REINDEX
— It works fine, but I am not entirely happy that with this patch errors/warnings are a bit inconsistent:

template1=# REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_12773_index TABLESPACE pg_default;
WARNING: skipping tablespace change of "pg_toast_12773_index"
DETAIL: Cannot move system relation, only REINDEX CONCURRENTLY is performed.
template1=# REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_12773 TABLESPACE pg_default;
ERROR: permission denied: "pg_toast_12773" is a system catalog

And REINDEX DATABASE CONCURRENTLY will generate a warning again. Maybe we should always throw a warning and do only the reindex if it is not possible to change tablespace?

0003:
— I have gotten rid of some of the previous refactoring pieces, like check_relation_is_movable, for now.
Let all these validations settle, and then we can think about whether we could do it better
— Added CLUSTER to copy/equalfuncs
— Cleaned up messages and comments

I hope that I did not forget anything from your proposals. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From 692bcadfacfa0c47fec7b6969525d33d0cac1f83 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 24 Mar 2020 18:16:06 +0300 Subject: [PATCH v12 3/3] Allow CLUSTER and VACUUM FULL to change tablespace --- doc/src/sgml/ref/cluster.sgml | 11 - doc/src/sgml/ref/vacuum.sgml | 11 + src/backend/commands/cluster.c| 58 --- src/backend/commands/vacuum.c | 51 ++-- src/backend/nodes/copyfuncs.c | 2 + src/backend/nodes/equalfuncs.c| 2 + src/backend/parser/gram.y | 45 -- src/backend/postmaster/autovacuum.c | 1 + src/include/commands/cluster.h| 2 +- src/include/commands/vacuum.h | 2 + src/include/nodes/parsenodes.h| 2 + src/test/regress/input/tablespace.source | 23 - src/test/regress/output/tablespace.source | 37 +++
Tab completion for \gx
While reviewing the patch for \gf, I noticed that \gx does not have tab completion for its optional filename. Trivial patch attached. I would also suggest this be backpatched. -- Vik Fearing diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index ae35fa4aa9..7252b6c4e6 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -3882,7 +3882,7 @@ psql_completion(const char *text, int start, int end) COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_routines, NULL); else if (TailMatchesCS("\\sv*")) COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_views, NULL); - else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\i|\\include|" + else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\gx|\\i|\\include|" "\\ir|\\include_relative|\\o|\\out|" "\\s|\\w|\\write|\\lo_import")) {
Re: proposal \gcsv
On 3/24/20 3:02 AM, Pavel Stehule wrote: > Hi > > rebase Thank you, Pavel. I have now had time to review it, and it looks good to me except for two issues. The first is, even though I suggested gf, I think it should actually be gfmt. There may be something else in the future that starts with f and we shouldn't close ourselves off to it. The second is tab completion doesn't work for the second argument. Adding the following fixes that: diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index ed6945a7f12..9d8cf442972 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -3786,6 +3786,12 @@ psql_completion(const char *text, int start, int end) COMPLETE_WITH_CS("aligned", "asciidoc", "csv", "html", "latex", "latex-longtable", "troff-ms", "unaligned", "wrapped"); + else if (TailMatchesCS("\\gf", MatchAny)) + { + completion_charp = "\\"; + completion_force_quote = false; + matches = rl_completion_matches(text, complete_from_files); + } else if (TailMatchesCS("\\h|\\help")) COMPLETE_WITH_LIST(sql_commands); After some opinions on the first issue and fixing the second, I think this is good to be committed. -- Vik Fearing
Re: Patch: to pass query string to pg_plan_query()
On Thu, Mar 26, 2020 at 11:44:44AM -0400, Tom Lane wrote: > Fujii Masao writes: > > Does anyone object to this patch? I'm thinking to commit it separately > > at first before committing the planning_counter_in_pg_stat_statements > > patch. > > I took a quick look through v9-0001-Pass-query-string-to-the-planner.patch > and it's fine by me. It also matches up with something I've wanted to > do for awhile, which is to make the query string available during > planning and execution so that we can produce error cursors for > run-time errors, when relevant. > > (It's a little weird that the patch doesn't make standard_planner > actually *do* anything with the string, like say save it into > the PlannerInfo struct. But that can come later I guess.) > > Note that I wouldn't want to bet that all of these call sites always have > non-null query strings to pass; but probably most of the time they will. Surprisingly, the whole regression test suite passes flawlessly with a non-null query string assert, but we did have some discussion about it. The pending IVM patch would break that assumption, as would some nontrivial extensions like citus (see https://www.postgresql.org/message-id/flat/CAFMSG9HJQr%3DH8doWJOp%3DwqyKbVqxMLkk_Qu2KfpmkKvS-Xn7qQ%40mail.gmail.com#ab8ea541b8c8464f7b52ba6d8d480b7d and later), so we didn't make it a hard requirement.
Re: Make mesage at end-of-recovery less scary.
On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut wrote: > HINT: This is to be expected if this is the end of the WAL. Otherwise, > it could indicate corruption. First, I agree that this general issue is a problem, because it's come up for me in quite a number of customer situations. Either people get scared when they shouldn't, because the message is innocuous, or they don't get scared about other things that actually are scary, because if some scary-looking messages are actually innocuous, it can lead people to believe that the same is true in other cases. Second, I don't really like the particular formulation you have above, because the user still doesn't know whether or not to be scared. Can we figure that out? If we're in crash recovery, I think that we should not be scared, because we have no alternative to assuming that we've reached the end of WAL, so all crash recoveries will end like this. If we're in archive recovery, we should definitely be scared if we haven't yet reached the minimum recovery point, because more WAL than that should certainly exist. After that, it depends on how we got the WAL. If it's being streamed, the question is whether we've reached the end of what got streamed. If it's being copied from the archive, we ought to have the whole segment, but maybe not more. Can we get the right context to the point where the error is being reported to know whether we hit the error at the end of the WAL that was streamed? If not, can we somehow rejigger things so that we only make it sound scary if we keep getting stuck at the same point when we would've expected to make progress meanwhile? I'm just spitballing here, but it would be really good if there's a way to know definitely whether or not you should be scared. Corrupted WAL segments are definitely a thing that happens, but retries are a lot more common. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
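The decision tree sketched in the message above can be written down explicitly. Here is a hypothetical encoding of it (the predicates are assumptions about context that would have to be plumbed through to the error-reporting site; they are not existing xlog.c fields):

```python
from enum import Enum, auto

class WalSource(Enum):
    CRASH_RECOVERY = auto()  # replaying pg_wal after a crash
    STREAMED = auto()        # WAL received from a walsender
    ARCHIVE = auto()         # segments restored via restore_command

def end_of_wal_is_scary(source: WalSource,
                        reached_min_recovery_point: bool,
                        consumed_all_received_wal: bool) -> bool:
    """Should an invalid record be reported as possible corruption?"""
    if source is WalSource.CRASH_RECOVERY:
        # Crash recovery has no choice but to assume end-of-WAL here.
        return False
    if not reached_min_recovery_point:
        # More WAL than this must exist; stopping early is alarming.
        return True
    # Past minRecoveryPoint it depends on whether we ran out of the WAL
    # we were actually given (streamed bytes, or a whole archived segment).
    return not consumed_all_received_wal

assert end_of_wal_is_scary(WalSource.CRASH_RECOVERY, True, True) is False
assert end_of_wal_is_scary(WalSource.ARCHIVE, False, True) is True
assert end_of_wal_is_scary(WalSource.STREAMED, True, True) is False
```

The open question raised in the message is exactly whether the last two inputs can be made available at the point where the invalid record is detected.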
Re: backup manifests
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost wrote: > > > That's a fair argument, but I think the other relevant principle is > > > that we try to give people useful defaults for things. I think that > > > checksums are a sufficiently useful thing that having the default be > > > not to do it doesn't make sense. I had the impression that you and > > > David were in agreement on that point, actually. > > > > I agree with wanting to have useful defaults and that checksums should > > be included by default, and I'm alright even with letting people pick > > what algorithms they'd like to have too. The construct here is made odd > > because we've got this idea that "no checksum" is an option, which is > > actually something that I don't particularly like, but that's what's > > making this particular syntax weird. I don't suppose you'd be open to > > the idea of just dropping that though..? There wouldn't be any issue > > with this syntax if we just always had checksums included when a > > manifest is requested. :) > > > > Somehow, I don't think I'm going to win that argument. > > Well, it's not a crazy idea. So, at some point, I had the idea that > you were always going to get a manifest, and therefore you should at > least ought to have the option of not checksumming to avoid the > overhead. But, as things stand now, you can suppress the manifest > altogether, so that you can still take a backup even if you've got no > disk space to spool the manifest on the master. So, if you really want > no overhead from manifests, just don't have a manifest. And if you are > OK with some overhead, why not at least have a CRC-32C checksum, which > is, after all, pretty cheap? > > Now, on the other hand, I don't have any strong evidence that the > manifest-without-checksums mode is useless. You can still use it to > verify that you have the correct files and that those files have the > expected sizes. 
And, verifying those things is very cheap, because you > only need to stat() each file, not open and read them all. True, you > can do those things by using pg_validatebackup -s. But, you'd still > incur the (admittedly fairly low) overhead of computing checksums that > you don't intend to use. > > This is where I feel like I'm trying to make decisions in a vacuum. If > we had a few more people weighing in on the thread on this point, I'd > be happy to go with whatever the consensus was. If most people think > having both --no-manifest (suppressing the manifest completely) and > --manifest-checksums=none (suppressing only the checksums) is useless > and confusing, then sure, let's rip the latter one out. If most people > like the flexibility, let's keep it: it's already implemented and > tested. But I hate to base the decision on what one or two people > think. I'm frustrated at the lack of involvement from others also. Just to be clear- I'm not completely against having a 'manifest but no checksum' option, but if that's what we're going to have then it seems like the syntax should be such that if you don't specify checksums then you don't get checksums and "MANIFEST_CHECKSUM none" shouldn't be a thing. All that said, as I said up-thread, I appreciate that we aren't designing SQL here and that this is pretty special syntax to begin with, so if you ended up committing it the way you have it now, so be it, I wouldn't be asking for it to be reverted over this. It's a bit awkward and kind of a thorn, but it's not entirely unreasonable, and we'd probably end up there anyway if we started out without a 'none' option and someone did come up with a good argument and a patch to add such an option in the future. > > > Well, the 512MB "limit" for CRC-32C means only that for certain very > > > specific types of errors, detection is not guaranteed above that file > > > size. 
So if you have a single flipped bit, for example, and the file > > > size is greater than 512MB, then CRC-32C has only a 99.999767169% > > > chance of detecting the error, whereas if the file size is less than > > > 512MB, it is 100% certain, because of the design of the algorithm. But > > > nine nines is plenty, and neither SHA nor our page-level checksums > > > provide guaranteed error detection properties anyway. > > > > Right, so we know that CRC-32C has an upper-bound of 512MB to be useful > > for exactly what it's designed to be useful for, but we also know that > > we're going to have larger files- at least 1GB ones, and quite possibly > > larger, so why are we choosing this? > > > > At the least, wouldn't it make sense to consider a larger CRC, one whose > > limit is above the size of commonly expected files, if we're going to > > use a CRC? > > I mean, you're just repeating the same argument here, and it's just > not valid. Regardless of the file size, the chances of a false > checksum match are literally less than one in a billion. There is > every reason to believe that users will be
Re: [PATCH] Fix CommitTransactionCommand() to CallXactCallbacks() in TBLOCK_ABORT_END
On 25/03/2020 at 03:49, Dave Sharpe wrote:
> Hi pghackers,
>
> This is my first time posting here ... Gilles Darold and I are developing a new FDW which is based on the contrib/postgres_fdw. The postgres_fdw logic uses a RegisterXactCallback function to send local transactions remote. But I found that a registered XactCallback is not always called for a successful client ROLLBACK statement. So, a successful local ROLLBACK does not get sent remote by FDW, and now the local and remote transaction states are messed up, out of sync. The local database is "eating" the successful rollback.
>
> I attach a git format-patch file 0001-Fix-CommitTransactionCommand-to-CallXactCallbacks-in.patch
>
> The patch fixes the problem and is ready to commit as far as I can tell. It adds some comment lines and three lines of code to CommitTransactionCommand() in the TBLOCK_ABORT_END case. Line (1) to reset the transaction's blockState back to TBLOCK_ABORT, ahead of (2) a new call to CallXactCallbacks(). If the callback returns successfully (which is usually the case), (3) the new code switches back to TBLOCK_ABORT_END, then continues with the old code to CleanupTransaction() as before. If any callback does error out, TBLOCK_ABORT was the correct block state for an error.
>
> I have not added a regression test since my scenario involves a C module ... I didn't see such a regression test, but somebody can teach me where to put the C module and Makefile if a regression test should be added. Heads up that the scenario to hit this is a bit complex (to me) ... I attach the following unit test files:
>
> 1) eat_rollback.c, _PG_init() installs my_utility_hook() for INFO log for debugging.
>
> RegisterSubXactCallback(mySubtransactionCallback) which injects some logging and an ERROR for savepoints, i.e., negative testing, e.g., like a remote FDW savepoint failed.
>
> And RegisterXactCallback(myTransactionCallback) with info logging.
> 2) Makefile, to make the eat_rollback.so
>
> 3) eat_rollback2.sql, drives the fail sequence, especially the SAVEPOINT, which errors out; then later a successful ROLLBACK gets incorrectly eaten (no CALLBACK info logging, demonstrates the bug), then another successful ROLLBACK (now there is CALLBACK info logging).
>
> 4) eat_rollback2.out, output without the fix; the rollback at line 68 is successful but there is no myTransactionCallback() INFO output.
>
> 5) eat_rollback2.fixed, output after the fix; the rollback at line 68 is successful, and now there is a myTransactionCallback() INFO log. Success!
>
> It first failed for me in v11.1, and I suspect it failed since before then too. And it is still failing in current master.
>
> By the way, we worked around the bug in our FDW by handling sub/xact in the utility hook. A transaction callback is still needed for implicit, internal rollbacks. Another advantage of the workaround is (I think) it reduces the code complexity and improves performance, because the entire subxact stack is not unwound to drive each and every subtransaction level to remote. Also sub/transaction statements are sent remote as they arrive (local and remote are more "transactionally" synced, not stacked by FDW for later). And of course, the workaround doesn't hit the above bug, since our utility hook correctly handles the client ROLLBACK. If it makes sense to the community, I could try and patch contrib/postgres_fdw to do what we are doing. But note that I don't need it myself: we have our own new FDW for remote DB2 z/OS (!) ... at LzLabs we are building a software-defined mainframe with PostgreSQL (of all things).
>
> Dave Sharpe
> /I don't necessarily agree with everything I say./ (MM)
> www.lzlabs.com

Hi, As I'm involved in this thread I have reviewed this bug report, and I don't think that there is a problem here; as a disclaimer, though, my knowledge of internal transaction management is probably not enough to have a definitive opinion. Actually the callback function is called when the error is thrown:

psql:eat_rollback2.sql:20: INFO: 0: myTransactionCallback() XactEvent 2 (is abort) level 1 <---
LOCATION: myTransactionCallback, eat_rollback.c:52
psql:eat_rollback2.sql:20: ERROR: XX000: no no no
LOCATION: mySubtransactionCallback, eat_rollback.c:65

This is probably why the callback is not called on the subsequent ROLLBACK execution: abort processing is already done (src/backend/access/transam/xact.c:3890). With this simple test and the debug extension:

BEGIN;
DROP INDEX "index2", "index3"; -- will fail, indexes don't exist
ROLLBACK;

the callback function is also called at error level, not on the ROLLBACK:

-BEGIN
psql:eat_rollback-2/sql/eat_rollback3.sql:39: INFO: 0:
Re: Columns correlation and adaptive query optimization
On 25.03.2020 20:04, Rafia Sabih wrote:
> Also, there is no description about any of the functions here, wouldn’t hurt having some more comments there.

Attached please find a new version of the patch with more comments and descriptions added.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..8ab95a1 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"

 #include
+#include
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"

 PG_MODULE_MAGIC;

@@ -33,7 +53,9 @@
 static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;

 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},

@@ -84,6 +106,7 @@
 static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);

 /*
  * Module load callback

@@ -218,6 +241,30 @@ _PG_init(void)
 							 NULL,
 							 NULL);

+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+							 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+							 "Zero disables implicit creation of multicolumn statistic.",
+							 &auto_explain_add_statistics_threshold,
+							 0.0,
+							 0.0,
+							 INT_MAX,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+							 "Do not create statistic but just record in WAL suggested create statistics statement.",
+							 NULL,
+							 &auto_explain_suggest_only,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");

 	/* Install hooks. */

@@ -349,6 +396,276 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }

+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+									   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char	   *va = strVal((Value *) linitial(((ColumnRef *) lfirst(a))->fields));
+	char	   *vb = strVal((Value *) linitial(((ColumnRef *) lfirst(b))->fields));
+
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(void *qual, ExplainState *es)
+{
+	List	   *vars = NULL;
+	ListCell   *lc;
+
+	/* Extract vars from all quals */
+	foreach(lc, qual)
+	{
+		Node	   *node = (Node *) lfirst(lc);
+
+		if (IsA(node, RestrictInfo))
+			node = (Node *) ((RestrictInfo *) node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell   *cell;
+		List	   *cols = NULL;
+		Index		varno = 0;
+		Bitmapset  *colmap = NULL;
+
+		/* Contruct list of unique vars */
+		foreach(cell, vars)
+		{
+			Node	   *node = (Node *) lfirst(cell);
+
+			if (IsA(node, Var))
+			{
+				Var		   *var = (Var *) node;
+
+				if (cols == NULL || var->varno == varno)
+				{
+					varno = var->varno;
+					if (var->varattno > 0 &&
+						!bms_is_member(var->varattno, colmap) &&
+						varno >= 1 && /* not synthetic var */
Re: Patch: to pass query string to pg_plan_query()
Fujii Masao writes:
> Does anyone object to this patch? I'm thinking to commit it separately at first before committing the planning_counter_in_pg_stat_statements patch.

I took a quick look through v9-0001-Pass-query-string-to-the-planner.patch and it's fine by me. It also matches up with something I've wanted to do for a while, which is to make the query string available during planning and execution so that we can produce error cursors for run-time errors, when relevant. (It's a little weird that the patch doesn't make standard_planner actually *do* anything with the string, like say save it into the PlannerInfo struct. But that can come later I guess.) Note that I wouldn't want to bet that all of these call sites always have non-null query strings to pass; but probably most of the time they will. regards, tom lane
Re: backup manifests
On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost wrote: > > That's a fair argument, but I think the other relevant principle is > > that we try to give people useful defaults for things. I think that > > checksums are a sufficiently useful thing that having the default be > > not to do it doesn't make sense. I had the impression that you and > > David were in agreement on that point, actually. > > I agree with wanting to have useful defaults and that checksums should > be included by default, and I'm alright even with letting people pick > what algorithms they'd like to have too. The construct here is made odd > because we've got this idea that "no checksum" is an option, which is > actually something that I don't particularly like, but that's what's > making this particular syntax weird. I don't suppose you'd be open to > the idea of just dropping that though..? There wouldn't be any issue > with this syntax if we just always had checksums included when a > manifest is requested. :) > > Somehow, I don't think I'm going to win that argument. Well, it's not a crazy idea. So, at some point, I had the idea that you were always going to get a manifest, and therefore you should at least ought to have the option of not checksumming to avoid the overhead. But, as things stand now, you can suppress the manifest altogether, so that you can still take a backup even if you've got no disk space to spool the manifest on the master. So, if you really want no overhead from manifests, just don't have a manifest. And if you are OK with some overhead, why not at least have a CRC-32C checksum, which is, after all, pretty cheap? Now, on the other hand, I don't have any strong evidence that the manifest-without-checksums mode is useless. You can still use it to verify that you have the correct files and that those files have the expected sizes. And, verifying those things is very cheap, because you only need to stat() each file, not open and read them all. 
True, you can do those things by using pg_validatebackup -s. But, you'd still incur the (admittedly fairly low) overhead of computing checksums that you don't intend to use. This is where I feel like I'm trying to make decisions in a vacuum. If we had a few more people weighing in on the thread on this point, I'd be happy to go with whatever the consensus was. If most people think having both --no-manifest (suppressing the manifest completely) and --manifest-checksums=none (suppressing only the checksums) is useless and confusing, then sure, let's rip the latter one out. If most people like the flexibility, let's keep it: it's already implemented and tested. But I hate to base the decision on what one or two people think. > > Well, the 512MB "limit" for CRC-32C means only that for certain very > > specific types of errors, detection is not guaranteed above that file > > size. So if you have a single flipped bit, for example, and the file > > size is greater than 512MB, then CRC-32C has only a 99.999767169% > > chance of detecting the error, whereas if the file size is less than > > 512MB, it is 100% certain, because of the design of the algorithm. But > > nine nines is plenty, and neither SHA nor our page-level checksums > > provide guaranteed error detection properties anyway. > > Right, so we know that CRC-32C has an upper-bound of 512MB to be useful > for exactly what it's designed to be useful for, but we also know that > we're going to have larger files- at least 1GB ones, and quite possibly > larger, so why are we choosing this? > > At the least, wouldn't it make sense to consider a larger CRC, one whose > limit is above the size of commonly expected files, if we're going to > use a CRC? I mean, you're just repeating the same argument here, and it's just not valid. Regardless of the file size, the chances of a false checksum match are literally less than one in a billion. 
There is every reason to believe that users will be happy with a low-overhead method that has a 99.999+% chance of detecting corrupt files. I do agree that a 64-bit CRC would probably be not much more expensive and improve the probability of detecting errors even further, but I wanted to restrict this patch to using infrastructure we already have. The choices there are the various SHA functions (so I supported those), MD5 (which I deliberately omitted, for reasons I hope you'll be the first to agree with), CRC-32C (which is fast), a couple of other CRC-32 variants (which I omitted because they seemed redundant and one of them only ever existed in PostgreSQL because of a coding mistake), and the hacked-up version of FNV that we use for page-level checksums (which is only 16 bits and seems to have no advantages for this purpose). > > I'm not sure why the fact that it's a 40-year-old algorithm is > > relevant. There are many 40-year-old algorithms that are very good. > > Sure there are, but there probably wasn't a lot of thought about > GB-sized files, and this doesn't really seem to be th
Re: error context for vacuum to include block number
On Thu, Mar 26, 2020 at 08:56:54PM +0900, Masahiko Sawada wrote:
> 1.
> @@ -1844,9 +1914,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
> 	int			uncnt = 0;
> 	TransactionId visibility_cutoff_xid;
> 	bool		all_frozen;
> +	LVRelStats	olderrcbarg;
>
> 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
>
> +	/* Update error traceback information */
> +	olderrcbarg = *vacrelstats;
> +	update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
> +							  blkno, NULL, false);
>
> Since we update vacrelstats->blkno in the loop in lazy_vacuum_heap() we unnecessarily update blkno twice to the same value. Also I think we don't need to revert back the callback arguments in lazy_vacuum_page(). Perhaps we can either remove the change of lazy_vacuum_page() or move the code updating vacrelstats->blkno to the beginning of lazy_vacuum_page(). I prefer the latter.

We want the error callback to be in place during lazy_scan_heap, since it calls ReadBufferExtended(). We can't remove the change in lazy_vacuum_page, since it's also called from lazy_scan_heap, if there are no indexes. We want lazy_vacuum_page to "revert back" since we go from "scanning heap" to "vacuuming heap". lazy_vacuum_page was the motivation for saving and restoring the callback arguments; otherwise lazy_scan_heap() would have to clean up after the function it called, which was unclean. Now, every function cleans up after itself. Does that address your comment?

> +static void
> +update_vacuum_error_cbarg(LVRelStats *errcbarg, int phase, BlockNumber blkno,
> +						  char *indname, bool free_oldindname)
>
> I'm not sure why "free_oldindname" is necessary. Since we initialize vacrelstats->indname with NULL and revert the callback arguments at the end of functions that need to update them, vacrelstats->indname is NULL at the beginning of lazy_vacuum_index() and lazy_cleanup_index(). And we make a copy of the index name in update_vacuum_error_cbarg().
> So I think we can pfree the old index name if errcbarg->indname is not NULL.

We want to avoid doing this:

olderrcbarg = *vacrelstats              // saves a pointer
update_vacuum_error_cbarg(... NULL);    // frees the pointer and sets indname to NULL
update_vacuum_error_cbarg(... olderrcbarg.oldindnam)
                                        // puts back the pointer, which has been freed
                                        // hit an error, and the callback accesses the pfreed pointer

I think that's only an issue for lazy_vacuum_index(). And I think you're right: we only save state when the calling function has indname=NULL, so we never "put back" a non-NULL indname. We go from having indname=NULL at lazy_scan_heap to non-NULL at lazy_vacuum_index, and never the other way around. So once we've "reverted back", 1) the pointer is null; and, 2) the callback function doesn't access it for the previous/reverted phase anyway.

Hm, I was just wondering what happens if an error happens *during* update_vacuum_error_cbarg(). It seems like if we set errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname, then an error would cause a crash. And if we pfree and set indname before phase, it'd be a problem when going from an index phase to a non-index phase. So maybe we have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the function, and errcbarg->phase=phase last.

-- Justin
Re: Columns correlation and adaptive query optimization
Thank you very much for the review.

On 25.03.2020 20:04, Rafia Sabih wrote:
> +static void
> +AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
> +
> This doesn't look like the right place for it, you might want to declare it with the other functions at the start of the file. Also, there is no description about any of the functions here, wouldn’t hurt having some more comments there.

Sorry, I will fix it. Actually this patch consists of two independent parts: the first allows the auto_explain extension to generate multicolumn statistics for variables used in clauses for which selectivity estimation gives a wrong result. It affects only the auto_explain extension. The second part allows multicolumn statistics to be used for join selectivity estimation. As far as I know, extended statistics are now being actively improved: https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c I think that using extended statistics for join selectivity is very important and should also be addressed. If my approach is not good enough, I will be pleased to hear other suggestions.

> A few more questions that cross my mind at this point,
> - have you tried measuring the extra cost we have to pay for these extra statistics, and also compared it with the benefit it gives in terms of accuracy?

Adding statistics does not always lead to a performance improvement, but I have never observed any performance degradation caused by the presence of extended statistics. Certainly we can manually create too many extended statistics entries for different subsets of columns, and that increases planning time because the optimizer has to consider more alternatives, but in practice I never noticed such a slowdown.

> - I would also be interested in understanding if there are cases when adding this extra step doesn’t help, and have you excluded them already or are some of them easily identifiable at this stage...?

Unfortunately there are many cases when extended statistics can not help.
Either because the optimizer is not able to use them (for example, my patch considers only cases with strict equality comparisons, so if you use a predicate like "a.x=b.x and a.y in (1,2,3)" the extended statistics can not be used), or because the collected statistics themselves are not precise enough, especially in case of data skew.

> - is there any limit on the number of columns for which this will work, or should there be any such limit...?

Right now there is a limit on the maximum number of columns used in extended statistics: 8 columns. But in practice I rarely see join predicates involving more than 3 columns.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Patch: to pass query string to pg_plan_query()
On Thu, Mar 26, 2020 at 10:54:35PM +0900, Fujii Masao wrote:
> > On 2020/03/10 6:31, legrand legrand wrote:
> > Hello,
> >
> > This is a call for committers, reviewers and users, regarding the "planning counters in pg_stat_statements" patch [1] but not only.
>
> Does anyone object to this patch? I'm thinking to commit it separately at first before committing the planning_counter_in_pg_stat_statements patch.
>
> > Historically, this version of pg_stat_statements with planning counters was performing 3 calls to pgss_store() for non-utility statements in:
> > 1 - pgss_post_parse_analyze (init entry with queryid and store query text)
> > 2 - pgss_planner_hook (to store planning counters)
> > 3 - pgss_ExecutorEnd (to store execution counters)
> >
> > Then a new version was proposed to remove one call to pgss_store() by adding the query string to the planner pg_plan_query():
>
> But pgss_store() still needs to be called three times even in a non-utility command if the query has constants. Right?

Yes indeed, this version is actually adding the 3rd pgss_store call. Passing the query string is a collateral requirement in case the entry disappeared between post parse analysis and planning (which is quite possible with prepared statements at least), as pgss will in this case fall back to storing the as-is query string, which is still better than no query text at all.
> > So please give us your feedback regarding this patch "to pass query string to the planner", if you have other use cases, or any comment regarding core architecture.
>
> *As far as I heard*, the pg_hint_plan extension uses a very tricky way to extract the query string in the planner hook. So this patch would be very helpful to make pg_hint_plan avoid using such a tricky way.

+1
Re: replay pause vs. standby promotion
Hello

> If we don't ignore walreceiver and do try to connect to the master, a promotion and recovery cannot end forever since new WAL data can be streamed. You think this behavior is more consistent?

We have no simple point at which to stop replay. Well, except for "immediately" - just one easy stop. But I agree that this is not the best option: simple and clear, but not the best one for the data when we want to replay as much as possible from the archive.

> IMO it's valid to replay all the WAL data available to avoid data loss before a promotion completes.

But in the case of a still-working primary (with archive_command), we choose a fairly random time to promote: a time when the primary has not yet archived the new WAL segment, or even when a temporary error of restore_command occurs. We mention just the cp command in the docs, and I know users use cp (e.g. from NFS) without further error handling.

regards, Sergei
Re: Resolving the python 2 -> python 3 mess
Marco Atzeri writes: > Am 17.02.2020 um 17:49 schrieb Tom Lane: >> 1. On platforms where Python 2.x is still supported, recommend that >> packagers continue to build both plpython2 and plpython3, same as now. > there is some documentation on how to build both ? > The INSTALL gives no hint. It's explained in the plpython documentation: basically you have to configure and build the source tree twice (although I think the second time you can just cd into src/pl/plpython and build/install only that much). regards, tom lane
Re: adding partitioned tables to publications
On Wed, Mar 25, 2020 at 9:29 PM Peter Eisentraut wrote: > On 2020-03-23 06:02, Amit Langote wrote: > > Okay, added some tests. > > > > Attached updated patches. > > I have committed the worker.c refactoring patch. > > "Add subscription support to replicate into partitioned tables" still > has lacking test coverage. Your changes in relation.c are not exercised > at all because the partitioned table branch in apply_handle_update() is > never taken. This is critical and tricky code, so I would look for > significant testing. While trying some tests around the code you mentioned, I found what looks like a bug, which looking into now. > The code looks okay to me. I would remove this code > > + memset(entry->attrmap->attnums, -1, > + entry->attrmap->maplen * sizeof(AttrNumber)); > > because the entries are explicitly filled right after anyway, and > filling the bytes with -1 has an unclear effect. There is also > seemingly some fishiness in this code around whether attribute numbers > are zero- or one-based. Perhaps this could be documented briefly. > Maybe I'm misunderstanding something. Will check and fix as necessary. -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: pg_checksums in backend/storage/page/README
On Thu, Mar 26, 2020 at 03:02:42PM +0100, Daniel Gustafsson wrote: > I noticed in passing that backend/storage/page/README hadn't gotten the memo > about pg_checksums. The attached tiny diff adds a small mention of it. Good catch! LGTM
Re: A bug when use get_bit() function for a long bytea string
Ashutosh Bapat writes: > On Wed, 18 Mar 2020 at 08:18, movead...@highgo.ca > wrote: >> if we change return type of all those functions to int64, we won't need >> this cast. >> I change the 'encode' function, it needs an int64 return type, but keep >> other >> functions in 'pg_encoding', because I think it of no necessary reason. > Ok, let's leave it for a committer to decide. If I'm grasping the purpose of these correctly, wouldn't Size or size_t be a more appropriate type? And I definitely agree with changing all of these APIs at once, if they're all dealing with the same kind of value. regards, tom lane
Re: pg_checksums in backend/storage/page/README
On Thu, Mar 26, 2020 at 3:02 PM Daniel Gustafsson wrote: > > I noticed in passing that backend/storage/page/README hadn't gotten the memo > about pg_checksums. The attached tiny diff adds a small mention of it. That seems obvious enough. Pushed, thanks! -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: plan cache overhead on plpgsql expression
Andres Freund writes: > On 2020-03-25 17:51:50 -0400, Tom Lane wrote: >> Andres Freund writes: >>> Hm, any chance that the multiple resowner calls here could show up in a >>> profile? Probably not? >> Doubt it. On the other hand, as the code stands it's certain that the >> resowner contains nothing but plancache pins (while I was writing the >> patch it wasn't entirely clear that that would hold). So we could >> drop the two unnecessary calls. There are assertions in >> ResourceOwnerDelete that would fire if we somehow missed releasing >> anything, so it doesn't seem like much of a maintenance hazard. > One could even argue that that's a nice crosscheck: Due to the later > release it'd not actually be correct to just add "arbitrary" things to > that resowner. I had a thought about a possibly-cleaner way to do this. We could invent a resowner function, say ResourceOwnerReleaseAllPlanCacheRefs, that explicitly releases all plancache pins it knows about. So plpgsql would not call the regular ResourceOwnerRelease entry point at all, but just call that and then ResourceOwnerDelete, again relying on the assertions therein to catch anything not released. This would be slightly more code but it'd perhaps make it clearer what's going on, without the cost of a duplicative data structure. Perhaps in future there'd be use for similar calls for other resource types. regards, tom lane
pg_checksums in backend/storage/page/README
I noticed in passing that backend/storage/page/README hadn't gotten the memo about pg_checksums. The attached tiny diff adds a small mention of it. cheers ./daniel [Attachment: cksum_page_readme.diff]
Re: Patch: to pass query string to pg_plan_query()
On 2020/03/10 6:31, legrand legrand wrote:
> Hello,
>
> This is a call for committers, reviewers and users, regarding the "planning counters in pg_stat_statements" patch [1] but not only.

Does anyone object to this patch? I'm thinking to commit it separately at first before committing the planning_counter_in_pg_stat_statements patch.

> Historically, this version of pg_stat_statements with planning counters was performing 3 calls to pgss_store() for non-utility statements in:
> 1 - pgss_post_parse_analyze (init entry with queryid and store query text)
> 2 - pgss_planner_hook (to store planning counters)
> 3 - pgss_ExecutorEnd (to store execution counters)
>
> Then a new version was proposed to remove one call to pgss_store() by adding the query string to the planner pg_plan_query():

But pgss_store() still needs to be called three times even in a non-utility command if the query has constants. Right?

> 1 - pgss_planner_hook (to store planning counters)
> 2 - pgss_ExecutorEnd (to store execution counters)
>
> Many performance tests were performed, concluding that there is no impact on this subject.

Sounds good!

> The patch "to pass query string to the planner" could be committed by itself, and (maybe) used by other extensions.
>
> If this was done, this new version of pgss with planning counters could be committed as well, or even later (being used as a non-core extension starting with pg13).
>
> So please give us your feedback regarding this patch "to pass query string to the planner", if you have other use cases, or any comment regarding core architecture.

*As far as I heard*, the pg_hint_plan extension uses a very tricky way to extract the query string in the planner hook. So this patch would be very helpful to make pg_hint_plan avoid using such a tricky way.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
Re: A bug when use get_bit() function for a long bytea string
On Wed, 18 Mar 2020 at 08:18, movead...@highgo.ca wrote: > > Hello thanks for the detailed review, > > >I think the second argument indicates the bit position, which would be > max bytea length * 8. If max bytea length covers whole int32, the second > argument >needs to be wider i.e. int64. > Yes, it makes sense; followed. > > I think we need a similar change in byteaGetBit() and byteaSetBit() as well. > > > Some more comments on the patch > > struct pg_encoding > > { > >- unsigned (*encode_len) (const char *data, unsigned dlen); > >+ int64 (*encode_len) (const char *data, unsigned dlen); > > unsigned (*decode_len) (const char *data, unsigned dlen); > > unsigned (*encode) (const char *data, unsigned dlen, char *res); > > unsigned (*decode) (const char *data, unsigned dlen, char *res); > > Why not use return type of int64 for rest of the functions here as well? > > res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result)); > > /* Make this FATAL 'cause we've trodden on memory ... */ > >- if (res > resultlen) > >+ if ((int64)res > resultlen) > > > >if we change return type of all those functions to int64, we won't need > this cast. > I changed the 'encode' function, as it needs an int64 return type, but kept > the other > > functions in 'pg_encoding', because I think there is no necessary reason. > > Ok, let's leave it for a committer to decide. > >Right now we are using int64 because bytea can be 1GB, but what if we > increase > >that limit tomorrow, will int64 be sufficient? That may be unlikely in > the near > >future, but nevertheless a possibility. Should we then define a new > datatype > >which resolves to int64 right now but really depends upon the bytea > length. I > >am not suggesting that we have to do it right now, but may be something to > >think about. > I decided to use int64 because, if we really want to increase the limit, it > should be > the same change for 'text' and 'varchar', which have the same limit. So it may > need > a more general struct.
Hence I gave up the idea. > > Hmm, let's see what a committer says. Some more review comments. + int64 res,resultlen; We need those on separate lines, possibly. + byteNo = (int32)(n / BITS_PER_BYTE); Does it hurt to have byteNo as int64 so as to avoid a cast? Otherwise, please add a comment explaining the reason for the cast. The comment applies at other places where this change appears. - int len; + int64 len; Why do we need this change? int i; > > > >It might help to add a test where we could pass the second argument > something > >greater than 1G. But it may be difficult to write such a test case. > Add two test cases. > > + +select get_bit( +set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 1024 * 1024 * 1024 + 1, 0) + ,1024 * 1024 * 1024 + 1); This bit position is still within int4. postgres=# select pg_column_size(1024 * 1024 * 1024 + 1); pg_column_size 4 (1 row) You want something like postgres=# select pg_column_size(512::bigint * 1024 * 1024 * 8); pg_column_size 8 (1 row) -- Best Wishes, Ashutosh
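As a back-of-the-envelope check of the review point above (plain Python, not PostgreSQL code), this shows why the bit-position argument must be widened to int64, and why the first test value still fits in int32 while the suggested replacement does not:

```python
BITS_PER_BYTE = 8
INT32_MAX = 2**31 - 1
MAX_BYTEA_BYTES = 1024**3  # bytea values can be up to 1 GB

# The highest addressable bit position in a maximum-size bytea
# overflows int32, hence the need for an int64 second argument.
max_bit_position = MAX_BYTEA_BYTES * BITS_PER_BYTE - 1
assert max_bit_position > INT32_MAX

# The position used in the proposed test case still fits in int32
# (pg_column_size reported 4 bytes above) ...
assert 1024 * 1024 * 1024 + 1 <= INT32_MAX

# ... while the suggested 512::bigint * 1024 * 1024 * 8 does not
# (pg_column_size reported 8 bytes).
assert 512 * 1024 * 1024 * 8 > INT32_MAX
```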
Re: Planning counters in pg_stat_statements (using pgss_store)
On Thu, Mar 26, 2020 at 08:08:35PM +0900, Fujii Masao wrote: > > On 2020/03/25 22:45, Julien Rouhaud wrote: > > On Wed, Mar 25, 2020 at 10:09:37PM +0900, Fujii Masao wrote: > > > + /* > > > + * We can't process the query if no query_text is provided, as > > > pgss_store > > > + * needs it. We also ignore query without queryid, as it would be > > > treated > > > + * as a utility statement, which may not be the case. > > > + */ > > > > > > Could you tell me why the planning stats are not tracked when executing > > > utility statements? In some utility statements like REFRESH MATERIALIZED > > > VIEW, > > > the planner would work. > > > > I explained that in [1]. The problem is that the underlying statement > > doesn't > > get the proper stmt_location and stmt_len, so you eventually end up with two > > different entries. > > It's not problematic to have two different entries in that case. Right? It will unnecessarily bloat the entries, and make users' lives harder too. This example is quite easy to deal with, but if the application is sending multi-query statements, you'll just end up with a mess that is impossible to handle properly. > The actual problem is that the statements reported in those entries are > very similar? For example, when "create table test as select 1;" is executed, > it's strange to get the following two entries, as you explained. > > create table test as select 1; > create table test as select 1 > > But it seems valid to get the following two entries in that case? > > select 1 > create table test as select 1 > > The former is the nested statement and the latter is the top statement. I think that there should only be 1 entry, the utility command. It seems easy to correlate the planning time to the underlying query, but I'm not entirely sure that the execution counters won't be impacted by the fact that it is being run in a utility statement.
Also, for now the current pgss behavior is to always merge underlying optimisable statements into the utility command, and it seems a bit late in this release cycle to revisit that. I'd be happy to work on improving that for the next release, among other things. For instance the total lack of normalization for utility commands [2] is also something that has been bothering me for a long time. In some workloads, you can end up with the entries almost entirely filled with 1-time-execution commands, just because it's using random identifiers, so you have no other choice than to disable track_utility, although it would have been useful for other commands. > Here are other comments. > > - if (jstate) > + if (kind == PGSS_JUMBLE) > > Why is PGSS_JUMBLE necessary? ISTM that we can still use jstate here, instead. > > If it's ok to remove PGSS_JUMBLE, we can define PGSS_NUM_KIND(=2) instead > and replace 2 in, e.g., total_time[2] with PGSS_NUM_KIND. Thought? Yes, we could be using jstate here. I originally used that to avoid passing PGSS_EXEC (or the other one) as a way to say "ignore this information as there's the jstate which says it's yet another meaning". If that's not an issue, I can change that, as PGSS_NUM_KIND will clearly improve on the explicit "2" all over the place. > + total_time > + double precision > + > + > +Total time spent planning and executing the statement, in > milliseconds > + > + > > pg_stat_statements view has this column but the function not. > We should make both have the column or not at all, for consistency? > I'm not sure if it's good thing to expose the sum of total_plan_time > and total_exec_time as total_time. If some users want that, they can > easily calculate it from total_plan_time and total_exec_time by using > their own logic. I think we originally added it as a way to avoid too much of a compatibility break, and also because it seems like a field most users will be interested in anyway.
Now that I'm thinking about it again, I indeed think it was a mistake to have that in the view part only. Not mainly for consistency, but for users who would be interested in the total_time field while not wanting to pay the overhead of retrieving the query text if they don't need it. So I'll change that! > + nested_level++; > + PG_TRY(); > > In an old thread [1], Tom Lane commented on the usage of nested_level > in the planner hook. There seems no reply to that so far. What's > your opinion about that comment? > > [1] https://www.postgresql.org/message-id/28980.1515803...@sss.pgh.pa.us Oh thanks, I didn't notice this part of the discussion. I agree with Tom's concern, and I think that having a specific nesting level variable for the planner is the best workaround, so I'll implement that. [2] https://twitter.com/fujii_masao/status/1242978261572837377
Re: replay pause vs. standby promotion
On 2020/03/25 19:42, Sergei Kornilov wrote: Hi Could we add a few words in func.sgml to clarify the behavior? Especially for users from my example above. Something like: If a promotion is triggered while recovery is paused, the paused state ends, replay of any WAL immediately available in the archive or in pg_wal will be continued and then a promotion will be completed. This description is true if pause is requested by pg_wal_replay_pause(), but not if the recovery target is reached and pause is requested by recovery_target_action=pause. In the latter case, even if there are WAL data available in pg_wal or the archive, they are not replayed, i.e., the promotion completes immediately. Probably we should document those two cases explicitly to avoid the confusion about a promotion and recovery pause? This is the description for pg_wal_replay_pause, but actually we suggest to call pg_wal_replay_resume in recovery_target_action=pause... So, I agree, we need to document both cases. PS: I think we have inconsistent behavior here... Reading WAL during promotion from local pg_wal AND calling restore_command, but ignoring the walreceiver, also seems strange for my DBA hat... If we don't ignore the walreceiver and do try to connect to the master, a promotion and recovery might never end, since new WAL data can keep being streamed. Do you think this behavior is more consistent? IMO it's valid to replay all the WAL data available to avoid data loss before a promotion completes. Regards, -- Fujii Masao NTT DATA CORPORATION Advanced Platform Technology Group Research and Development Headquarters
Re: potential stuck lock in SaveSlotToPath()
On 2020-03-25 17:56, Robert Haas wrote: On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut wrote: Any concerns about applying and backpatching the patch I posted? Not from me. The talk about reorganizing this code doesn't seem very concrete at the moment and would probably not be backpatch material anyway. +1. committed and backpatched -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: error context for vacuum to include block number
On Thu, 26 Mar 2020 at 15:34, Amit Kapila wrote: > > On Thu, Mar 26, 2020 at 12:03 PM Amit Kapila wrote: > > > > On Thu, Mar 26, 2020 at 10:11 AM Justin Pryzby wrote: > > > > > > Seems fine. Rather than saying "different phases", I would say: > > > "The index vacuum and heap vacuum phases may be called multiple times in > > > the > > > middle of the heap scan phase." > > > > > > > Okay, I have slightly adjusted the wording as per your suggestion. > > > > > But actually I think the concern is not that we unnecessarily "Revert > > > back to > > > the old phase" but that we do it in a *loop*. Which I agree doesn't make > > > sense, to go back and forth between "scanning heap" and "truncating". > > > > > > > Fair point. I have moved the change to the truncate phase at the > > caller of lazy_heap_truncate() which should address this concern. > > Sawada-San, does this address your concern? > > > > Forgot to attach the patch, doing now. Thank you for updating the patch! The changes around lazy_truncate_heap() look good to me. I have two comments: 1. @@ -1844,9 +1914,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, int uncnt = 0; TransactionId visibility_cutoff_xid; bool all_frozen; + LVRelStats olderrcbarg; pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno); + /* Update error traceback information */ + olderrcbarg = *vacrelstats; + update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP, + blkno, NULL, false); Since we update vacrelstats->blkno in the loop in lazy_vacuum_heap(), we unnecessarily update blkno twice to the same value. Also I think we don't need to revert the callback arguments in lazy_vacuum_page(). Perhaps we can either remove the change of lazy_vacuum_page() or move the code updating vacrelstats->blkno to the beginning of lazy_vacuum_page(). I prefer the latter. 2. +/* + * Update vacuum error callback for the current phase, block, and index. 
+ * + * free_oldindname is true if the previous "indname" should be freed. It must be + * false if the caller has copied the old LVRelStats, to avoid keeping a + * pointer to a freed allocation. In which case, the caller should call again + * with free_oldindname as true to avoid a leak. + */ +static void +update_vacuum_error_cbarg(LVRelStats *errcbarg, int phase, BlockNumber blkno, + char *indname, bool free_oldindname) I'm not sure why "free_oldindname" is necessary. Since we initialize vacrelstats->indname with NULL and revert the callback arguments at the end of the functions that need to update them, vacrelstats->indname is NULL at the beginning of lazy_vacuum_index() and lazy_cleanup_index(). And we make a copy of the index name in update_vacuum_error_cbarg(). So I think we can pfree the old index name if errcbarg->indname is not NULL. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
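The save-and-restore discipline under discussion can be sketched outside of C (plain Python, hypothetical names): the caller snapshots the callback arguments, switches them to the new phase, and restores the snapshot on exit, which is also why updating the same field twice to the same value is redundant:

```python
def update_error_cbarg(state, phase, blkno, indname=None):
    """Point the (hypothetical) error callback arguments at the current
    phase/block, like update_vacuum_error_cbarg()."""
    state.update(phase=phase, blkno=blkno, indname=indname)

def with_phase(state, phase, blkno, work):
    """Run `work` under a new phase, restoring the caller's arguments
    afterwards (like `olderrcbarg = *vacrelstats` ... restore)."""
    saved = dict(state)                  # snapshot the old arguments
    update_error_cbarg(state, phase, blkno)
    try:
        return work()
    finally:
        state.clear()
        state.update(saved)              # revert to the old phase
```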
Re: Planning counters in pg_stat_statements (using pgss_store)
On 2020/03/25 22:45, Julien Rouhaud wrote: On Wed, Mar 25, 2020 at 10:09:37PM +0900, Fujii Masao wrote: On 2020/03/21 4:30, Julien Rouhaud wrote: On Fri, Mar 20, 2020 at 05:09:05PM +0300, Sergei Kornilov wrote: Hello Yet another is missed in docs: total_time Oh good catch! I rechecked that field many times, and totally missed that the documentation is referring to the view, which has an additional column, and not the function. Attached v9 fixes that. Thanks for the patch! Here are the review comments from me. - PGSS_V1_3 + PGSS_V1_3, + PGSS_V1_8 WAL usage patch [1] increments this version to 1_4 instead of 1_8. I *guess* that's because previously this version was maintained independently from pg_stat_statements' version. For example, pg_stat_statements 1.4 seems to have used PGSS_V1_3. Oh right. It seems that I changed that many versions ago, I'm not sure why. I'm personally fine with any, but I think this was previously raised and the consensus was to keep distinct counters. Unless you prefer to keep it this way, I'll send an updated version (with other possible modifications depending on the rest of the mail) using PGSS_V1_4. + /* +* We can't process the query if no query_text is provided, as pgss_store +* needs it. We also ignore query without queryid, as it would be treated +* as a utility statement, which may not be the case. +*/ Could you tell me why the planning stats are not tracked when executing utility statements? In some utility statements like REFRESH MATERIALIZED VIEW, the planner would work. I explained that in [1]. The problem is that the underlying statement doesn't get the proper stmt_location and stmt_len, so you eventually end up with two different entries. It's not problematic to have two different entries in that case. Right? The actual problem is that the statements reported in those entries are very similar? For example, when "create table test as select 1;" is executed, it's strange to get the following two entries, as you explained. 
create table test as select 1; create table test as select 1 But it seems valid to get the following two entries in that case? select 1 create table test as select 1 The former is the nested statement and the latter is the top statement. I suggested fixing transformTopLevelStmt() to handle the various DDL that can contain optimisable statements, but everyone preferred to postpone that for a future enhancement. Understood. Thanks for the explanation! +static BufferUsage +compute_buffer_counters(BufferUsage start, BufferUsage stop) +{ + BufferUsage result; BufferUsageAccumDiff() has very similar logic. Isn't it better to expose and use that function rather than creating a new similar function? Oh, I thought this wouldn't be acceptable. That's indeed better so I'll do that instead. Thanks! But of course this is a trivial thing, so it's ok to do that later. values[i++] = Int64GetDatumFast(tmp.rows); values[i++] = Int64GetDatumFast(tmp.shared_blks_hit); values[i++] = Int64GetDatumFast(tmp.shared_blks_read); Previously (without the patch) pg_stat_statements_1_3() reported the buffer usage counters updated only in the execution phase. But, in the patched version, pg_stat_statements_1_3() reports the total of buffer usage counters updated in both planning and execution phases. Is this OK? I'm not sure how seriously we should ensure the backward compatibility for pg_stat_statements That's indeed a behavior change, although the new behavior is probably better as users want to know how many resources a query is consuming overall. We could distinguish all buffers with a plan/exec version, but it seems quite overkill. Ok. +/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */ ISTM it's good timing to have also pg_stat_statements--1.8.sql since the definition of pg_stat_statements() is changed. Thought? 
I thought that since CreateExtension() was modified to be able to find its way automatically, we shouldn't provide the base version anymore, to minimize maintenance burden and also avoid possible bug/discrepancy. The only drawback is that it'll do multiple CREATE or DROP/CREATE of the function usually once per database, which doesn't seem like a big problem. Ok. Here are other comments. - if (jstate) + if (kind == PGSS_JUMBLE) Why is PGSS_JUMBLE necessary? ISTM that we can still use jstate here, instead. If it's ok to remove PGSS_JUMBLE, we can define PGSS_NUM_KIND(=2) instead and replace 2 in, e.g., total_time[2] with PGSS_NUM_KIND. Thought? + total_time + double precision + + +Total time spent planning and executing the statement, in milliseconds + + pg_stat_statements view has this column but the function not. We should make both have the column or not at all, for consistency? I'm not sure if it's a good thing to expose the sum of total_plan_time and total_exec_time as total_time.
Re: plan cache overhead on plpgsql expression
On Thu, Mar 26, 2020 at 4:44 AM Tom Lane wrote: > > Pavel Stehule writes: > > I'll mark this patch as ready for committers. > > Thanks for reviewing! Amit, do you have any thoughts on this? Thanks for picking this up. Test cases added by your patch really show why the plancache and the planner must not be skipped, something I totally failed to grasp. I can't really see any problem with your patch, but mainly due to my unfamiliarity with some of the more complicated things it touches, like the resowner stuff. One thing -- I don't get the division between CachedPlanAllowsSimpleValidityCheck() and CachedPlanIsSimplyValid(). Maybe I am missing something, but could there not be just one function, possibly using whether expr_simple_expr is set to skip or perform, respectively, the checks that the former does? -- Thank you, Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: Negative cost is seen for plan node
On Tue, Mar 24, 2020 at 12:01 PM Kyotaro Horiguchi wrote: > At Mon, 23 Mar 2020 21:13:48 +0800, Richard Guo > wrote in > > Hi all, > > > > With the following statements on latest master (c81bd3b9), I find > > negative cost for plan nodes. > > > > create table a (i int, j int); > > insert into a select i%10, i from generate_series(1,100)i; > > analyze a; > > > > # explain select i from a group by i; > >QUERY PLAN > > - > > HashAggregate (cost=1300.00..-1585.82 rows=102043 width=4) > > Good catch! > > >Group Key: i > >Planned Partitions: 4 > >-> Seq Scan on a (cost=0.00..14425.00 rows=100 width=4) > > (4 rows) > > > > In function cost_agg, when we add the disk costs of hash aggregation > > that spills to disk, nbatches is calculated as 1.18 in this case. It is > > greater than 1, so there will be spill. And the depth is calculated as > > -1 in this case, with num_partitions being 4. I think this is where > > things go wrong. > > The depth is the expected number of iterations of reading the relation. > > > depth = ceil( log(nbatches - 1) / log(num_partitions) ); > Yes, correct. > > I'm not sure what the expression is based on, but apparently it is wrong > for nbatches <= 2.0. It looks like a thinko of something like this. > > depth = ceil( log(nbatches) / log(num_partitions + 1) ); > It seems to me we should use '(nbatches - 1)', without the log function. Maybe I'm wrong. I have sent this issue to the 'Memory-Bounded Hash Aggregation' thread. Thanks Richard
Re: Memory-Bounded Hash Aggregation
Hello, When calculating the disk costs of hash aggregation that spills to disk, there is something wrong with how we determine depth: >depth = ceil( log(nbatches - 1) / log(num_partitions) ); If nbatches is some number between 1.0 and 2.0, we would have a negative depth. As a result, we may have a negative cost for hash aggregation plan node, as described in [1]. I don't think 'log(nbatches - 1)' is what we want here. Should it be just '(nbatches - 1)'? [1] https://www.postgresql.org/message-id/flat/CAMbWs4_maqdBnRR4x01pDpoV-CiQ%2BRvMQaPm4JoTPbA%3DmZmhMw%40mail.gmail.com Thanks Richard On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis wrote: > > Committed. > > There's some future work that would be nice (some of these are just > ideas and may not be worth it): > > * Refactor MemoryContextMemAllocated() to be a part of > MemoryContextStats(), but allow it to avoid walking through the blocks > and freelists. > > * Improve the choice of the initial number of buckets in the hash > table. For this patch, I tried to preserve the existing behavior of > estimating the number of groups and trying to initialize with that many > buckets. But my performance tests seem to indicate this is not the best > approach. More work is needed to find what we should really do here. > > * For workloads that are not in work_mem *or* system memory, and need > to actually go to storage, I see poor CPU utilization because it's not > effectively overlapping CPU and IO work. Perhaps buffering or readahead > changes can improve this, or async IO (even better). > > * Project unnecessary attributes away before spilling tuples to disk. > > * Improve logtape.c API so that the caller doesn't need to manage a > bunch of tape numbers. > > * Improve estimate of the hash entry size. This patch doesn't change > the way the planner estimates it, but I observe that actual size as > seen at runtime is significantly different. This is connected to the > initial number of buckets for the hash table. 
> > * In recursive steps, I don't have a good estimate for the number of > groups, so I just estimate it as the number of tuples in that spill > tape (which is pessimistic). That could be improved by doing a real > cardinality estimate as the tuples are spilling (perhaps with HLL?). > > * Many aggregates with pass-by-ref transition states don't provide a > great aggtransspace. We should consider doing something smarter, like > having negative numbers represent a number that should be multiplied by > the size of the group (e.g. ARRAY_AGG would have a size dependent on > the group size, not a constant). > > * If we want to handle ARRAY_AGG (and the like) well, we can consider > spilling the partial states in the hash table when the memory is full. > That would add a fair amount of complexity because there would be two > types of spilled data (tuples and partial states), but it could be > useful in some cases. > > Regards, > Jeff Davis > > > > >
Re: allow online change primary_conninfo
Hello Thank you! You were ahead of me. I wanted to double-check my changes this morning before posting. > Sergei included LOG messages to indicate which setting has been changed. > I put these in "#if 0" for now, but I'm thinking to remove these > altogether; No objections. > Not sure if we can or should adjust the FATAL line; probably not worth the > trouble. I think it's ok. walreceiver is terminated exactly due to administrator command. regards, Sergei
Re: Resolving the python 2 -> python 3 mess
On 2020-03-26 06:46, Marco Atzeri wrote: On 17.02.2020 at 17:49, Tom Lane wrote: We've had multiple previous discussions of $SUBJECT (eg [1][2]), without any resolution of what to do exactly. Thinking about this some more, I had an idea that I don't think has been discussed. To wit: 1. On platforms where Python 2.x is still supported, recommend that packagers continue to build both plpython2 and plpython3, same as now. Is there some documentation on how to build both? You have to configure and build the sources twice with different PYTHON settings. It depends on your packaging system how to best arrange that. And how to build for multiple 3.x versions? That is not supported. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Berserk Autovacuum (let's save next Mandrill)
On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote: > On Wed, Mar 25, 2020 at 10:26 PM Andres Freund wrote: > > On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote: > > > Since we talked about how scale_factor can be used to effectively disable > > > this > > > new feature, I thought that scale=100 was too small and suggested 1e10 > > > (same as > > > max for vacuum_cleanup_index_scale_factor since 4d54543ef). That should > > > allow > > > handling the case that analyze is disabled, or its threshold is high, or > > > it > > > hasn't run yet, or it's running but hasn't finished, or analyze is > > > triggered at > > > the same time as vacuum. > > > > For disabling we instead should allow -1, and disable the feature if set > > to < 0. > > This patch introduces both a GUC and a reloption. In reloptions we > typically use -1 for "disable reloption, use GUC value instead" > semantics. So it's unclear how we should allow the reloption to both > disable the feature and disable the reloption. I think we don't have a > precedent in the codebase yet. We could allow -2 (disable reloption) > and -1 (disable feature) for the reloption. Opinions? Here is patch v11, where the reloption has the same upper limit 1e10 as the GUC. There is no good reason to have them different. I am reluctant to introduce new semantics like a reloption value of -2 to disable a feature in this patch right before feature freeze. I believe there are enough options to disable insert-only vacuuming for an individual table: - Set the threshold to 2147483647. True, that will not work for very large tables, but I think that there are few tables that insert that many rows before they hit autovacuum_freeze_max_age anyway. - Set the scale factor to some astronomical value. Yours, Laurenz Albe
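The semantics under discussion can be sketched as follows (plain Python, all names hypothetical, not the patch's actual code): a reloption of -1 falls back to the GUC, following the usual reloption convention, and setting the threshold to INT_MAX (or the scale factor to an astronomical value) effectively disables insert-driven vacuums:

```python
INT_MAX = 2**31 - 1  # 2147483647, the threshold ceiling mentioned above

def insert_vacuum_triggered(inserted_tuples, reltuples,
                            reloption_threshold, guc_threshold,
                            scale_factor):
    """Sketch of the trigger condition: vacuum when the number of inserted
    tuples exceeds threshold + scale_factor * reltuples."""
    threshold = (guc_threshold if reloption_threshold < 0
                 else reloption_threshold)
    return inserted_tuples > threshold + scale_factor * reltuples

# A reloption of -1 falls back to the GUC value:
assert insert_vacuum_triggered(1000, 0, -1, 500, 0.0)
# A threshold of INT_MAX disables the feature for realistic insert counts:
assert not insert_vacuum_triggered(10**9, 0, INT_MAX, 500, 0.0)
```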
Re: Collation versions on Windows (help wanted, apply within)
On Wed, Mar 25, 2020 at 4:18 AM Thomas Munro wrote: > > Thanks! Pushed. > Great! > From the things we learned in this thread, I think there is an open > item for someone to write a patch to call EnumSystemLocalesEx() and > populate the initial set of collations, where we use "locale -a" on > Unix. I'm not sure where the encoding is supposed to come from > though, which is why I didn't try to write a patch myself. > I will take a look at this when the current commitfest is over. Regards, Juan José Santamaría Flecha
Re: Improve handling of parameter differences in physical replication
On Thu, 12 Mar 2020 at 04:34, Peter Eisentraut wrote: > > Here is an updated patch that incorporates some of the suggestions. In > particular, some of the warning messages have been rephrased to be more > accurate (but also less specific), the warning message at recovery pause > repeats every 1 minute, and the documentation has been updated. > Thank you for updating the patch. I have one comment on the latest version patch: + do + { + TimestampTz now = GetCurrentTimestamp(); + + if (TimestampDifferenceExceeds(last_warning, now, 6)) + { + ereport(WARNING, + (errmsg("recovery paused because of insufficient parameter settings"), +errdetail("See earlier in the log about which settings are insufficient."), +errhint("Recovery cannot continue unless the configuration is changed and the server restarted."))); + last_warning = now; + } + + pg_usleep(100L);/* 1000 ms */ + HandleStartupProcInterrupts(); + } I think we can set the wait event WAIT_EVENT_RECOVERY_PAUSE here. The others look good to me. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
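The repeat-at-most-once-per-interval logic in the quoted loop can be sketched as a pure function (plain Python; the 60-second interval follows the "repeats every 1 minute" behavior described above, not the quoted constant):

```python
WARNING_INTERVAL_SECS = 60.0  # "repeats every 1 minute", per the thread

def should_warn(last_warning, now, interval=WARNING_INTERVAL_SECS):
    """Return (warn_now, new_last_warning): emit the 'recovery paused'
    warning at most once per interval, like TimestampDifferenceExceeds()
    guarding the ereport(WARNING) above."""
    if now - last_warning >= interval:
        return True, now
    return False, last_warning
```

The startup process would call this on each iteration of its sleep loop, warning only when the interval has elapsed and otherwise just sleeping and handling interrupts.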
Re: Conflict handling for COPY FROM
Hi Takanori Asaba, > > > Although it is a small point, it may be better like this: > +70005 27 36 46 56 -> 70005 27 37 47 57 > > done > I want to discuss COPY FROM with a binary file. > It seems that this patch tries to avoid the error where the number of fields is > different. > > + { > + if (cstate->error_limit > 0 || > cstate->ignore_all_error) > + { > + ereport(WARNING, > + > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > +errmsg("skipping \"%s\" > --- row field count is %d, expected %d", > + > cstate->line_buf.data, (int) fld_count, attr_count))); > + cstate->error_limit--; > + goto next_line; > + } > + else > + ereport(ERROR, > + > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > +errmsg("row field count > is %d, expected %d", > + (int) > fld_count, attr_count))); > + > + } > > I checked like this: > > postgres=# CREATE TABLE x ( > postgres(# a serial UNIQUE, > postgres(# b int, > postgres(# c text not null default 'stuff', > postgres(# d text, > postgres(# e text > postgres(# ); > CREATE TABLE > postgres=# COPY x from stdin; > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself, or an EOF signal. > >> 70004 25 35 45 55 > >> 70005 26 36 46 56 > >> \. 
> COPY 2 > postgres=# SELECT * FROM x; >a | b | c | d | e > ---++++ > 70004 | 25 | 35 | 45 | 55 > 70005 | 26 | 36 | 46 | 56 > (2 rows) > > postgres=# COPY x TO '/tmp/copyout' (FORMAT binary); > COPY 2 > postgres=# CREATE TABLE y ( > postgres(# a serial UNIQUE, > postgres(# b int, > postgres(# c text not null default 'stuff', > postgres(# d text > postgres(# ); > CREATE TABLE > postgres=# COPY y FROM '/tmp/copyout' WITH (FORMAT binary,ERROR_LIMIT -1); > 2020-03-12 16:55:55.457 JST [2319] WARNING: skipping "" --- row field > count is 5, expected 4 > 2020-03-12 16:55:55.457 JST [2319] CONTEXT: COPY y, line 1 > 2020-03-12 16:55:55.457 JST [2319] WARNING: skipping "" --- row field > count is 0, expected 4 > 2020-03-12 16:55:55.457 JST [2319] CONTEXT: COPY y, line 2 > 2020-03-12 16:55:55.457 JST [2319] ERROR: unexpected EOF in COPY data > 2020-03-12 16:55:55.457 JST [2319] CONTEXT: COPY y, line 3, column a > 2020-03-12 16:55:55.457 JST [2319] STATEMENT: COPY y FROM '/tmp/copyout' > WITH (FORMAT binary,ERROR_LIMIT -1); > WARNING: skipping "" --- row field count is 5, expected 4 > WARNING: skipping "" --- row field count is 0, expected 4 > ERROR: unexpected EOF in COPY data > CONTEXT: COPY y, line 3, column a > > It seems that the error isn't handled. > 'WARNING: skipping "" --- row field count is 5, expected 4' is correct, > but 'WARNING: skipping "" --- row field count is 0, expected 4' is not > correct. > > Thank you for the detailed example > Also, is it necessary to skip the error that happens when the input is a binary > file? > Does the case occur where each row has a different number of fields and only > specific rows are copied? > > An error that can surely be handled without a transaction rollback can be included in error handling, but I would like to proceed without binary-file error handling for the time being regards Surafel
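For reference, the fld_count being checked in the quoted code is the signed 16-bit big-endian field count that starts every tuple in the COPY BINARY format, with a count of -1 marking the end-of-data trailer; a minimal decoding sketch (plain Python, not the patch's C code):

```python
import struct

def read_field_count(buf, offset=0):
    """Read the signed 16-bit big-endian per-tuple field count of the
    COPY BINARY format; -1 is the file trailer, any other value is the
    number of fields in the tuple that follows."""
    (fld_count,) = struct.unpack_from(">h", buf, offset)
    return fld_count

assert read_field_count(b"\x00\x05") == 5    # a tuple claiming 5 fields
assert read_field_count(b"\xff\xff") == -1   # end-of-data marker
```

This is why a mismatched count is detectable per tuple, but why skipping a bad tuple is hard: without trusting the count, the reader cannot know where the next tuple header begins, which is consistent with the "unexpected EOF" seen once the stream position goes wrong.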
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On Tue, Mar 24, 2020 at 12:22:16PM +0900, Michael Paquier wrote: > Thanks Alvaro and Alexander. 0001 has been applied as of e09ad07. > Now for 0002, let's see about it later. Attached is a rebased > version, with no actual changes. I was looking at this patch again today and I am rather fine with the existing semantics. Still, I don't much like naming the frontend-side routine FrontendRestoreArchivedFile() and using a different name than the backend counterpart, because we have to include xlog_internal.h in fe_archive.c to be able to grab XLOGDIR. So here is an idea: let's move the declarations of the routines that are part of xlogarchive.c to a new header, called xlogarchive.h, and then let's use the same routine name for the frontend and the backend in this second patch. We include xlog_internal.h already in many frontend tools, so that would clean up things a bit. Two extra things are the routines for the checkpointer as well as the variables like ArchiveRecoveryRequested. It may be nice to move those while on it, but I am not sure where, and that's not actually required for this patch set, so that could be addressed later if need be. Any thoughts? -- Michael
Re: A rather hackish POC for alternative implementation of WITH TIES
On Wed, Jan 22, 2020 at 3:35 PM Andrew Gierth wrote: > > "Alvaro" == Alvaro Herrera writes: > > > I was largely holding off on doing further work hoping for some > discussion of which way we should go. If you think my approach is worth > pursuing (I haven't seriously tested the performance, but I'd expect it > to be slower than Surafel's - the price you pay for flexibility) then I > can look at it further, but figuring out the planner stuff will take > some time. > > Another alternative can be pushing the existing implementation, which would be open to change in case a better-finished implementation arrives. regards Surafel