Hi Asif,

My colleague Kashif Zeeshan reported an issue off-list. I am posting it here; please take a look.

When two parallel backups are run at the same time, the workers get a FATAL error because max_wal_senders is exceeded. Instead of exiting, pg_basebackup reports the backup as completed, and starting a server from the resulting backup directory then fails.

[edb@localhost bin]$ ./pgbench -i -s 200 -h localhost -p 5432 postgres
[edb@localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C2000270 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57849"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (3) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (4) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (5) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (6) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (7) created
pg_basebackup: write-ahead log end point: 0/C3000050
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed

[edb@localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup1/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C20001C0 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57848"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (3) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (4) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (5) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (6) created
pg_basebackup: error: could not connect to server: FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (7) created
pg_basebackup: write-ahead log end point: 0/C2000348
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed

[edb@localhost bin]$ ./pg_ctl -D /home/edb/Desktop/backup1/ -o "-p 5438" start
pg_ctl: directory "/home/edb/Desktop/backup1" is not a database cluster directory

Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 30, 2020 at 6:28 PM Ahsan Hadi <ahsan.h...@gmail.com> wrote:

>
>
> On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Thanks Asif,
>>
>> I have re-verified the reported issues. Except for standby backup, the others are fixed.
>>
>
> Yes, as Asif mentioned, he is working on the standby issue and on adding
> bandwidth throttling functionality to parallel backup.
>
> It would be good to get some feedback from Robert on Asif's previous email
> about the design considerations for standby server support and throttling. I
> believe all the other points mentioned by Robert in this thread have been
> addressed by Asif, so it would be good to hear about any other concerns that
> are not addressed.
>
> Thanks,
>
> -- Ahsan
>
>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>>
>>
>> On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr.reh...@gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>
>>>> Hi Asif,
>>>>
>>>> While testing further I observed that parallel backup is not able to take
>>>> a backup of a standby server.
>>>>
>>>> mkdir /tmp/archive_dir
>>>> echo "archive_mode='on'" >> data/postgresql.conf
>>>> echo "archive_command='cp %p /tmp/archive_dir/%f'" >> data/postgresql.conf
>>>>
>>>> ./pg_ctl -D data -l logs start
>>>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>>>
>>>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'" >> /tmp/slave/postgresql.conf
>>>> echo "restore_command='cp /tmp/archive_dir/%f %p'" >> /tmp/slave/postgresql.conf
>>>> echo "promote_trigger_file='/tmp/failover.log'" >> /tmp/slave/postgresql.conf
>>>>
>>>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>>>
>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "select pg_is_in_recovery();"
>>>>  pg_is_in_recovery
>>>> -------------------
>>>>  f
>>>> (1 row)
>>>>
>>>> [edb@localhost bin]$ ./psql postgres -p 5433 -c "select pg_is_in_recovery();"
>>>>  pg_is_in_recovery
>>>> -------------------
>>>>  t
>>>> (1 row)
>>>>
>>>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>>>> pg_basebackup: error: could not list backup files: ERROR: the standby was promoted during online backup
>>>> HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
>>>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>>>
>>>> #same is working fine without parallel backup
>>>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>>>> [edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>>>> /tmp/bkp_s/PG_VERSION
>>>>
>>>> Thanks & Regards,
>>>> Rajkumar Raghuwanshi
>>>>
>>>>
>>>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>> In another scenario, the backup data is corrupted for a tablespace. Again, this
>>>>> is not reproducible every time, but if I run the same set of commands I get
>>>>> the same error.
>>>>>
>>>>> [edb@localhost bin]$ ./pg_ctl -D data -l logfile start
>>>>> waiting for server to start.... done
>>>>> server started
>>>>> [edb@localhost bin]$
>>>>> [edb@localhost bin]$ mkdir /tmp/tblsp
>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp location '/tmp/tblsp';"
>>>>> CREATE TABLESPACE
>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb tablespace tblsp;"
>>>>> CREATE DATABASE
>>>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a text);"
>>>>> CREATE TABLE
>>>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values ('parallel_backup with tablespace');"
>>>>> INSERT 0 1
>>>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>>>> [edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555" start
>>>>> waiting for server to start.... done
>>>>> server started
>>>>> [edb@localhost bin]$ ./psql postgres -p 5555 -c "select * from pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>>>   oid  |  spcname   | spcowner | spcacl | spcoptions
>>>>> -------+------------+----------+--------+------------
>>>>>   1663 | pg_default |       10 |        |
>>>>>  16384 | tblsp      |       10 |        |
>>>>> (2 rows)
>>>>>
>>>>> [edb@localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>>>>> psql: error: could not connect to server: FATAL: "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is missing.
>>>>> [edb@localhost bin]$
>>>>> [edb@localhost bin]$ ls data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>> [edb@localhost bin]$ ls /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>> ls: cannot access /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or directory
>>>>>
>>>>>
>>>>> Thanks & Regards,
>>>>> Rajkumar Raghuwanshi
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>>>
>>>>>> Hi Asif,
>>>>>>
>>>>>> On testing further, I found that pg_basebackup crashed when taking a
>>>>>> backup with -R. This crash is not consistently reproducible.
>>>>>>
>>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text);"
>>>>>> CREATE TABLE
>>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values ('parallel_backup with -R recovery-conf');"
>>>>>> INSERT 0 1
>>>>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
>>>>>> Segmentation fault (core dumped)
>>>>>>
>>>>>> The stack trace looks the same as in the earlier reported crash with
>>>>>> tablespace.
>>>>>> --stack trace
>>>>>> [edb@localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R'.
>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
>>>>>> 3175 backupinfo->curr = fetchfile->next;
>>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>>>> (gdb) bt
>>>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
>>>>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at pg_basebackup.c:2715
>>>>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at pthread_create.c:301
>>>>>> #3 0x00000039212e8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>>>> (gdb)
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Rajkumar Raghuwanshi
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:
>>>>>>
>>>>>>> Hi Asif,
>>>>>>>
>>>>>>>
>>>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>>>>>>> patch to the latest master (b7f64c64).
>>>>>>>> (V9 of the patches are attached).
>>>>>>>>
>>>>>>>
>>>>>>> I had a further review of the patches and here are my few observations:
>>>>>>>
>>>>>>> 1.
>>>>>>> +/*
>>>>>>> + * stop_backup() - ends an online backup
>>>>>>> + *
>>>>>>> + * The function is called at the end of an online backup. It sends out pg_control
>>>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>>>> + */
>>>>>>>
>>>>>>> Comments seem out-dated.
>>>>>>>
>>>
>>> Fixed.
>>>
>>>
>>>>>>> 2. With parallel jobs, maxrate is now not supported. Since we are now
>>>>>>> asking for data in multiple threads, throttling seems important here.
>>>>>>> Can you please explain why you have disabled that?
>>>>>>>
>>>>>>> 3. As we are always fetching a single file, and as Robert suggested,
>>>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>>>
>>>
>>> Yes, we are fetching a single file. However, SEND_FILES is still capable
>>> of fetching multiple files in one go, that's why the name.
>>>
>>>
>>>>>>> 4. Does this work on Windows? I mean, does pthread_create() work on
>>>>>>> Windows? I ask because I see that pgbench has its own implementation of
>>>>>>> pthread_create() for WIN32 but this patch doesn't.
>>>>>>>
>>>
>>> The patch is updated to add support for the Windows platform.
>>>
>>>
>>>>>>> 5. Typos:
>>>>>>> tablspace => tablespace
>>>>>>> safly => safely
>>>>>>>
>>>
>>> Done.
>>>
>>>
>>>>>>> 6. parallel_backup_run() needs some comments explaining the PB_*
>>>>>>> states it goes through.
>>>>>>>
>>>>>>> 7.
>>>>>>> +            case PB_FETCH_REL_FILES:    /* fetch files from server */
>>>>>>> +                if (backupinfo->activeworkers == 0)
>>>>>>> +                {
>>>>>>> +                    backupinfo->backupstate = PB_STOP_BACKUP;
>>>>>>> +                    free_filelist(backupinfo);
>>>>>>> +                }
>>>>>>> +                break;
>>>>>>> +            case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
>>>>>>> +                if (backupinfo->activeworkers == 0)
>>>>>>> +                {
>>>>>>> +                    backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>>>> +                }
>>>>>>> +                break;
>>>>>>>
>>>
>>> Done.
>>>
>>>
>>>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>>>
>>>
>>> Done.
>>>
>>> The corrupted tablespace and crash, reported by Rajkumar, have been
>>> fixed. A pointer variable remained uninitialized, which in turn caused the
>>> system to misbehave.
>>>
>>> Attached is the updated set of patches. AFAIK, to complete the parallel
>>> backup feature set, there remain three sub-features:
>>>
>>> 1- parallel backup does not work with a standby server. In parallel
>>> backup, the server
>>> spawns multiple processes and there is no shared state being maintained.
>>> So currently, there is no way to tell multiple processes if the standby
>>> was promoted during the backup since START_BACKUP was called.
>>>
>>> 2- throttling. Robert previously suggested that we implement
>>> throttling on the client-side. However, I found a previous discussion
>>> where it was advocated to be added to the backend instead[1].
>>>
>>> So, it was better to have a consensus before moving the throttle
>>> function to the client. That’s why for the time being I have disabled it
>>> and have asked for suggestions on it to move forward.
>>>
>>> It seems to me that we have to maintain a shared state in order to
>>> support taking a backup from a standby. Also, there is a new feature
>>> recently committed for backup progress reporting in the backend
>>> (pg_stat_progress_basebackup). This functionality was recently added via
>>> this commit ID: e65497df. For parallel backup to update these stats, a
>>> shared state will be required.
>>>
>>> Since multiple pg_basebackup can be running at the same time,
>>> maintaining a shared state can become a little complex, unless we
>>> disallow taking multiple parallel backups.
>>>
>>> So proceeding on with this patch, I will be working on:
>>> - throttling to be implemented on the client-side.
>>> - adding a shared state to handle backup from the standby.
>>>
>>>
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>>>
>>>
>>> --
>>> Asif Rehman
>>> Highgo Software (Canada/China/Pakistan)
>>> URL : www.highgo.ca
>>>
>>>
>
> --
> Highgo Software (Canada/China/Pakistan)
> URL : http://www.highgo.ca
> ADDR: 10318 WHALLEY BLVD, Surrey, BC
> EMAIL: mailto: ahsan.h...@highgo.ca
>
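
On the "throttling to be implemented on the client-side" item quoted above: below is a minimal sketch of how a single rate limit could be shared by several worker threads. It is not taken from the patch and is only an assumption about one possible approach. The names ClientThrottle, throttle_delay_us() and throttle_account() are hypothetical, as is the idea that each worker reports the bytes it received together with the time elapsed since the backup started.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical throttle state, one instance shared by all backup workers. */
typedef struct ClientThrottle
{
    uint64_t         maxrate;      /* allowed bytes per second, all workers combined */
    _Atomic uint64_t bytes_total;  /* bytes received so far by all workers */
} ClientThrottle;

/*
 * Return how many microseconds the caller should sleep so that receiving
 * bytes_total bytes takes at least bytes_total / maxrate seconds overall.
 * Returns 0 when the transfer is already at or below the allowed rate.
 */
uint64_t
throttle_delay_us(uint64_t bytes_total, uint64_t elapsed_us, uint64_t maxrate)
{
    uint64_t min_elapsed_us;

    assert(maxrate > 0);

    /* Split into whole seconds and remainder to avoid overflow on large totals. */
    min_elapsed_us = (bytes_total / maxrate) * UINT64_C(1000000)
        + (bytes_total % maxrate) * UINT64_C(1000000) / maxrate;

    return (min_elapsed_us > elapsed_us) ? min_elapsed_us - elapsed_us : 0;
}

/*
 * Called by a worker after it has received nbytes. elapsed_us is the time
 * since the backup started. The counter is updated atomically, so workers
 * do not need a lock; the return value is the suggested sleep time.
 */
uint64_t
throttle_account(ClientThrottle *t, uint64_t nbytes, uint64_t elapsed_us)
{
    uint64_t total = atomic_fetch_add(&t->bytes_total, nbytes) + nbytes;

    return throttle_delay_us(total, elapsed_us, t->maxrate);
}
```

In this sketch a worker would call throttle_account() after each chunk it receives and sleep for the returned number of microseconds before asking for more data. Because every worker adds to the same counter, the limit applies to the backup as a whole rather than per connection. Whether this is preferable to throttling in the backend is exactly the open question in the thread.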