Hi Asif,

The backup failed with the error "error: could not connect to server: could not look up local user ID 1000: Too many open files" when max_wal_senders was set to 2000. The errors started from backup worker (1017). Please note that the backup directory was also not cleaned up after the backup failed.
Steps
=======
1) Generate data in the DB
./pgbench -i -s 600 -h localhost -p 5432 postgres

2) Set max_wal_senders = 2000 in postgresql.conf.

3) Generate the backup
[edb@localhost bin]$ ./pg_basebackup -v -j 1990 -D /home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 1/F1000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_58692"
pg_basebackup: backup worker (0) created
...
pg_basebackup: backup worker (1017) created
pg_basebackup: error: could not connect to server: could not look up local user ID 1000: Too many open files
pg_basebackup: backup worker (1018) created
pg_basebackup: error: could not connect to server: could not look up local user ID 1000: Too many open files
...
pg_basebackup: error: could not connect to server: could not look up local user ID 1000: Too many open files
pg_basebackup: backup worker (1989) created
pg_basebackup: error: could not create file "/home/edb/Desktop/backup//global/4183": Too many open files
pg_basebackup: error: could not create file "/home/edb/Desktop/backup//global/3592": Too many open files
pg_basebackup: error: could not create file "/home/edb/Desktop/backup//global/4177": Too many open files
[edb@localhost bin]$

4) The backup directory is not cleaned up
[edb@localhost bin]$ ls /home/edb/Desktop/backup
base    pg_commit_ts  pg_logical    pg_notify    pg_serial     pg_stat      pg_subtrans  pg_twophase  pg_xact
global  pg_dynshmem   pg_multixact  pg_replslot  pg_snapshots  pg_stat_tmp  pg_tblspc    pg_wal
[edb@localhost bin]$

Kashif Zeeshan
EnterpriseDB

On Thu, Apr 2, 2020 at 2:58 PM Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> Hi Asif,
>
> My colleague Kashif Zeeshan reported an issue off-list, posting here,
> please take a look.
>
> When executing two backups at the same time, I am getting a FATAL error
> due to max_wal_senders, and instead of exiting, the backup completes.
> And when trying to start the server from the backup cluster, I get an
> error.
>
> [edb@localhost bin]$ ./pgbench -i -s 200 -h localhost -p 5432 postgres
> [edb@localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup/
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/C2000270 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: created temporary replication slot "pg_basebackup_57849"
> pg_basebackup: backup worker (0) created
> pg_basebackup: backup worker (1) created
> pg_basebackup: backup worker (2) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (3) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (4) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (5) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (6) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (7) created
> pg_basebackup: write-ahead log end point: 0/C3000050
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: syncing data to disk ...
> pg_basebackup: base backup completed
> [edb@localhost bin]$ ./pg_basebackup -v -j 8 -D /home/edb/Desktop/backup1/
> pg_basebackup: initiating base backup, waiting for checkpoint to complete
> pg_basebackup: checkpoint completed
> pg_basebackup: write-ahead log start point: 0/C20001C0 on timeline 1
> pg_basebackup: starting background WAL receiver
> pg_basebackup: created temporary replication slot "pg_basebackup_57848"
> pg_basebackup: backup worker (0) created
> pg_basebackup: backup worker (1) created
> pg_basebackup: backup worker (2) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (3) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (4) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (5) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (6) created
> pg_basebackup: error: could not connect to server: FATAL: number of
> requested standby connections exceeds max_wal_senders (currently 10)
> pg_basebackup: backup worker (7) created
> pg_basebackup: write-ahead log end point: 0/C2000348
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: syncing data to disk ...
> pg_basebackup: base backup completed
>
> [edb@localhost bin]$ ./pg_ctl -D /home/edb/Desktop/backup1/ -o "-p 5438" start
> pg_ctl: directory "/home/edb/Desktop/backup1" is not a database cluster
> directory
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
> On Mon, Mar 30, 2020 at 6:28 PM Ahsan Hadi <ahsan.h...@gmail.com> wrote:
>
>> On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <
>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>
>>> Thanks Asif,
>>>
>>> I have re-verified the reported issues. Except for the standby backup,
>>> the others are fixed.
>>>
>> Yes, as Asif mentioned, he is working on the standby issue and adding
>> bandwidth throttling functionality to parallel backup.
>>
>> It would be good to get some feedback from Robert on Asif's previous
>> email about the design considerations for standby server support and
>> throttling. I believe all the other points mentioned by Robert in this
>> thread are addressed by Asif, so it would be good to hear about any
>> other concerns that are not addressed.
>>
>> Thanks,
>>
>> -- Ahsan
>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>> On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr.reh...@gmail.com>
>>> wrote:
>>>
>>>> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
>>>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>> While testing further, I observed that parallel backup is not able
>>>>> to take a backup of a standby server.
>>>>>
>>>>> mkdir /tmp/archive_dir
>>>>> echo "archive_mode='on'">> data/postgresql.conf
>>>>> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>>>>>
>>>>> ./pg_ctl -D data -l logs start
>>>>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>>>>
>>>>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">> /tmp/slave/postgresql.conf
>>>>> echo "restore_command='cp /tmp/archive_dir/%f %p'">> /tmp/slave/postgresql.conf
>>>>> echo "promote_trigger_file='/tmp/failover.log'">> /tmp/slave/postgresql.conf
>>>>>
>>>>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>>>>
>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "select pg_is_in_recovery();"
>>>>>  pg_is_in_recovery
>>>>> -------------------
>>>>>  f
>>>>> (1 row)
>>>>>
>>>>> [edb@localhost bin]$ ./psql postgres -p 5433 -c "select pg_is_in_recovery();"
>>>>>  pg_is_in_recovery
>>>>> -------------------
>>>>>  t
>>>>> (1 row)
>>>>>
>>>>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>>>>> pg_basebackup: error: could not list backup files: ERROR: the standby
>>>>> was promoted during online backup
>>>>> HINT: This means that the backup being taken is corrupt and should
>>>>> not be used. Try taking another online backup.
>>>>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>>>>
>>>>> # the same is working fine without parallel backup
>>>>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>>>>> [edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>>>>> /tmp/bkp_s/PG_VERSION
>>>>>
>>>>> Thanks & Regards,
>>>>> Rajkumar Raghuwanshi
>>>>>
>>>>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>>>>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>>>
>>>>>> Hi Asif,
>>>>>>
>>>>>> In another scenario, the backup data is corrupted for a tablespace.
>>>>>> Again, this is not reproducible every time, but if I run the same
>>>>>> set of commands I get the same error.
>>>>>>
>>>>>> [edb@localhost bin]$ ./pg_ctl -D data -l logfile start
>>>>>> waiting for server to start.... done
>>>>>> server started
>>>>>> [edb@localhost bin]$ mkdir /tmp/tblsp
>>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp location '/tmp/tblsp';"
>>>>>> CREATE TABLESPACE
>>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb tablespace tblsp;"
>>>>>> CREATE DATABASE
>>>>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a text);"
>>>>>> CREATE TABLE
>>>>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values ('parallel_backup with tablespace');"
>>>>>> INSERT 0 1
>>>>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>>>>> [edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555" start
>>>>>> waiting for server to start.... done
>>>>>> server started
>>>>>> [edb@localhost bin]$ ./psql postgres -p 5555 -c "select * from pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>>>>   oid  |  spcname   | spcowner | spcacl | spcoptions
>>>>>> -------+------------+----------+--------+------------
>>>>>>   1663 | pg_default |       10 |        |
>>>>>>  16384 | tblsp      |       10 |        |
>>>>>> (2 rows)
>>>>>>
>>>>>> [edb@localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>>>>>> psql: error: could not connect to server: FATAL:
>>>>>> "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>>>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is
>>>>>> missing.
>>>>>> [edb@localhost bin]$ ls data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>>> [edb@localhost bin]$ ls /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>>>>> ls: cannot access /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION:
>>>>>> No such file or directory
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Rajkumar Raghuwanshi
>>>>>>
>>>>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>>>>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>>>>
>>>>>>> Hi Asif,
>>>>>>>
>>>>>>> On testing further, I found that when taking a backup with -R,
>>>>>>> pg_basebackup crashed. This crash is not consistently reproducible.
>>>>>>>
>>>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text);"
>>>>>>> CREATE TABLE
>>>>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values ('parallel_backup with -R recovery-conf');"
>>>>>>> INSERT 0 1
>>>>>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
>>>>>>> Segmentation fault (core dumped)
>>>>>>>
>>>>>>> The stack trace looks the same as the one for the earlier reported
>>>>>>> crash with tablespace.
>>>>>>> --stack trace
>>>>>>> [edb@localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R'.
>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>> #0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>>>> pg_basebackup.c:3175
>>>>>>> 3175        backupinfo->curr = fetchfile->next;
>>>>>>> Missing separate debuginfos, use: debuginfo-install
>>>>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>>>>> (gdb) bt
>>>>>>> #0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at
>>>>>>> pg_basebackup.c:3175
>>>>>>> #1  0x0000000000408a9e in worker_run (arg=0xc1e458) at
>>>>>>> pg_basebackup.c:2715
>>>>>>> #2  0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at
>>>>>>> pthread_create.c:301
>>>>>>> #3  0x00000039212e8c4d in clone () at
>>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>>>>> (gdb)
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Rajkumar Raghuwanshi
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>>>>>> jeevan.cha...@enterprisedb.com> wrote:
>>>>>>>
>>>>>>>> Hi Asif,
>>>>>>>>
>>>>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased
>>>>>>>>> the patch to the latest master (b7f64c64).
>>>>>>>>> (V9 of the patches are attached.)
>>>>>>>>
>>>>>>>> I had a further review of the patches and here are a few
>>>>>>>> observations:
>>>>>>>>
>>>>>>>> 1.
>>>>>>>> +/*
>>>>>>>> + * stop_backup() - ends an online backup
>>>>>>>> + *
>>>>>>>> + * The function is called at the end of an online backup. It sends
>>>>>>>> out pg_control
>>>>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>>>>> + */
>>>>>>>>
>>>>>>>> Comments seem outdated.
>>>>>>>
>>>> Fixed.
>>>>
>>>>>>>> 2. With parallel jobs, maxrate is now not supported. Since we are
>>>>>>>> now asking for data in multiple threads, throttling seems important
>>>>>>>> here. Can you please explain why you have disabled it?
>>>>>>>>
>>>>>>>> 3.
>>>>>>>> As we are always fetching a single file, and as Robert suggested,
>>>>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>>>
>>>> Yes, we are fetching a single file. However, SEND_FILES is still
>>>> capable of fetching multiple files in one go; that's why the name.
>>>>
>>>>>>>> 4. Does this work on Windows? I mean, does pthread_create() work
>>>>>>>> on Windows? I ask this as I see that pgbench has its own
>>>>>>>> implementation of pthread_create() for WIN32 but this patch doesn't.
>>>>>>>
>>>> The patch is updated to add support for the Windows platform.
>>>>
>>>>>>>> 5. Typos:
>>>>>>>> tablspace => tablespace
>>>>>>>> safly => safely
>>>>>>>
>>>> Done.
>>>>
>>>>>>>> 6. parallel_backup_run() needs some comments explaining the PB_*
>>>>>>>> states it goes through.
>>>>>>>>
>>>>>>>> 7.
>>>>>>>> +        case PB_FETCH_REL_FILES:    /* fetch files from server */
>>>>>>>> +            if (backupinfo->activeworkers == 0)
>>>>>>>> +            {
>>>>>>>> +                backupinfo->backupstate = PB_STOP_BACKUP;
>>>>>>>> +                free_filelist(backupinfo);
>>>>>>>> +            }
>>>>>>>> +            break;
>>>>>>>> +        case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
>>>>>>>> +            if (backupinfo->activeworkers == 0)
>>>>>>>> +            {
>>>>>>>> +                backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>>>>> +            }
>>>>>>>> +            break;
>>>>>>>
>>>> Done.
>>>>
>>>>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>>>
>>>> Done.
>>>>
>>>> The corrupted tablespace and crash reported by Rajkumar have been
>>>> fixed. A pointer variable remained uninitialized, which in turn caused
>>>> the system to misbehave.
>>>>
>>>> Attached is the updated set of patches. AFAIK, to complete the
>>>> parallel backup feature set, three sub-features remain:
>>>>
>>>> 1- Parallel backup does not work with a standby server.
>>>> In parallel backup, the server spawns multiple processes and there is
>>>> no shared state being maintained. So currently there is no way to tell
>>>> the multiple processes whether the standby was promoted during the
>>>> backup, i.e. since START_BACKUP was called.
>>>>
>>>> 2- Throttling. Robert previously suggested that we implement
>>>> throttling on the client side. However, I found a previous discussion
>>>> where it was advocated to be added to the backend instead [1].
>>>>
>>>> So, it was better to have a consensus before moving the throttle
>>>> function to the client. That's why for the time being I have disabled
>>>> it and have asked for suggestions on how to move forward.
>>>>
>>>> It seems to me that we have to maintain a shared state in order to
>>>> support taking a backup from a standby. Also, there is a new feature
>>>> recently committed for backup progress reporting in the backend
>>>> (pg_stat_progress_basebackup), added via commit e65497df. For parallel
>>>> backup to update these stats, a shared state will be required.
>>>>
>>>> Since multiple pg_basebackup runs can be active at the same time,
>>>> maintaining a shared state can become a little complex, unless we
>>>> disallow taking multiple parallel backups.
>>>>
>>>> So, proceeding with this patch, I will be working on:
>>>> - throttling, to be implemented on the client side.
>>>> - adding a shared state to handle backup from the standby.
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>>>>
>>>> --
>>>> Asif Rehman
>>>> Highgo Software (Canada/China/Pakistan)
>>>> URL : www.highgo.ca
>>>>
>>
>> --
>> Highgo Software (Canada/China/Pakistan)
>> URL : http://www.highgo.ca
>> ADDR: 10318 WHALLEY BLVD, Surrey, BC
>> EMAIL: mailto: ahsan.h...@highgo.ca
>>

--
Regards
====================================
Kashif Zeeshan
Lead Quality Assurance Engineer / Manager
EnterpriseDB Corporation
The Enterprise Postgres Company