my-ship-it commented on issue #1648: URL: https://github.com/apache/cloudberry/issues/1648#issuecomment-4159303366
Hi, thanks for reporting this issue.

Based on the logs you shared, the key clue is the `Connection refused` error on `sky-cbseg04:50002`. This means a primary segment went down while `pg_basebackup` was still streaming data from it. When `pg_basebackup` loses its connection mid-transfer, it automatically removes the partially-copied data directory to avoid leaving behind an incomplete/corrupt copy. That's the "removing data directory" behavior you're seeing.

With 16 primary segments (~1TB each based on the progress output), running all `pg_basebackup` processes simultaneously puts enormous pressure on the system: heavy I/O, WAL generation, and network traffic across all hosts at once. This likely caused one or more primary segments to crash, which then cascaded into failures for the mirrors being built from them.

**Suggested troubleshooting steps:**

1. **Check primary segment logs** on `sky-cbseg04` (port 50002) for crash/OOM/restart entries around the time of the failure.
2. **Check `dmesg`** on all segment hosts for OOM killer activity.
3. **Verify `max_wal_senders`**: run `gpconfig -s max_wal_senders` and ensure the value is sufficient (at least 2 per segment).
4. **Check disk space**: during `pg_basebackup`, WAL files accumulate and can fill up disks.

**Recommended workaround:**

Reduce the parallelism to avoid overwhelming the cluster. Instead of recovering all segments at once, try:

```bash
gprecoverseg -B 4 -b 2
```

- `-B 4`: only process 4 hosts in parallel at the coordinator level (default is 16)
- `-b 2`: only run 2 `pg_basebackup` processes per host (default is 64)

This will take longer but should prevent the resource exhaustion that causes primaries to crash mid-backup.

For `gpaddmirrors`, the same flags (`-B` and `-b`) are available to control parallelism.

Hope this helps! Let us know what you find in the primary segment logs.
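
If it helps, here is a rough sketch of how the host-level checks (steps 2 and 4 above) could be scripted on each segment host. The helper names (`oom_lines`, `full_mounts`, `run_host_checks`), the grep patterns, and the 90% threshold are illustrative assumptions, not part of any Cloudberry tooling. The patterns are only a guess at typical Linux OOM-killer messages, so adjust them to what your kernel actually logs.

```bash
# Hypothetical helpers for checking a segment host; adapt before use.

# Keep only lines that look like OOM-killer activity (reads log text on stdin).
oom_lines() {
  grep -i -E 'out of memory|oom-killer|killed process' || true
}

# From `df -P` output on stdin, print mount points whose usage is at or
# above the given percentage (default 90, an arbitrary example threshold).
full_mounts() {
  awk -v t="${1:-90}" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= t + 0) print $6 " " $5
  }'
}

# Run the live checks on the current host. dmesg may require root.
run_host_checks() {
  dmesg | oom_lines
  df -P | full_mounts 90
}

# Step 3 is run once against the cluster rather than per host:
#   gpconfig -s max_wal_senders
```

After defining the functions, call `run_host_checks` on each segment host. Empty output means neither check found anything with these particular patterns and threshold, which does not by itself rule out memory or disk pressure.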
