my-ship-it commented on issue #1648:
URL: https://github.com/apache/cloudberry/issues/1648#issuecomment-4159303366

   Hi, thanks for reporting this issue.
   
   Based on the logs you shared, the key clue is the `Connection refused` error 
on `sky-cbseg04:50002`. Nothing was accepting connections on that port, which 
strongly suggests the primary segment went down (or restarted) while 
`pg_basebackup` was still streaming data from it. When `pg_basebackup` fails 
mid-transfer, it removes the partially copied data directory by default so that 
no incomplete or corrupt copy is left behind. That is the "removing data 
directory" behavior you're seeing.
   
   With 16 primary segments (~1TB each based on the progress output), running 
all `pg_basebackup` processes simultaneously puts enormous pressure on the 
system — heavy I/O, WAL generation, and network traffic across all hosts at 
once. This likely caused one or more primary segments to crash, which then 
cascaded into failures for the mirrors being built from them.
   
   **Suggested troubleshooting steps:**
   
   1. **Check primary segment logs** on `sky-cbseg04` (port 50002) for 
crash/OOM/restart entries around the time of the failure.
   2. **Check `dmesg`** on all segment hosts for OOM killer activity.
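   3. **Verify `max_wal_senders`**: run `gpconfig -s max_wal_senders` and 
ensure the value is sufficient (at least 2 per segment). A `pg_basebackup` that 
streams WAL alongside the data can hold two WAL sender connections on the 
primary while it runs.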
   4. **Check disk space** — during `pg_basebackup`, WAL files accumulate and 
can fill up disks.
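   
   As a rough sketch, steps 2 and 4 can be scripted with two small helpers. The 
function names `oom_lines` and `disk_used_pct` are hypothetical, the grep 
patterns are an assumption about typical Linux OOM-killer messages, and the 
data directory path is a placeholder to replace with your own:
   
   ```bash
   # Sketch only: helper functions for the checks above (names are hypothetical).

   # Step 2: keep only kernel-log lines that look like OOM-killer activity.
   # Reads log text on stdin, so it works with dmesg output or a saved log file.
   oom_lines() {
     grep -iE 'out of memory|oom-killer|killed process' || true
   }

   # Step 4: print the used-space percentage (without "%") of the filesystem
   # holding the given path, parsed from POSIX-format df output.
   disk_used_pct() {
     df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
   }

   # Example usage on a segment host (dmesg may require root):
   #   dmesg -T | oom_lines
   #   disk_used_pct /path/to/segment/datadir
   ```
   
   Running `disk_used_pct` periodically while the recovery is in progress 
shows whether WAL accumulation is pushing a filesystem toward full.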
   
   **Recommended workaround:**
   
   Reduce the parallelism to avoid overwhelming the cluster. Instead of 
recovering all segments at once, try:
   
   ```bash
   gprecoverseg -B 4 -b 2
   ```
   
   - `-B 4`: only process 4 hosts in parallel at the coordinator level (default 
is 16)
   - `-b 2`: only run 2 `pg_basebackup` processes per host (default is 64)
   
   This will take longer but should prevent the resource exhaustion that causes 
primaries to crash mid-backup.
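   
   To see how much the flags reduce the load, here is a back-of-the-envelope 
sketch of the upper bound on simultaneous `pg_basebackup` transfers. It assumes 
`-B` caps the number of hosts worked on at once and `-b` caps the segments per 
host, as described above. The function name and the 4 hosts x 4 primaries 
layout are hypothetical; substitute your real topology:
   
   ```bash
   # Sketch only: upper bound on concurrent pg_basebackup transfers.
   # Arguments: <hosts> <segments_per_host> <B> <b>
   max_concurrent_backups() {
     local hosts=$1 segs_per_host=$2 batch=$3 seg_batch=$4
     # Each limit only matters when it is smaller than what the cluster has.
     local h=$(( hosts < batch ? hosts : batch ))
     local s=$(( segs_per_host < seg_batch ? segs_per_host : seg_batch ))
     echo $(( h * s ))
   }

   # Hypothetical layout: 4 hosts with 4 primaries each (16 in total).
   max_concurrent_backups 4 4 16 64   # defaults   -> prints 16
   max_concurrent_backups 4 4 4 2     # -B 4 -b 2  -> prints 8
   ```
   
   In this assumed layout `-B 4` alone changes nothing because there are only 
4 hosts; it is `-b` that halves the concurrent transfers. If primaries still 
crash, lowering `-b` further (for example to 1) is the next thing to try.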
   
   For `gpaddmirrors`, the same flags (`-B` and `-b`) are available to control 
parallelism.
   
   Hope this helps! Let us know what you find in the primary segment logs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

