On 11/14/25 05:42, Chris Wright wrote:
Hello all,

I'm seeing some unexpectedly slow performance when testing a copy job process and I've pretty much run out of ideas on diagnosing it.

We are currently running two Bacula storage daemons on different VMs, and have been attempting to use copy jobs to take a copy of backups offsite.

SD1 stores 100 GiB volumes on an HDD-backed Ceph pool (as file volumes on CephFS). We have ~130 TiB of backups (no compression) in 100 GiB volume files. Multiple jobs were run in parallel onto the volumes, and we were getting >100 MiB/second write throughput (mostly client limited).

Did multiple jobs write to the same volume in parallel? If so, those jobs' data blocks are all interleaved with each other. The original parallel writes are not affected, but a copy job reading the interleaved records has to seek constantly, thrashing the HDD read/write heads back and forth. Instead of sharing fixed-size volumes between jobs, you could write each job to its own volume file.
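
To make the interleaving point concrete, here is a toy model in Python. It is only a sketch: it assumes the parallel jobs wrote their blocks in strict round-robin order, which real jobs will not do exactly, and the function names are made up for illustration, not anything from Bacula. It counts how many separate contiguous runs one job's data is split into on a shared volume, and how many foreign blocks sit in between.

```python
def interleaved_volume(num_jobs: int, blocks_per_job: int) -> list[int]:
    """Model a volume as a list of job ids, one entry per block.

    Assumption: jobs write one block each in strict round-robin order.
    """
    return [job for _ in range(blocks_per_job) for job in range(num_jobs)]


def read_stats(volume: list[int], job: int) -> tuple[int, int]:
    """Return (contiguous_runs, foreign_blocks_skipped) for one job.

    contiguous_runs: separate extents that hold the job's blocks.
    foreign_blocks_skipped: other jobs' blocks lying between the job's
    first and last block, which a reader must seek over or discard.
    """
    positions = [i for i, owner in enumerate(volume) if owner == job]
    runs = 1 + sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)
    span = positions[-1] - positions[0] + 1
    return runs, span - len(positions)


if __name__ == "__main__":
    shared = interleaved_volume(num_jobs=4, blocks_per_job=1000)
    private = interleaved_volume(num_jobs=1, blocks_per_job=1000)
    print("shared volume :", read_stats(shared, job=0))
    print("one job/volume:", read_stats(private, job=0))
```

In this model a job on a volume shared four ways is split into as many runs as it has blocks, while a job with its own volume is a single sequential read. Whether that alone accounts for a CPU pegged at 100% is a separate question.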


SD2 is a Cloud store, using the S3 driver to push volumes up to S3/Glacier on a fast connection with a local SSD cache.

SD1 and SD2 are on the same 10 Gb switch, both have been recently upgraded to Bacula 15.0.3 and both are on reasonably modern CPUs (AMD EPYC 9124 for SD1).

When we run a copy job we are seeing:
 - Expected backup jobs spawn
 - SD1 & SD2 connect to each other fine
 - SD1 mounts a volume file and starts streaming data to SD2, with reasonable throughput (50 - 100 MiB/sec)
 - all seems well for a time then throughput drops to essentially zero
 - SD1 will have a single CPU pegged at 100%, with minimal IO traffic (both ops and bandwidth) from the open volume file, we will get spikes of good speed but average throughput after leaving a job running for a week is <1 MiB/sec.
 - SD2 is quiet, happily handling normal backup jobs from other clients with normal performance
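
For a sense of scale, here is a rough back-of-the-envelope calculation in Python using only the figures quoted in this message (~130 TiB of backups, 50 - 100 MiB/sec when things go well, <1 MiB/sec on average). The helper name is made up for illustration, and it assumes a single copy stream running at a constant rate.

```python
TIB = 1024 ** 4  # bytes per TiB
MIB = 1024 ** 2  # bytes per MiB
SECONDS_PER_DAY = 86_400


def days_to_copy(total_bytes: float, rate_bytes_per_sec: float) -> float:
    """Days needed to copy total_bytes at a constant rate (single stream)."""
    return total_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    total = 130 * TIB
    for rate_mib in (100, 50, 1):
        days = days_to_copy(total, rate_mib * MIB)
        print(f"{rate_mib:>3} MiB/sec: {days:8.1f} days")
```

At 100 MiB/sec the full 130 TiB takes a little over two weeks; at a sustained 1 MiB/sec it takes over four years.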

If we start a second, parallel, copy job we get similar initially good throughput then peg a second CPU on SD1 to 100% but there isn't exactly a big jump in performance.

There are no warnings/errors being logged and everything appears to be "working", just glacially slow and apparently totally bottlenecked on whatever that single CPU thread is doing with minimal reads from the volumes.

Any suggestions on where to look for the root cause here?

Thanks
--

Chris Wright

Application Software Developer

_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users