On 11/14/25 05:42, Chris Wright wrote:
Hello all,
I'm seeing some unexpectedly slow performance when testing a copy job
process and I've pretty much run out of ideas on diagnosing it.
We are currently running two bacula storage daemons, on different VMs,
and have been attempting to use copy jobs to take a copy of backups
offsite.
SD1 is storing 100 GiB volumes on an HDD-backed Ceph pool (via file
volumes on CephFS). We have ~130 TiB of backups (no compression) in
100 GiB volume files. Multiple jobs were run in parallel onto the
volumes and we were getting >100 MiB/second write throughput (mostly
client limited).
Did multiple jobs write to the same volume in parallel? If so, then
those jobs' data blocks are all interleaved with each other. That does
not slow down the original parallel writes, but a copy job has to pick
one job's records out of the interleaved stream, so it is constantly
seeking and thrashing the HDD read/write heads back and forth. Instead
of filling fixed-size volumes, you could write each job to its own
volume file.
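As a rough sketch of the one-job-per-volume idea, something along these
lines might work. The resource names, media type, and path are
placeholders I made up, and I have not tested this against 15.0.3.
"Maximum Volume Jobs" (Pool) and "Maximum Concurrent Jobs" (Device) are
standard directives, but with concurrent jobs a volume can pick up more
than one job before it is marked Used, so limiting each device to one
job is probably needed as well. This only affects newly written
volumes, not the existing interleaved ones.

```
# bacula-dir.conf (sketch; names and values are placeholders)
Pool {
  Name = PerJobFilePool             # hypothetical pool name
  Pool Type = Backup
  Label Format = "PerJob-"          # auto-label new volume files
  Maximum Volume Jobs = 1           # mark the volume Used after one job
}

# bacula-sd.conf (sketch; one such Device per parallel job stream)
Device {
  Name = FileDev1                   # hypothetical device name
  Media Type = File1                # hypothetical media type
  Archive Device = /path/to/volumes # placeholder path
  Label Media = yes                 # needed for Label Format to apply
  Maximum Concurrent Jobs = 1       # keep a second job off this volume
}
```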
SD2 is a Cloud store, using the S3 driver to push volumes up to
S3/Glacier on a fast connection with a local SSD cache.
SD1 and SD2 are on the same 10 Gbit/s switch, both have been
recently upgraded to bacula 15.0.3 and both are on reasonably modern
CPUs (AMD EPYC 9124 for SD1).
When we run a copy job we are seeing:
- Expected backup jobs spawn
- SD1 & SD2 connect to each other fine
- SD1 mounts a volume file and starts streaming data to SD2, with
reasonable throughput (50 - 100 MiB/sec)
- All seems well for a time, then throughput drops to essentially zero
- SD1 will have a single CPU pegged at 100%, with minimal IO traffic
(both ops and bandwidth) from the open volume file. We get spikes
of good speed, but average throughput after leaving a job running for a
week is <1 MiB/sec.
- SD2 is quiet, happily handling normal backup jobs from other
clients with normal performance
If we start a second, parallel, copy job we get similar initially good
throughput, then a second CPU on SD1 is pegged at 100%, but there is no
significant jump in overall performance.
There are no warnings/errors being logged and everything appears to be
"working", just glacially slow and apparently totally bottlenecked on
whatever that single CPU thread is doing with minimal reads from the
volumes.
Any suggestions on where to look for the root cause here?
Thanks
--
Chris Wright
Application Software Developer
_______________________________________________
Bacula-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-users