Hi

This has been a nice-to-have for years; we end up breaking up volumes
artificially to support Bacula because of the single-threaded nature of
the filedaemon. We have LTO6, spooling and a 10Gbit network. When a full
backup ends up spanning 3 weeks of run time, it gets very, very painful.
(Example below.)

10-Feb 07:22 bacula-dir JobId 201646: Error: Bacula bacula-dir 7.0.5 (28Jul14):
  Build OS:               x86_64-pc-linux-gnu ubuntu 16.04
  JobId:                  201646
  Job:                    Abe_Daily_RTP.2019-02-01_21.03.30_01
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "abe-fd" 7.0.5 (28Jul14) x86_64-pc-linux-gnu,ubuntu,16.04
  FileSet:                "Abe Set RTP" 2019-01-16 21:03:01
  Pool:                   "Full-Pool" (From Job FullPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "LTO-5" (From Job resource)
  Scheduled time:         01-Feb-2019 21:03:30
  Start time:             02-Feb-2019 05:38:30
  End time:               10-Feb-2019 07:22:30
  Elapsed time:           8 days 1 hour 44 mins
  Priority:               10
  FD Files Written:       3,096,049
  SD Files Written:       0
  FD Bytes Written:       3,222,203,306,821 (3.222 TB)
  SD Bytes Written:       0 (0 B)
  Rate:                   4620.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):
005641L5|005746L5|006211L5|006143L5|006125L5|006217L5|006221L5|005100L5|006158L5|006135L5|006175L5|006240L5|005291L5|006297L5|007543L6|007125L6|007180L6|007105L6|005538L5|005050L5|006254L5
  Volume Session Id:      3874
  Volume Session Time:    1544207587
  Last Volume Bytes:      1,964,015,354,880 (1.964 TB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Running
  Termination:            *** Backup Error ***

Average file size is ~1 MB here...

The underlying disks/filesystems are typically composed of 12/24/36 or more
spinning disks. Disk systems today really need parallelism to perform.

When dealing with large files, kernel readahead makes things work nicely, but
when someone dumps 100,000 2KB files, throughput drops to single-disk IOPS
speed.

True single-job parallelism would of course be awesome - multiple
spools, multiple drives, multiple streams over a single fileset.
But that is also very complex.

I have two "suggestions" for less intrusive improvements.

1) When reading the catalog, loop over all files and issue a
posix_fadvise POSIX_FADV_WILLNEED on the first 1MB of each file.

I have prototyped this outside Bacula and it seems to work very
nicely, and it should be a small, non-intrusive patch. It lets the IO
stack issue reads concurrently for the smaller files, caching them in
memory. I have inspected the source code and cannot find any trace that
this is already in place. (Sketch below.)
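
For reference, a minimal sketch of the idea outside Bacula (illustrative
only; the toy driver, the file list source and the 1 MB hint size are my
assumptions, not existing Bacula code):

#include <fcntl.h>      /* open(), posix_fadvise(), POSIX_FADV_WILLNEED */
#include <unistd.h>     /* close() */

#define HINT_BYTES (1024 * 1024)   /* hint on the first 1 MB of each file */

/* Ask the kernel to start reading ahead on one file.  The call returns
 * immediately; the later sequential read then hits the page cache. */
static void prefetch_hint(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;                    /* unreadable or vanished: just skip it */
    posix_fadvise(fd, 0, HINT_BYTES, POSIX_FADV_WILLNEED);
    close(fd);
}

int main(int argc, char **argv)
{
    /* Toy driver: hint every path given on the command line; the real
     * backup loop (not shown) would then read them one by one as today. */
    for (int i = 1; i < argc; i++)
        prefetch_hint(argv[i]);
    return 0;
}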

2) Thread out the filedaemon
Implement an X MB buffer in the filedaemon. It could be 16 slots of
max 5MB each; for files smaller than 5MB this serves as a staging area
for the reader threads, which hand the data over to the main process.
Yes, this can be tuned in a lot of ways, but most of us with large
filesystems would easily sacrifice 5-10GB of memory on the server just
to speed this up.

This is more intrusive but can be isolated fully to the filedaemon; a
rough sketch of the staging buffer follows.
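
The sketch below is only an illustration of the staging idea (a classic
bounded ring of slots between reader threads and a single sender thread);
none of it is existing Bacula code, and the 16 x 5 MB sizing is just the
example numbers above:

#include <pthread.h>
#include <stddef.h>

#define NSLOTS   16
#define SLOT_MAX (5 * 1024 * 1024)   /* files above this would bypass staging */

struct staged_file {
    char  *path;
    char  *data;                     /* whole file contents, <= SLOT_MAX bytes */
    size_t len;
};

static struct staged_file ring[NSLOTS];
static int head, tail, count;        /* bounded ring buffer state */
static pthread_mutex_t mtx     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  can_put = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  can_get = PTHREAD_COND_INITIALIZER;

/* Reader threads call this after reading a small file into memory. */
void stage_put(char *path, char *data, size_t len)
{
    pthread_mutex_lock(&mtx);
    while (count == NSLOTS)                  /* all slots full: wait */
        pthread_cond_wait(&can_put, &mtx);
    ring[tail] = (struct staged_file){ path, data, len };
    tail = (tail + 1) % NSLOTS;
    count++;
    pthread_cond_signal(&can_get);
    pthread_mutex_unlock(&mtx);
}

/* The single sender thread calls this and ships the file onwards,
 * exactly as the current single-threaded code path would. */
struct staged_file stage_get(void)
{
    struct staged_file f;
    pthread_mutex_lock(&mtx);
    while (count == 0)                       /* nothing staged yet: wait */
        pthread_cond_wait(&can_get, &mtx);
    f = ring[head];
    head = (head + 1) % NSLOTS;
    count--;
    pthread_cond_signal(&can_put);
    pthread_mutex_unlock(&mtx);
    return f;
}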

If someone is willing to help move some of this along, please let me
know and let's see if we can make ends meet.

Potentially others would like to co-fund this? I find it unlikely
that we are the only ones with the need.

Basics of our installation: ~10 PB on tape, 0.5 PB of live data under
backup, a Quantum Scalar i6000 library with 6x LTO6 drives and 1100 slots.

Our current Bacula catalog has survived since 2006-ish and 5 LTO
generations - pretty impressive by itself.

Jesper





_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
