Hi,

This has been a nice-to-have for years. We end up breaking up volumes artificially to suit Bacula, because of the single-threaded nature of the filedaemon. We have LTO6, spooling and a 10 Gbit network. When a full backup ends up spanning three weeks of run time, it gets very, very painful (example below):
10-Feb 07:22 bacula-dir JobId 201646: Error: Bacula bacula-dir 7.0.5 (28Jul14):
  Build OS:               x86_64-pc-linux-gnu ubuntu 16.04
  JobId:                  201646
  Job:                    Abe_Daily_RTP.2019-02-01_21.03.30_01
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "abe-fd" 7.0.5 (28Jul14) x86_64-pc-linux-gnu,ubuntu,16.04
  FileSet:                "Abe Set RTP" 2019-01-16 21:03:01
  Pool:                   "Full-Pool" (From Job FullPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "LTO-5" (From Job resource)
  Scheduled time:         01-Feb-2019 21:03:30
  Start time:             02-Feb-2019 05:38:30
  End time:               10-Feb-2019 07:22:30
  Elapsed time:           8 days 1 hour 44 mins
  Priority:               10
  FD Files Written:       3,096,049
  SD Files Written:       0
  FD Bytes Written:       3,222,203,306,821 (3.222 TB)
  SD Bytes Written:       0 (0 B)
  Rate:                   4620.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         005641L5|005746L5|006211L5|006143L5|006125L5|006217L5|006221L5|005100L5|006158L5|006135L5|006175L5|006240L5|005291L5|006297L5|007543L6|007125L6|007180L6|007105L6|005538L5|005050L5|006254L5
  Volume Session Id:      3874
  Volume Session Time:    1544207587
  Last Volume Bytes:      1,964,015,354,880 (1.964 TB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Running
  Termination:            *** Backup Error ***

The average file size here is about 1 MB. The underlying disks/filesystems are typically composed of 12, 24, 36 or more spinning disks, and disk systems today really need parallelism to perform. With large files, kernel readahead makes things work nicely, but when someone dumps 100,000 2 KB files, throughput drops to single-disk IOPS speed.

True single-job parallelism would of course be awesome: multiple spools, multiple drives, multiple streams over a single fileset. But that is also very complex, so I have two suggestions that are less intrusive:

1) When reading the catalog, loop over all files and issue posix_fadvise(POSIX_FADV_WILLNEED) on the first 1 MB of each file. I have prototyped this outside Bacula and it seems to work very nicely, and it should be a small, non-intrusive patch. It lets the IO stack fetch the smaller files concurrently and cache them in memory ahead of the sequential read loop. I have inspected the source code and cannot find any trace that this is already in place. (A sketch of the prototype is at the end of this mail.)

2) Thread out the filedaemon. Implement an X MB buffer in the filedaemon, say 16 slots of max 5 MB each. For files smaller than 5 MB, a slot serves as a staging area for a reader thread, which then hands the staged file over to the master process. Yes, this can be tuned in lots of ways, but most of us with large filesystems would happily sacrifice 5-10 GB of memory on the server just to speed this up. This is more intrusive, but can be isolated entirely to the filedaemon. (A second sketch follows the first one below.)

If someone is willing to help some of this along the way, please let me know and let's see if we can make ends meet. Potentially others would like to co-fund here? I find it unlikely that we are the only ones with this need.

Basics of our installation: ~10 PB on tape, 0.5 PB of live data under backup, a Quantum Scalar i6000 library with 6x LTO6 drives and 1100 slots. Our current Bacula catalog has survived since 2006-ish and 5 LTO generations, which is pretty impressive by itself.

Jesper
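
For reference, here is a minimal standalone sketch along the lines of the prototype mentioned in 1). Everything in it is illustrative (the stdin-driven file list, the PREFETCH_BYTES constant); it is not Bacula code, just a demonstration of the technique:

#define _POSIX_C_SOURCE 200112L
/* Sketch of suggestion 1: read file names (one per line) from stdin
 * and ask the kernel to start reading the first 1 MB of each file in
 * the background. A later sequential read pass then mostly hits the
 * page cache instead of waiting on single-disk IOPS. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PREFETCH_BYTES (1024 * 1024)   /* prefetch the first 1 MB only */

int main(void)
{
    char path[4096];

    while (fgets(path, sizeof(path), stdin) != NULL) {
        path[strcspn(path, "\n")] = '\0';   /* strip the trailing newline */

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            continue;   /* file may have vanished since it was listed */

        /* POSIX_FADV_WILLNEED returns quickly; the kernel queues the
         * readahead, so many small-file reads are in flight at once. */
        posix_fadvise(fd, 0, PREFETCH_BYTES, POSIX_FADV_WILLNEED);
        close(fd);
    }
    return 0;
}

The pages stay cached after close(), so the advice loop can run well ahead of the actual backup reads.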
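
And here is a rough sketch of the staging scheme in 2), assuming a simple producer/consumer model rather than the real filedaemon internals. The names (struct staged, stage_file), the one-thread-per-file demo driver and the queue design are all illustrative assumptions; only the 16 x 5 MB numbers match the example above:

#define _POSIX_C_SOURCE 200112L
/* Sketch of suggestion 2: reader threads stage small files into
 * memory (at most NSLOTS * SLOT_SIZE bytes in flight) and hand them
 * to a single "master" consumer, standing in for the FD's network/
 * spool writer. Illustrative only, not filedaemon code. */
#include <fcntl.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NSLOTS    16                  /* max files staged at once */
#define SLOT_SIZE (5 * 1024 * 1024)   /* per-file staging cap: 5 MB */

struct staged {
    char          *data;              /* up to SLOT_SIZE bytes of content */
    ssize_t        len;               /* bytes read, or -1 if open failed */
    const char    *path;
    struct staged *next;
};

static struct staged  *queue_head, *queue_tail;   /* FIFO of staged files */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static sem_t           free_slots;    /* caps memory at NSLOTS * SLOT_SIZE */

/* Reader thread: stage one small file into memory and queue it.
 * (Files larger than SLOT_SIZE would bypass staging in a real FD;
 * this demo just stages their first SLOT_SIZE bytes.) */
static void *stage_file(void *arg)
{
    struct staged *s = malloc(sizeof(*s));

    sem_wait(&free_slots);            /* block while all slots are in use */
    s->path = arg;
    s->data = malloc(SLOT_SIZE);
    s->next = NULL;

    int fd = open(s->path, O_RDONLY);
    s->len = (fd >= 0) ? read(fd, s->data, SLOT_SIZE) : -1;
    if (fd >= 0)
        close(fd);

    pthread_mutex_lock(&qlock);       /* publish to the master thread */
    if (queue_tail)
        queue_tail->next = s;
    else
        queue_head = s;
    queue_tail = s;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
    return NULL;
}

int main(int argc, char **argv)
{
    sem_init(&free_slots, 0, NSLOTS);

    /* Demo driver: one reader thread per path on the command line.
     * A real FD would use a fixed worker pool fed by the fileset walk. */
    pthread_t *tids = malloc(sizeof(pthread_t) * argc);
    for (int i = 1; i < argc; i++)
        pthread_create(&tids[i], NULL, stage_file, argv[i]);

    /* Master consumer: drain staged files one at a time, just like
     * the single network/spool writer does today. */
    for (int done = 1; done < argc; done++) {
        pthread_mutex_lock(&qlock);
        while (queue_head == NULL)
            pthread_cond_wait(&qcond, &qlock);
        struct staged *s = queue_head;
        queue_head = s->next;
        if (queue_head == NULL)
            queue_tail = NULL;
        pthread_mutex_unlock(&qlock);

        printf("staged %zd bytes of %s\n", s->len, s->path);
        free(s->data);
        free(s);
        sem_post(&free_slots);        /* slot is free for the next reader */
    }

    for (int i = 1; i < argc; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

Build with gcc -pthread. The point is that up to 16 reads can be outstanding against the disk array while the master still consumes files strictly one at a time, so nothing on the sending side has to change.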