Repost from -devel.

Hi

We end up breaking volumes up artificially just to suit Bacula, because
of the single-threaded nature of the file daemon. We have LTO-6,
spooling and a 10 Gbit network, yet when a full backup ends up spanning
three weeks of run time it gets very, very painful. (Example below.)

10-Feb 07:22 bacula-dir JobId 201646: Error: Bacula bacula-dir 7.0.5 (28Jul14):
  Build OS:               x86_64-pc-linux-gnu ubuntu 16.04
  JobId:                  201646
  Job:                    Abe_Daily_RTP.2019-02-01_21.03.30_01
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "abe-fd" 7.0.5 (28Jul14) x86_64-pc-linux-gnu,ubuntu,16.04
  FileSet:                "Abe Set RTP" 2019-01-16 21:03:01
  Pool:                   "Full-Pool" (From Job FullPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "LTO-5" (From Job resource)
  Scheduled time:         01-Feb-2019 21:03:30
  Start time:             02-Feb-2019 05:38:30
  End time:               10-Feb-2019 07:22:30
  Elapsed time:           8 days 1 hour 44 mins
  Priority:               10
  FD Files Written:       3,096,049
  SD Files Written:       0
  FD Bytes Written:       3,222,203,306,821 (3.222 TB)
  SD Bytes Written:       0 (0 B)
  Rate:                   4620.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):
005641L5|005746L5|006211L5|006143L5|006125L5|006217L5|006221L5|005100L5|006158L5|006135L5|006175L5|006240L5|005291L5|006297L5|007543L6|007125L6|007180L6|007105L6|005538L5|005050L5|006254L5
  Volume Session Id:      3874
  Volume Session Time:    1544207587
  Last Volume Bytes:      1,964,015,354,880 (1.964 TB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Running
  Termination:            *** Backup Error ***

The average file size is about 1 MB here...

The underlying disks/filesystems are typically composed of 12, 24, 36
or more spinning disks, and disk systems today really need parallelism
to perform.

With large files, kernel readahead makes things work nicely, but when
someone dumps 100,000 2 KB files the backup slows down to single-disk
IOPS speed.

I have a few “suggestions” for how to get parallelism into this.

1) When reading the catalog, loop over all files and issue a
posix_fadvise(POSIX_FADV_WILLNEED) on the first 1 MB of each file.

I have prototyped this outside Bacula and it seems to work very
nicely, and it should be a small, non-intrusive patch. It allows the
I/O stack to issue requests for the smaller files concurrently,
caching them in memory. A rough sketch of the idea follows below.
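
A minimal standalone sketch of the idea (not the actual prototype and
not a Bacula patch; it reads a newline-separated file list on stdin in
place of the catalog):

    /* Prefetch the first 1 MB of every file named on stdin. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void prefetch(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return;                 /* skip unreadable entries */
        /* POSIX_FADV_WILLNEED is asynchronous: it queues readahead
         * and returns at once, so many files end up in flight. */
        posix_fadvise(fd, 0, 1024 * 1024, POSIX_FADV_WILLNEED);
        close(fd);                  /* pages stay in the page cache */
    }

    int main(void)
    {
        char path[4096];
        while (fgets(path, sizeof(path), stdin)) {
            path[strcspn(path, "\n")] = '\0';
            prefetch(path);
        }
        return 0;
    }

Because WILLNEED returns immediately, the kernel ends up with many
small reads in flight at once instead of one at a time.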

2) Thread out the file daemon.
Implement an X MB buffer in the file daemon: it could be 16 slots of
max 5 MB each. For files smaller than 5 MB the slots serve as a
staging area for the reader thread(s), which hand the data over to the
master process (rough sketch below).
Yes, this can be tuned in a lot of ways, but most of us with large
filesystems would happily sacrifice 5-10 GB of memory on the server
just to speed this up.

This is more intrusive, but it can be isolated fully to the file daemon.
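
Not Bacula code, just a sketch of that staging idea under the same
assumptions (16 slots of up to 5 MB); slot_put()/slot_get() and the
reader/master split are illustrative names, not existing file daemon
internals:

    #include <pthread.h>
    #include <string.h>

    #define NSLOTS   16
    #define SLOT_MAX (5 * 1024 * 1024)

    struct slot {
        char   path[4096];
        char  *data;           /* whole file contents, <= SLOT_MAX */
        size_t len;
    };

    static struct slot ring[NSLOTS];
    static int head, tail, count;       /* classic bounded ring */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    /* reader threads call this once a small file is in memory */
    void slot_put(const char *path, char *data, size_t len)
    {
        pthread_mutex_lock(&lock);
        while (count == NSLOTS)
            pthread_cond_wait(&not_full, &lock);
        strncpy(ring[tail].path, path, sizeof(ring[tail].path) - 1);
        ring[tail].data = data;
        ring[tail].len  = len;
        tail = (tail + 1) % NSLOTS;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    /* the master thread (the one talking to the SD) drains slots */
    struct slot slot_get(void)
    {
        struct slot s;
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);
        s = ring[head];
        head = (head + 1) % NSLOTS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        return s;               /* master frees s.data after sending */
    }

If the stream to the SD has to stay in catalog order, a per-slot
sequence number would be needed, but that is a detail.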

3) Allow a single job to spool/de-spool concurrently. Currently
spooling slows down individual job execution, but it would be
relatively simple to assign two spool buffers to the job: one is
spooled into while the other de-spools, and so on (sketch below).
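
Roughly, the hand-off between the two buffers could look like this
(only a sketch; spool_buffer_full(), despool_thread() and
despool_to_tape() are made-up names, not existing Bacula functions):

    #include <pthread.h>
    #include <stdbool.h>

    struct spool_buf { int id; /* spool file handle, size, ... */ };

    static struct spool_buf bufs[2];
    static int  fill_idx;          /* buffer currently being filled */
    static bool full[2];           /* true once a buffer is ready   */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static void despool_to_tape(struct spool_buf *b)
    {
        (void)b;                   /* write the buffer to tape here */
    }

    /* FD-facing side: called when the active buffer hits its limit */
    void spool_buffer_full(void)
    {
        pthread_mutex_lock(&m);
        full[fill_idx] = true;
        pthread_cond_broadcast(&c);
        /* wait for the other buffer to finish despooling, then keep
         * spooling into it while this one is written to tape */
        while (full[1 - fill_idx])
            pthread_cond_wait(&c, &m);
        fill_idx = 1 - fill_idx;
        pthread_mutex_unlock(&m);
    }

    /* tape-facing side, running in its own thread */
    void *despool_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);
            while (!full[0] && !full[1])
                pthread_cond_wait(&c, &m);
            int i = full[0] ? 0 : 1;
            pthread_mutex_unlock(&m);
            despool_to_tape(&bufs[i]);
            pthread_mutex_lock(&m);
            full[i] = false;
            pthread_cond_broadcast(&c);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

That way the drive keeps streaming while the next spool buffer fills,
instead of the job stalling during every despool pass.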

If someone is willing to help move some of this along, please let me
know and let's see if we can make ends meet.

Potentially others would like to co-fund this? I think it is unlikely
that we are the only ones with this need.

Thanks.
-- 
Jesper
