On Wednesday 18 May 2005 23:24, Sean O'Grady wrote:
> Hi,
>
> Good points on a number of things but a few comments need to be made.
>
> 1) I'm not attempting to use spooling as a backup method that I want to
> restore from. I'm using spooling as it's intended: to avoid
> "shoe-shining". I back up a number of clients' servers at remote sites, and
> their network connections can sometimes be saturated while backing up.
> With spooling, the tape is only moving when it needs to be, which is
> good for the wear and tear :) There are multiple clients, and I'm trying
> to use different pools of tapes for them, which is what's got me into this
> predicament. In terms of spool size and running out of space, that is a
> consideration even in the current version, so with some careful
> management this problem could be avoided.
>
> 2) I don't believe writing to a disk-based volume and then migrating to
> tape would work for me. For restores, wouldn't that require me to first
> restore the disk-based volume from tape to disk, then restore the
> files I need from that disk volume (which is really its own Storage
> Device)? I haven't really looked into this scenario, but the bits that I
> have read led me to believe that the restore scenario would be like that.

Actually this would probably work quite well, because while the data is on 
disk Bacula would restore it from disk, and when it is on tape Bacula would 
restore it from tape. 

However, this is not currently implemented and thus is not possible.

>
> 3) In terms of waiting for a Volume to be inserted I have the luxury of
> having a tape auto-loader doing the work for me. In my proposed scenario
> Bacula could check to see what Volume it requires as the Job finishes
> its spooling, and if the tape is not in the drive it could issue an
> mtx-changer command and have the autoloader load it. I see some
> potential issues here with timing and deadlocking for the drive between
> jobs but some careful queue management could ensure this works.

>
> 4) You're absolutely right about Bacula positioning the tape before the
> Job starts. Looking at the code, I see that data spooling begins
> only after Bacula acquires the Storage Device, which wouldn't work so
> well with Jobs needing Volumes from multiple Pools. With how the checks
> are currently working in terms of getting the "ok" to start, the job
> wouldn't end up being too much different (I say with a wink), possibly
> some shuffling around in the order of the sub-routines?? In reality I
> think this is where the major part of the work would be needed, since
> there is potential for some major failure here.

Well, the idea of obtaining a tape drive at the last minute is interesting,
and I'm going to think about it carefully, but my intuition tells me it is
dangerous. You could have 10 jobs partially completed all waiting for one
tape drive. This could bring your server to its knees in terms of resource
usage (especially disk space).

>
> With all this being said, I'll definitely bring it up next time Kern asks
> for "wish-list" suggestions. In the meantime I can simply do away with
> the multiple pools and make sure that same-Level Jobs happen in the same
> time frames, and I should have the behaviour that I want minus the
> separate Pools.

Yes, or get more tape drives. :-)

>
> Thanks everyone for your help!
>
> Sean
>
> Arno Lehmann wrote:
> > Hello,
> >
> > Sean O'Grady wrote:
> >> Hi,
> >
> > ...
> >
> >> I believe I have sorted out what my issue with this is. As I didn't
> >> post my complete configs and only the ones that I thought would be
> >> relevant I ended up only giving half the picture. What was missing
> >> was that there is another set of Pool tapes and different Jobs that
> >> run using these Pools (that also do data spooling) at the same time
> >> as the Jobs I showed before.
> >
> > Ok, so this explains it.
> >
> >> Looking at src/dird/jobq.c I see the following which hopefully Kern
> >> or someone else in touch with the code can enlighten a bit more for me.
> >
> > Well, I'm not in touch with the code, but still...
> >
> >>  >SNIP
> >>
> >> if (njcr->store == jcr->store && njcr->pool != jcr->pool) {
> >>     skip_this_jcr = true;
> >>     break;
> >> }
> >>
> >>  >SNIP
> >>
> >> This says to me that as long as the Pools of the Jobs being queued
> >> match, the Jobs will all run concurrently. Jobs however that have
> >> mismatching Pools will instead queue and wait for the storage device
> >> to free when previous jobs complete.
> >
> > That's about it, I'd say.
> >
> >> Its probably not this simple but some behaviour equivalent to ...
> >>
> >> if (njcr->store == jcr->store && njcr->pool != jcr->pool &&
> >> njcr->spool_data != true) {
> >>     skip_this_jcr = true;
> >>     break;
> >> }
> >>
> >> ... should allow for Jobs to queue with different Pools that have
> >> spooling on.
> >
> > Your idea might be possible, but there are some other things to
> > consider. One is that Bacula positions the tape *before* starting a
> > job, i.e. before starting to spool data.
> >
> > I was wondering about this, but I can see some good reason as well. I
> > guess that Kern's idea was that a job should only run when everything
> > indicates that it can run at all.
> >
> > So, making sure tape space is available is one important preparation.
> >
> >> To ensure that Jobs complete some further checks of the storage
> >> daemon and the director that -
> >>
> >> 1) when spooling from the client completes, is the Storage device
> >> available for append?
> >> 2) if the Storage device is available, is a Volume from a Pool suitable
> >> for this Job currently loaded (if not, load it)?
> >> 3) when the Job completes, check the status of queued Jobs and grab
> >> the next Job whose spooling is complete, then go to 2) again
> >
> > Although I can see advantages in your scenario I also see some
> > disadvantages.
> >
> > Spool space is one important thing - allowing jobs to spool without
> > being sure when they will be despooled can use up much or even all of
> > your disk space, thus preventing jobs from running smoothly that
> > otherwise could run fine.
> >
> > Then, I think it's a good idea to have jobs finish as soon as
> > possible, which would not be the case if they started, spooled data,
> > and then had to wait for someone to insert the right volume. Bacula
> > keeps open some network connections with each job, so it even wastes
> > resources (although this should not be a serious problem).
> >
> > Finally, I think spooling as Bacula does it now is not the best
> > approach to your needs. A spooled job is not available for restore and
> > not considered done, so it's not yet a useful backup. A better
> > approach would be to back up to a disk-based volume first, and later
> > migrate the job to tape.
> >
> >> My question now changes to "Is there a way for Jobs to run
> >> Concurrently that use different Pools as long as the Job Definitions
> >> are set to Spool Data" as outlined in the example above (or something
> >> similar)?
> >>
> >> Or of course maybe Bacula can already handle this and I'm just
> >> missing it :)
> >
> > This time you're not :-)
> >
> > But, considering that Kern seems to have the development version in a
> > state that approaches beta stability, I assume he will release the
> > version 1.38 in the next few months.
> >
> > After that, he will probably ask for feature requests and suggestions.
> > This would be the best time to present your ideas once more.
> >
> > Anyway, I'd vote for job migration :-)
> >
> > Arno
> >
> >> Thanks,
> >> Sean
> >>
> >> Arno Lehmann wrote:
> >>> Hi.
> >>>
> >>> Sean O'Grady wrote:
> >>>> Well its good to know that Bacula will do what I need!
> >>>>
> >>>> Guess now I need to determine what I've done wrong in my configs ...
> >>>>
> >>>> I'm short-forming all the config information to reduce the size of
> >>>> the e-mail, but I can post my full configs if necessary. Anywhere
> >>>> I have "Maximum Concurrent Jobs" I've posted that section of
> >>>> the config. If there is something else besides "Maximum Concurrent
> >>>> Jobs" needed in the configs to get this behaviour to happen and I'm
> >>>> missing it, please let me know.
> >>>
> >>> The short form is ok :-)
> >>>
> >>> Now, after reading through it I actually don't see any reason why
> >>> only one job at a time is run.
> >>>
> >>> Perhaps someone else can...
> >>>
> >>> Still, I have some questions.
> >>> First, which version of bacula do you use?
> >>> Then, do you perhaps use job overrides concerning the pools or the
> >>> priorities in your schedule?
> >>> And, finally, are all the jobs scheduled to run at the same level,
> >>> e.g. full, and do they actually do so? Perhaps you have a job
> >>> running at Full level, and the others are scheduled to run
> >>> incremental, so they have to wait for the right media (of pool
> >>> DailyPool).
> >>>
> >>> Arno
> >>>
> >>>> Any suggestions appreciated!
> >>>>
> >>>> Sean
> >>>>
> >>>> In bacula-dir.conf ...
> >>>>
> >>>> Director {
> >>>>  Name = mobinet-dir1
> >>>>  DIRport = 9101                # where we listen for UA connections
> >>>>  QueryFile = "/etc/bacula/query.sql"
> >>>>  WorkingDirectory = "/data/bacula/working"
> >>>>  PidDirectory = "/var/run"
> >>>>  Maximum Concurrent Jobs = 10
> >>>>  Password = "****"         # Console password
> >>>>  Messages = Daemon
> >>>> }
> >>>>
> >>>> JobDefs {
> >>>>   Name = "MobinetDef"
> >>>>   Storage = polaris-sd
> >>>>   Schedule = "Mobinet-Cycle"
> >>>>   Type = Backup
> >>>>   Max Start Delay = 32400 # 9 hours
> >>>>   Max Run Time = 14400 # 4 hours
> >>>>   Rerun Failed Levels = yes
> >>>>   Maximum Concurrent Jobs = 5
> >>>>   Reschedule On Error = yes
> >>>>   Reschedule Interval = 3600
> >>>>   Reschedule Times = 2
> >>>>   Priority = 10
> >>>>   Messages = Standard
> >>>>   Pool = Default
> >>>>   Incremental Backup Pool = MobinetDailyPool
> >>>>   Differential Backup Pool = MobinetWeeklyPool
> >>>>   Full Backup Pool = MobinetMonthlyPool
> >>>>   SpoolData = yes
> >>>> }
> >>>>
> >>>> JobDefs {
> >>>>   Name = "SiriusWebDef"
> >>>>   Storage = polaris-sd
> >>>>   Schedule = "SiriusWeb-Cycle"
> >>>>   Type = Backup
> >>>>   Max Start Delay = 32400 # 9 hours
> >>>>   Max Run Time = 14400 # 4 hours
> >>>>   Rerun Failed Levels = yes
> >>>>   Maximum Concurrent Jobs = 5
> >>>>   Reschedule On Error = yes
> >>>>   Reschedule Interval = 3600
> >>>>   Reschedule Times = 2
> >>>>   Priority = 10
> >>>>   Messages = Standard
> >>>>   Pool = Default
> >>>>   Incremental Backup Pool = MobinetDailyPool
> >>>>   Differential Backup Pool = MobinetWeeklyPool
> >>>>   Full Backup Pool = MobinetMonthlyPool
> >>>>   SpoolData = yes
> >>>> }
> >>>>
> >>>> Storage {
> >>>>  Name = polaris-sd
> >>>>  Address = "****"
> >>>>  SDPort = 9103
> >>>>  Password = "****"
> >>>>  Device = "PowerVault 122T VS80"
> >>>>  Media Type = DLTIV
> >>>>  Maximum Concurrent Jobs = 10
> >>>> }
> >>>>
> >>>> In bacula-sd.conf
> >>>>
> >>>> Storage {                             # definition of myself
> >>>>  Name = polaris-sd
> >>>>  SDPort = 9103                  # Director's port
> >>>> WorkingDirectory = "/data/bacula/working"
> >>>>  Pid Directory = "/var/run"
> >>>>  Maximum Concurrent Jobs = 10
> >>>> }
> >>>>
> >>>> Device {
> >>>>   Name = "PowerVault 122T VS80"
> >>>>   Media Type = DLTIV
> >>>>   Archive Device = /dev/nst0
> >>>>   Changer Device = /dev/sg1
> >>>>   Changer Command = "/etc/bacula/mtx-changer %c %o %S %a"
> >>>>   AutoChanger = yes
> >>>>   AutomaticMount = yes               # when device opened, read it
> >>>>   AlwaysOpen = yes
> >>>>   LabelMedia = no
> >>>>   Spool Directory = /data/bacula/spool
> >>>>   Maximum Spool Size = 14G
> >>>> }
> >>>>
> >>>> In bacula-fd.conf on all the clients
> >>>>
> >>>> FileDaemon {                          # this is me
> >>>>  Name = polaris-mobinet-ca
> >>>>  FDport = 9102                  # where we listen for the director
> >>>>  WorkingDirectory = /data/bacula/working
> >>>>  Pid Directory = /var/run
> >>>>  Maximum Concurrent Jobs = 10
> >>>> }
> >>>>
> >>>> Arno Lehmann wrote:
> >>>>> Hello,
> >>>>>
> >>>>> Sean O'Grady wrote:
> >>>>> ...
> >>>>>
> >>>>>> As an alternative which would be even better - All 5 Jobs start @
> >>>>>> 23:00 spooling data from the client, the first Job to complete
> >>>>>> the spooling from the client starts writing to the Storage
> >>>>>> Device. Remaining Jobs queue for the Storage Device as it becomes
> >>>>>> available and as their spooling completes.
> >>>>>>
> >>>>>> Instead what I'm seeing is while the first job executes the
> >>>>>> additional jobs all have a status of "is waiting on max Storage
> >>>>>> jobs" and will not begin spooling their data until that first Job
> >>>>>> has spooled->despooled->written to the Storage Device.
> >>>>>>
> >>>>>> My question is of course "is this possible" to have Concurrent
> >>>>>> Jobs running and spooling in one of the scenarios above (or
> >>>>>> another I'm missing).
> >>>>>
> >>>>> Well, I guess that this must be a setup problem on your side -
> >>>>> after all, this is what I'm doing here and it works (apart from
> >>>>> very few cases where jobs are held that *could* start, but I
> >>>>> couldn't find out why yet).
> >>>>>
> >>>>> From your description, I assume that you forgot to set "Maximum
> >>>>> Concurrent Jobs" in all the necessary places, namely in the
> >>>>> storage definitions.
> >>>>>
> >>>>> I noticed that the same message is printed when the director has
> >>>>> to wait for a client, though. (This is not yet confirmed, noticed
> >>>>> it only yesterday and couldn't verify it yet).
> >>>>>
> >>>>>> If so I'll send out more details of my config to see if anyone
> >>>>>> can point out what I'm doing wrong.
> >>>>>
> >>>>> First, verify the settings you have - there are directives in the
> >>>>> client's config, the sd config, and the director configuration
> >>>>> where you need to apply the right settings for your setup.
> >>>>>
> >>>>> Arno
> >>>>>
> >>>>>> Thanks,
> >>>>>> Sean
> >>>>>>
> >>>>>> --
> >>>>>> Sean O'Grady
> >>>>>> System Administrator
> >>>>>> Sheridan College
> >>>>>> Oakville, Ontario
> >>>>>>
> >>>>>>
> >>>>>> -------------------------------------------------------
> >>>>>> This SF.Net email is sponsored by Oracle Space Sweepstakes
> >>>>>> Want to be the first software developer in space?
> >>>>>> Enter now for the Oracle Space Sweepstakes!
> >>>>>> http://ads.osdn.com/?ad_id=7412&alloc_id=16344&op=click
> >>>>>> _______________________________________________
> >>>>>> Bacula-users mailing list
> >>>>>> Bacula-users@lists.sourceforge.net
> >>>>>> https://lists.sourceforge.net/lists/listinfo/bacula-users
> >>
>

-- 
Best regards,

Kern

  (">
  /\
  V_V


