Hello,
See below ...
On 9/26/19 6:22 PM, David Brodbeck wrote:
> Part of the problem is it takes upwards of ten
> minutes for a job to fail when a workstation isn't available
> -- which is entirely correct, since the network connection has
> to time out.
Well, actually the wait time has a default (currently 3 minutes), but
it can be set by the administrator using the FDConnectTimeout
directive. At present, the value is global for all clients because
it is set in the Director resource. At some future time, it might
be worthwhile to allow a per-client connect timeout to be set.
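As a rough sketch of the setting described above: the Director name
is a placeholder, the other directives a real Director resource needs
are left out, and the 60-second value is only an illustration, not a
recommendation. Under those assumptions, the timeout could be lowered
in bacula-dir.conf along these lines:

```
# bacula-dir.conf (fragment; other required Director directives omitted)
Director {
  Name = example-dir               # hypothetical name
  FD Connect Timeout = 60 seconds  # applies to all clients; default is 3 minutes
}
```

Because the value is global, a shorter timeout also shortens the wait
for clients that are merely slow to answer, so it trades faster
failure of unreachable workstations against tolerance for slow ones.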
> However, the SD reservation is made *before* it
> tries to contact the FD, so I end up with resource starvation
> where jobs that are waiting to time out tie up resources that
> could be used by other jobs. I'm guessing the assumption is
> that clients will always be available, but the SD might be
> maxed out, so the code assumes it's more efficient not to
> contact a client until the director knows it has the resources
> to actually run the job.
For your particular case, though it is logical to attempt to acquire
all the resources prior to connecting to the client, it might make a
certain sense to connect to the FD before trying to reserve the
devices.
> One option would be to stagger the start times of my jobs
> so only the maximum the SD can handle get launched in any
> given 10 minute window, but that adds a lot of complexity to
> my configuration, since I currently can just allow JobDefs
> to pull in the schedule for all clients. I'd have to define
> start times individually, and maintain those in order to
> keep them balanced as I add/remove clients. Adding enough
> disks for the worst case isn't going to be possible. (I'm
> assuming one client per spindle is optimal for disk arrays
> -- maybe that's too conservative?)
> I've just been putting up with the error messages rather
> than deal with the added maintenance of that approach. The
> extra alert emails can be dealt with by filtering my
> incoming email.
Best regards,
Kern
Hello,
Bacula does already attempt to acquire the needed devices in the SD
and then backs them out if all the needed resources cannot be
obtained. This works quite nicely. Consequently, while the job is
waiting, the resources are released in the SD.
The problem occurs because the SD realizes that the resources are not
available, so it will wait a short period of time, trying again to
acquire the resources, which is what one wants for virtually all
jobs. When it cannot acquire the resources, the SD will fail the job.
The problem occurs because the user is overcommitting the SD
resources. The solution is to get more drives or modify how you run
jobs.
From what I understand, in this case the user has a large number of
jobs that regularly fail, and thus the user explicitly overcommits
the resources. The consequence is that Bacula works as it should, but
the user gets lots of messages about the SD not being able to get
resources.
Bacula was designed in a way where it expects to have the needed
resources available (i.e. the configuration should be optimized for
the available resources). It also handles the case where you overload
the SD (too many jobs for available resources), but in that case it
will warn you, which is exactly what 99% of all users want.
One possible solution would be to add a new directive that suppresses
the reservation failure message. However, there is very likely a
better solution with the existing Bacula; I just do not know what it
is at this time. This is the first time in 19 years that this problem
has come up, so before changing anything in the code, it has to be
very clearly understood, which is not the case (at least for me).
Another solution is for the user to modify the source code and remove
the warning message.
Best regards,
Kern
On 9/25/19 10:50 AM, Andrea Venturoli wrote:
> On 2019-09-25 10:19, Radosław Korzeniewski wrote:
>> Hello,
>>
>> On Sat, 21 Sep 2019 at 00:52, David Brodbeck <brodb...@math.ucsb.edu
>> <mailto:brodb...@math.ucsb.edu>> wrote:
>>
>> I think this is a somewhat unfortunate design decision, to be
>> honest. (...)
>>
>>
>> So what would be the best design in this case, one that would
>> solve the problem?
>
> I'm not so into the code to tell for sure.
> Maybe rescheduling should release the SD once the job first fails
> and reserve again when it starts the next time?
>
> bye & Thanks
> av.
>
--
David Brodbeck
System Administrator, Department of Mathematics
University of California, Santa Barbara
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users