Re: [Bacula-users] False "Intervention needed" flood

Kern Sibbald Fri, 27 Sep 2019 08:10:06 -0700

Hello,

See below ...

On 9/26/19 6:22 PM, David Brodbeck wrote:

Part of the problem is it takes upwards of ten minutes for a job to fail when a workstation isn't available -- which is entirely correct, since the network connection has to time out.

Well, actually the wait time has a default (currently 3 minutes), but the administrator can change it with the FDConnectTimeout directive. At present the value is global for all clients because it is set in the Director resource. At some future time, it might be worthwhile to allow a per-client connect timeout to be set.

However, the SD reservation is made *before* it tries to contact the FD, so I end up with resource starvation where jobs that are waiting to time out tie up resources that could be used by other jobs. I'm guessing the assumption is that clients will always be available, but the SD might be maxed out, so the code assumes it's more efficient not to contact a client until the director knows it has the resources to actually run the job.

Although it is logical to attempt to acquire all the resources before connecting to the client, in your particular case it might make sense to connect to the FD before trying to reserve the devices.
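To put a rough number on the starvation described above, here is a minimal back-of-the-envelope sketch. It is only an illustration under a simplifying assumption that is not taken from Bacula itself: every unreachable client is queued ahead of a reachable one, and each holds one SD device for the full connect timeout. The function name and parameters are hypothetical.

```python
import math


def worst_case_wait_minutes(unreachable_clients, sd_devices, connect_timeout_minutes):
    """Estimate how long a reachable client could wait behind unreachable ones.

    Assumption (not Bacula's actual scheduler): unreachable clients occupy the
    SD devices in rounds, each round lasting one full connect timeout.
    """
    if sd_devices < 1:
        raise ValueError("sd_devices must be at least 1")
    rounds = math.ceil(unreachable_clients / sd_devices)
    return rounds * connect_timeout_minutes


# 12 powered-off workstations, 4 devices, 10 minutes per failed connection
print(worst_case_wait_minutes(12, 4, 10))  # 30
# Same situation with the 3 minute default connect timeout
print(worst_case_wait_minutes(12, 4, 3))   # 9
```

Under these assumptions, shortening the connect timeout reduces the time devices are tied up proportionally, which is why the timeout setting matters in this situation.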

One option would be to stagger the start times of my jobs so only the maximum the SD can handle get launched in any given 10 minute window, but that adds a lot of complexity to my configuration, since I currently can just allow JobDefs to pull in the schedule for all clients. I'd have to define start times individually, and maintain those in order to keep them balanced as I add/remove clients. Adding enough disks for the worst case isn't going to be possible. (I'm assuming one client per spindle is optimal for disk arrays -- maybe that's too conservative?)

I've just been putting up with the error messages rather than deal with the added maintenance of that approach. The extra alert emails can be dealt with by filtering my incoming email.
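The staggering option mentioned above does not have to be maintained by hand if the start times are generated from the client list. The following is a minimal sketch of that idea, not something from Bacula: the function name, the client names, and the 23:00 first start time are all hypothetical, and turning the result into Schedule resources is left out.

```python
from datetime import datetime, timedelta


def stagger_start_times(clients, max_concurrent, window_minutes=10, first_start="23:00"):
    """Assign start times so each window slot gets at most max_concurrent clients.

    Clients are sorted by name so the assignment stays stable when the
    list is regenerated after adding or removing a client.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    base = datetime.strptime(first_start, "%H:%M")
    schedule = {}
    for index, client in enumerate(sorted(clients)):
        slot = index // max_concurrent  # which window this client falls into
        start = base + timedelta(minutes=slot * window_minutes)
        schedule[client] = start.strftime("%H:%M")
    return schedule


# Hypothetical client names; the SD is assumed to handle 2 jobs at once
for name, start in stagger_start_times(["ws-c", "ws-a", "ws-b", "ws-d", "ws-e"], 2).items():
    print(name, start)
```

With two concurrent jobs and a 10 minute window, the five example clients land at 23:00, 23:00, 23:10, 23:10 and 23:20. The trade-off is the one already described: each client needs its own start time instead of inheriting a single schedule from JobDefs.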

Best regards,
Kern

On Thu, Sep 26, 2019 at 1:28 AM Kern Sibbald <k...@sibbald.com> wrote:

Hello,

Bacula does already attempt to acquire the needed devices in the SD and
then backs them out if all the needed resources cannot be obtained.
This works quite nicely.   Consequently, while the job is waiting the
resources are released in the SD.

The problem occurs because the SD realizes that the resources are not
available, so it will wait a short period of time and then try again to
acquire the resources, which is what one wants for virtually all jobs.
When it cannot acquire the resources, the SD will fail the job. The
underlying cause is that the user is overcommitting the SD resources.
The solution is to get more drives or modify how you run jobs.

From what I understand, in this case the user has a large number
of jobs that regularly fail, and thus the user explicitly overcommits
the resources. The consequence is that Bacula works as it should, but the
user gets lots of messages about the SD not being able to get resources.

Bacula was designed in a way where it expects to have the needed
resources available (i.e. the configuration should be optimized for the
available resources). It also handles the case where you overload the
SD (too many jobs for available resources), but in that case it will
warn you, which is exactly what 99% of all users want.

One possible solution would be to add a new directive that suppresses
the reservation failure message. However, there is very likely a better
solution with the existing Bacula; I just do not know what it is at this
time. This is the first time in 19 years that this problem has come up,
so before changing anything in the code, it has to be very clearly
understood, which is not the case (at least for me).

Another solution is for the user to modify the source code and remove
the warning message.

Best regards,
Kern

On 9/25/19 10:50 AM, Andrea Venturoli wrote:
> On 2019-09-25 10:19, Radosław Korzeniewski wrote:
>> Hello,
>>
>> sob., 21 wrz 2019 o 00:52 David Brodbeck <brodb...@math.ucsb.edu
>> <mailto:brodb...@math.ucsb.edu>> napisał(a):
>>
>>     I think this is a somewhat unfortunate design decision, to be
>>     honest. (...)
>>
>>
>> So what should be the best design in this case which should solve the
>> problem?
>
> I'm not familiar enough with the code to tell for sure.
> Maybe rescheduling should release the SD once the job first fails and
> reserve again when it starts the next time?
>
> bye & Thanks
>     av.
>

--

David Brodbeck
System Administrator, Department of Mathematics
University of California, Santa Barbara
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
