Dear Bacula developers,

This chapter goes further. It seems the problem is not merely that bacula-dir
crashes when an FD connects aggressively; the FD connection scheduling and the
aggressiveness of FD reconnection also need some more thought.

Today bacula-dir crashed again. This time the other FD (this one is on Debian)
went rampant for no apparent reason. It just connected to the Director like
crazy.

Something that also seems to need adjustment is the connection scheduling.
When an FD is already running before the schedule starts, it starts connecting
as scheduled, but it never stops. I think it should stop when the schedule
ends. When an FD is started during the schedule window, it does NOT start
connecting to the Director. This is clearly not what one would want. An FD
started after the scheduled start time but within the scheduled duration
should connect to the Director. Only outside that window, that is before the
scheduled start or after the scheduled duration has elapsed, should the FD not
connect to the Director.

The Director needs to be fixed so that it does not crash if an FD is
reconnecting very aggressively. And FDs should be more limited in their
connection frequency (I do not mean the parameter ReconnectionTime, but the
time to wait before initiating another connection after an unsuccessful
connection attempt).

All the best,
J/C

> On 2. Aug 2022, at 22:42, Justin Case <jus7inc...@gmail.com> wrote:
>
> This chapter was not yet closed and I have interesting further results
> (actually a bug report).
> I opened another thread here where bacula-dir was crashing. Today I found the
> cause.
>
> A macOS FD was earlier using a normal connection from bacula-dir, but later I
> adapted it to connect to bacula-dir on its own. So far so good? No. Hang on,
> I am coming to it.
>
> Under macOS I am using the Homebrew bacula-fd, and in the meantime there was
> an upgrade from 11 to 13. At that point I made a subtle mistake: I copied
> bacula-fd.conf from the 11 install to the 13 install. The problem with this
> is that I overlooked this directive:
>
> Plugin Directory = /usr/local/Cellar/bacula-fd/11.0.6/lib
>
> I guess you immediately see where the problem is: the 13.0.0 bacula-fd binary
> is using the 11.0.6 bpipe plugin binary.
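
A stale path like this can be caught mechanically. What follows is only a
rough Python sketch, not anything shipped with Bacula: the helper names
plugin_dir_version and plugin_dir_matches are made up for illustration, and it
assumes the version appears as an x.y.z path component, as in the Homebrew
Cellar path quoted above.

```python
import re

# Hypothetical helpers for illustration only; not part of Bacula.
# Matches a line such as:  Plugin Directory = /some/path
PLUGIN_DIR_RE = re.compile(
    r'^[ \t]*Plugin[ \t]*Directory[ \t]*=[ \t]*"?([^"\n#]+?)"?[ \t]*$',
    re.IGNORECASE | re.MULTILINE,
)
# Assumption: the version is a path component of the form x.y.z.
VERSION_RE = re.compile(r'/(\d+\.\d+\.\d+)(?:/|$)')


def plugin_dir_version(conf_text):
    """Return the x.y.z version embedded in the Plugin Directory path, or None."""
    directive = PLUGIN_DIR_RE.search(conf_text)
    if directive is None:
        return None
    version = VERSION_RE.search(directive.group(1).strip())
    return version.group(1) if version else None


def plugin_dir_matches(conf_text, fd_version):
    """True if the path carries no version at all, or the same one as the FD."""
    found = plugin_dir_version(conf_text)
    return found is None or found == fd_version
```

Fed the text of the copied bacula-fd.conf and the string "13.0.0",
plugin_dir_matches would return False, which is the situation described in
the quoted mail.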
>
> The result of this seems to be that bacula-fd connects a lot more
> aggressively to bacula-dir (don’t ask me why, I don’t know).
>
> In an ideal world this shouldn’t pose a problem, but in the container for
> bacula-dir 11 that I am using it caused bacula-dir to crash with a
> segmentation violation.
>
> This is why I am reporting this.
>
> Since I corrected the library path I don’t see the crashes any more, nor the
> rapid, numerous connections from the FD to the Director.
>
> But! It is still the same Director binary, and there must be some bug if a
> misbehaving FD can actually make the Director crash. This is a security
> problem: someone could incapacitate the backups while they run and stop
> doing so while no backups are running, which would make it hard to
> troubleshoot. One could argue it is my own fault if I use v13 FDs with v11
> Directors. From a security perspective it is a problem with the
> implementation. As long as a v11 Director accepts v13 clients, it must not
> crash.
>
> I don’t know how to report this better, but some developer would probably
> want to look at why bacula-dir crashes when it gets a lot of FD connections
> in quick succession over a longer time. I have only tried it with v13 FDs,
> but I suspect it would also happen with v11 FDs.
>
> All the best
> J/C
>
>> On 25. Jul 2022, at 22:33, Justin Case <jus7inc...@gmail.com> wrote:
>>
>> I think the support here from you people is great!
>>
>> Today I think I might have understood what is happening, and Bill’s
>> explanations about what might be going on were probably correct in the core,
>> but not in the details.
>>
>> Let me try to lay out what I think is going on and where I had my problem
>> understanding it in the first place:
>>
>> After an update of syslog-ng, syslogging from the FD client host started to
>> work (it was configured a long time ago, but somehow it never worked before
>> the syslog-ng server update that arrived in the last few days), and I began
>> to see where and WHEN(!) the error messages originated.
>>
>> It is, as you guys are saying, the FD generating these errors, which are
>> logged without delay in my central syslog-ng server:
>>
>> 2022-07-25 12:57:30
>> bsockcore.c:265 Unable to connect to Director daemon on
>> bacula-dir.lan.net:9101. ERR=Connection refused
>>
>> The eye-opener was the timestamps, which explained what is happening (more
>> on that later).
>> My problem so far was that the error messages shown in Baculum carried the
>> timestamp of the moment the Director sees the error messages, not of the
>> moment they happened!
>>
>> 25-Jul 22:00 bacula-dir JobId 1725: Error: getmsg.c:217 Malformed message:
>> [bsockcore.c:265 Unable to connect to Director daemon on
>> bacula-dir.lan.net:9101. ERR=Connection refused
>>
>> Note the different timestamps. The first message carries the timestamp of
>> the FD client host at the moment the error occurs there. The second message
>> carries the timestamp of the Director host at the moment the first error
>> message gets delivered from the FD to the Director.
>>
>> So what you guys said is correct: the Director accepts the error messages
>> from the FD only when a job runs for the FD. Even if the FD connects to the
>> Director many times during the day, the error messages are held back by the
>> FD until a job actually runs, and then they are ingested for the first job
>> that runs on the current day.
>> This also explains why there are no errors when a similar job runs shortly
>> afterwards to back up to the other storage tier.
>>
>> Because so far I was only seeing the Director timestamp, I was misled into
>> thinking that the error actually happens at the time when the job runs. I
>> now understand that this is not correct, and I think you guys also mentioned
>> it, but I didn’t pick it up consciously enough to understand what it means.
>>
>> Now that I can see the timestamp from the FD when the errors actually happen
>> on the FD host, I can confirm:
>>
>> (1) the Director is definitely reachable for the FD at the time when the job
>> runs (as I have always stated); this is why the error messages show the
>> timestamp of when the job runs, as the job is always able to run thanks to
>> the availability of the Director.
>>
>> (2) the Director is NOT reachable at some scheduled times each day when the
>> container is shut down for third-party backup reasons (the firewall has
>> nothing to do with this). This is the time frame in which the errors
>> actually occur, and they can now be seen in syslog-ng.
>>
>> I suppose that if I now schedule the FD to connect to the Director only when
>> the job runs, the errors should go away. I will try this and report back.
>>
>> One last thing is still unclear to me. Today I saw 455 connection errors in
>> the Baculum Messages window, but only 38 connection errors in syslog-ng.
>> This is weird, as (1) I am using syslog over TCP, and (2) I think I should
>> see the same or a higher number of connection errors in syslog-ng compared
>> to the Baculum Messages window. However, it is the other way around, with
>> considerably more errors on the Director side than on the FD side (syslog).
>> Can this be explained?
>>
>> All the best,
>> J/C
>>
>>
>>> On 25. Jul 2022, at 18:04, Martin Simmons <mar...@lispworks.com> wrote:
>>>
>>> >>>>> On Mon, 25 Jul 2022 15:50:15 +0000, Bill Arlofski said:
>>>>
>>>> On Monday, July 25th, 2022 at 08:54, Martin Simmons <mar...@lispworks.com> wrote:
>>>>>
>>>>> You could try running bacula-fd with debugging output. Unfortunately,
>>>>> it doesn't include timestamps, but you can do it like this:
>>>>
>>>> Hey Martin, Not sure if this is recent or not, but:
>>>> ----8<----
>>>> $ /opt/comm-bacula/sbin/bacula-fd -?
>>>> Copyright (C) 2000-2022 Kern Sibbald.
>>>>
>>>> Version: 13.0.0 (04 July 2022)
>>>>
>>>> Usage: bacula-fd [-f -s] [-c config_file] [-d debug_level]
>>>>      -c <file>        use <file> as configuration file
>>>>      -d <n>[,<tags>]  set debug level to <nn>, debug tags to <tags>
>>>>
>>>>      -dt              print a timestamp in debug output   <---- TimeStamps
>>>>
>>>>      -f               run in foreground (for debugging)
>>>>      -g               groupid
>>>>      -k               keep readall capabilities
>>>>      -m               print kaboom output (for debugging)
>>>>      -P               do not create pid file
>>>>      -s               no signals (for debugging)
>>>>      -t               test configuration file and exit
>>>>      -T               set trace on
>>>>      -u               userid
>>>>      -v               verbose user messages
>>>>      -?               print this message.
>>>> ----8<----
>>>
>>> Thanks, I didn't know that.
>>>
>>> So this will be simpler:
>>>
>>> bacula-fd -dt -d50,scheduler -f -v ...your normal bacula-fd args...
>>>
>>> __Martin
>>>
>>>
>>> _______________________________________________
>>> Bacula-users mailing list
>>> Bacula-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/bacula-users
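
To make the scheduling request at the top of this mail concrete, here is a
minimal Python sketch. It is not Bacula code: in_connect_window and
retry_delays are hypothetical names, and the capped exponential backoff (5 s
base, 300 s cap) is just one possible way to limit the connection frequency,
with values picked purely for illustration.

```python
from datetime import datetime, timedelta


def in_connect_window(now, start, duration):
    """True while start <= now < start + duration.

    The decision depends only on the current time, not on when the FD
    process was launched, so an FD started in the middle of the window
    connects, and an FD that was already running stops once the window ends.
    """
    return start <= now < start + duration


def retry_delays(base=5.0, cap=300.0, attempts=6):
    """Seconds to wait after each failed connection attempt.

    The wait doubles after every failure and never exceeds `cap`.
    """
    return [min(cap, base * 2 ** i) for i in range(attempts)]


if __name__ == "__main__":
    start = datetime(2022, 8, 3, 22, 0)      # example schedule start
    duration = timedelta(hours=2)            # example schedule duration
    for now in (datetime(2022, 8, 3, 21, 59),
                datetime(2022, 8, 3, 22, 30),
                datetime(2022, 8, 4, 0, 0)):
        print(now, in_connect_window(now, start, duration))
    print(retry_delays())
```

The point of the first function is that the window check is evaluated at
every connection attempt instead of once at FD start-up; the second keeps a
failing FD from hammering the Director.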