I spent considerable time yesterday moving my bareos dir host from centos7 to debian 12. I ran my usual set of jobs last night and still got 5 jobs that failed with "Fatal error: filed/backup.cc:1616 Network send error to SD. ERR=Broken pipe". This only started happening in the last 6 weeks, since I stood up a new fd host. None of my other fd hosts are triggering this error. When I manually re-run the failed jobs, they usually complete fine, though yesterday I tried to re-run one three times and it never finished successfully. Both hosts are running the latest version available from the official bareos repos: 23.0.4~pre113.6ea98eb40-106. I need some additional troubleshooting and debugging help with this. The debug logs aren't really showing anything useful.
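To compare failure signatures across several job reports, the messages can be tallied by source location and ERR= reason. The following is a minimal sketch, assuming the messages look like the one quoted above; the function name `summarize_fatal_errors` and the regular expression are illustrative and not part of bareos.

```python
import re
from collections import Counter

# Matches the "Fatal error:" part of a job message such as
#   Fatal error: filed/backup.cc:1616 Network send error to SD. ERR=Broken pipe
# The source location (file.cc:line) and the ERR= suffix are both optional.
FATAL_RE = re.compile(
    r"Fatal error: (?:(?P<src>\S+\.cc:\d+) )?(?P<msg>.*?)(?: ERR=(?P<err>.*))?$"
)


def summarize_fatal_errors(lines):
    """Count fatal errors by (source location, ERR= reason).

    Either element of the key is None when the message does not carry it.
    Lines without "Fatal error:" are ignored.
    """
    counts = Counter()
    for line in lines:
        match = FATAL_RE.search(line.strip())
        if match:
            counts[(match.group("src"), match.group("err"))] += 1
    return counts


if __name__ == "__main__":
    # The first entry is the message from this post; the second is a
    # made-up non-fatal line to show that it is skipped.
    sample = [
        "Fatal error: filed/backup.cc:1616 Network send error to SD. ERR=Broken pipe",
        "Insert of attributes batch table done",
    ]
    for (src, err), n in summarize_fatal_errors(sample).items():
        print(n, src, err)
```

Feeding it the text of each failed job report should show quickly whether every failure comes from the same code path.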
Thanks.
Seth

On Wednesday, July 10, 2024 at 9:32:40 AM UTC-5 Seth Galitzer wrote:
> I've been running my dir and sd on a centos7 (I know, it's old) host,
> upgrading bareos regularly. It's been processing jobs just fine from fd
> hosts running a variety of debian and ubuntu releases, as well as another
> centos7 host. I recently moved jobs from the centos7 fd to a new one
> running debian 12 (bookworm), also running the latest bareos release. Since
> then, jobs have been randomly failing from that host only.
>
> I would get job reports with messages like this:
> 05-Jul 20:00 imperial-dir JobId 60064: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
> 05-Jul 20:00 imperial-dir JobId 60064: Fatal error: Director's comm line to SD dropped.
> 05-Jul 20:00 imperial-dir JobId 60064: Fatal error: No Job status returned from FD.
> 05-Jul 20:00 imperial-dir JobId 60064: Insert of attributes batch table with 323847 entries start
> 05-Jul 20:00 imperial-dir JobId 60064: Insert of attributes batch table done
> 05-Jul 20:00 imperial-dir JobId 60064: Error: Bareos imperial-dir 23.0.4~pre61.010c81fdc (03Jul24):
>
> Essentially, it looks like the job would run to completion, but then never
> send the final OK back to the director, eventually time out and then
> trigger this error. When I first set up the new fd host, this was happening
> for every job. After doing a bit of research, I added
> "Heartbeat Interval = 60" to the client config on the dir. Since then, most
> of the jobs have been completing, but 5 out of about 30 still fail. Upon
> re-running those jobs manually, sometimes 1 still fails, but the rest
> succeed.
>
> Now, my job reports have errors like this:
> 10-Jul 03:51 files-fd JobId 60268: Fatal error: filed/dir_cmd.cc:2423 Comm error with SD. bad response to Append Data. ERR=Connection reset by peer
> 10-Jul 03:51 imperial-dir JobId 60268: Fatal error: Director's comm line to SD dropped.
> 10-Jul 03:51 imperial-dir JobId 60268: Error: Bareos imperial-dir 23.0.4~pre64.caca3169f (05Jul24):
>
> I turned on trace debugging for the dir, sd, and fd (remember I have dir
> and sd running on the same host). I can send full traces if needed, but the
> most prevalent error from all three traces is something like this:
> lib/tls_openssl_private.cc:325-60268 SSL_get_error() returned error value 2
> Sometimes the error code returned is 5, but it's usually 2.
>
> I've been running bareos for several years without any problems and this
> is the first major one I've hit. I would love to know what changed and if
> there's anything that can be done to compensate for it. All my other fd
> hosts are running jobs just fine. I don't believe most of the rest of them
> are running bareos 23.0.3 releases. My next step is going to be to migrate
> my dir/sd host to debian 12, hoping that comparable ssl libs will help. But
> if there's anything else that can be done for a quicker fix, I'd appreciate
> some advice.
>
> Thanks.
> Seth
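
For reference, the numeric values in the SSL_get_error() trace lines quoted above correspond to OpenSSL's SSL_ERROR_* constants: 2 is SSL_ERROR_WANT_READ and 5 is SSL_ERROR_SYSCALL. As far as I understand, WANT_READ on its own normally just means the read should be retried, while SYSCALL points at an I/O error on the underlying socket, so the less frequent 5 may be the more interesting one. Below is a minimal sketch for tallying the values in a trace, assuming the trace lines have the format shown above; the function name `tally_ssl_errors` and the trace file name in the comment are hypothetical.

```python
import re
from collections import Counter

# Numeric values of OpenSSL's SSL_ERROR_* constants (openssl/ssl.h).
SSL_ERROR_NAMES = {
    0: "SSL_ERROR_NONE",
    1: "SSL_ERROR_SSL",
    2: "SSL_ERROR_WANT_READ",
    3: "SSL_ERROR_WANT_WRITE",
    4: "SSL_ERROR_WANT_X509_LOOKUP",
    5: "SSL_ERROR_SYSCALL",
    6: "SSL_ERROR_ZERO_RETURN",
    7: "SSL_ERROR_WANT_CONNECT",
    8: "SSL_ERROR_WANT_ACCEPT",
}

# Assumed trace format, taken from the line quoted in this thread:
#   lib/tls_openssl_private.cc:325-60268 SSL_get_error() returned error value 2
TRACE_RE = re.compile(r"SSL_get_error\(\) returned error value (\d+)")


def tally_ssl_errors(lines):
    """Count SSL_get_error() values in trace lines, keyed by constant name."""
    counts = Counter()
    for line in lines:
        match = TRACE_RE.search(line)
        if match:
            code = int(match.group(1))
            counts[SSL_ERROR_NAMES.get(code, f"unknown({code})")] += 1
    return counts


if __name__ == "__main__":
    # "files-fd.trace" is a placeholder; use the actual trace file path.
    with open("files-fd.trace", encoding="utf-8", errors="replace") as trace:
        for name, n in tally_ssl_errors(trace).most_common():
            print(n, name)
```

Comparing the tallies from the dir, sd, and fd traces for a failed job against those for a successful one might show whether the value 5 entries line up with the failures.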
