Re: [Bacula-users] Mysteriously failing jobs
Arno Lehmann wrote:
> Or, alternatively, using tcpdump to find if the sequence numbers get out
> of sync somewhere, which would cause a RST on both ends.

Okay, I got a tcpdump and a logfile of -d1000 on the fd. I'm a little rusty debugging TCP issues by hand, but I couldn't find anything that looked too out of the ordinary. In the logfile, the only thing that looked strange to me was these messages (extra linebreaks added for readability):

ivanova-fd: backup.c:876 Send data to SD len=65536
ivanova-fd: message.c:606 Enter dispatch_msg type=4 msg=ivanova-fd: ERROR in openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
ivanova-fd: message.c:768 DIRECTOR for following msg: ivanova-fd: ERROR in openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL routines:SSL3_READ_BYTES:sslv3 alert bad record mac
ivanova-fd: heartbeat.c:90 Got BNET_SIG 0 from SD
ivanova-fd: heartbeat.c:95 wait_intr=1 stop=1
ivanova-fd: backup.c:876 Send data to SD len=65536

The tcpdump and log files are at http://erwin.wpi.edu/~fs/bacula-crash/ if anyone wants to take a closer look and see if I've missed anything. They're about 14M total.

Anyone have any other ideas, or do I need to file a bug report on this one?

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
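The hex code in the OpenSSL error string quoted in this message can be unpacked to see what it is reporting. The following is only a sketch in Python, under two assumptions that are not stated anywhere in the thread: that the OpenSSL in use packs error codes the pre-3.0 way (8 bits library, 12 bits function, 12 bits reason), and that reason values of 1000 and above mean "TLS alert received from the peer", with the alert number added to 1000. The function names are made up for illustration and are not part of Bacula or OpenSSL.

```python
import re

# Assumption: pre-3.0 OpenSSL packs an error code as
# (library << 24) | (function << 12) | reason.
ERR_RE = re.compile(r"error:([0-9A-Fa-f]{8}):")

# Assumption: reasons >= 1000 are alerts received from the peer,
# stored as 1000 + TLS alert number.
ALERT_REASON_OFFSET = 1000


def decode_openssl_error(code):
    """Split a packed error code into (library, function, reason)."""
    return (code >> 24) & 0xFF, (code >> 12) & 0xFFF, code & 0xFFF


def peer_alert_number(log_line):
    """Return the TLS alert number if the line reports a received alert, else None."""
    match = ERR_RE.search(log_line)
    if match is None:
        return None
    _, _, reason = decode_openssl_error(int(match.group(1), 16))
    if reason >= ALERT_REASON_OFFSET:
        return reason - ALERT_REASON_OFFSET
    return None


if __name__ == "__main__":
    line = ("ivanova-fd: ERROR in openssl.c:74 TLS read/write failure.: "
            "ERR=error:140943FC:SSL routines:SSL3_READ_BYTES:"
            "sslv3 alert bad record mac")
    print(decode_openssl_error(0x140943FC))
    print(peer_alert_number(line))
```

For 140943FC this yields library 20, function 148, reason 1020, which under the stated assumptions is alert 20 (bad_record_mac). If that reading is right, the FD is logging an alert it received, so it may be the SD side that saw a record fail its integrity check.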
Re: [Bacula-users] Mysteriously failing jobs
Arno Lehmann wrote:
> I'm not a good debugger user, but strace might be the next thing to
> try... like capturing all socket operations, or something. Perhaps you
> get to know if the error is caused by the OS on one end.

Knowing how verbose strace can be, I'm a little hesitant to jump right to that.

> Or, alternatively, using tcpdump to find if the sequence numbers get out
> of sync somewhere, which would cause a RST on both ends.

I'll try getting a headers-only tcpdump from both ends. Hopefully that, along with -d100 on the FD, will produce something insightful.

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
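For a headers-only capture like the one proposed here, one way to find where the resets start is to filter tcpdump's text output for segments with the RST flag set. This is a rough sketch only. It assumes the line layout printed by recent tcpdump versions ("IP src > dst: Flags [R.], ..."); older releases print the flags differently, so the regular expression would need adjusting. The name find_resets is a hypothetical helper, not an existing tool.

```python
import re

# Assumption: tcpdump text lines look like
#   <time> IP <src>.<port> > <dst>.<port>: Flags [<flags>], ...
FLAGS_RE = re.compile(
    r"IP6? (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]*)\]"
)


def find_resets(lines):
    """Return (line_number, src, dst) for every segment carrying the RST flag."""
    resets = []
    for number, line in enumerate(lines, start=1):
        match = FLAGS_RE.search(line)
        if match and "R" in match.group("flags"):
            resets.append((number, match.group("src"), match.group("dst")))
    return resets
```

Feeding it the lines of a capture read back with tcpdump -n -r on each end would show which side sent the first RST and what preceded it.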
Re: [Bacula-users] Mysteriously failing jobs
Hello,

On 6/6/2007 3:38 PM, Frank Sweetser wrote:
> Well, I had a failure last night while I was monitoring memory usage. I had a
> script snagging the output of ps -o rss for both bacula-sd and bacula-dir
> every 60 seconds. Based on that, memory usage for both jumped only by a few
> megs when the jobs started. The dir was around 20M, and the sd around 13M.

Quite sane numbers. Well, that quite certainly rules out the idea of memory consumption causing your problems.

> I'll try to see if I can capture a failure with debug options at least on the
> FD cranked up...

I'm not a good debugger user, but strace might be the next thing to try... like capturing all socket operations, or something. Perhaps you get to know if the error is caused by the OS on one end.

Or, alternatively, using tcpdump to find if the sequence numbers get out of sync somewhere, which would cause a RST on both ends.

Arno

--
IT-Service Lehmann [EMAIL PROTECTED]
Arno Lehmann http://www.its-lehmann.de
Re: [Bacula-users] Mysteriously failing jobs
Well, I had a failure last night while I was monitoring memory usage. I had a script snagging the output of ps -o rss for both bacula-sd and bacula-dir every 60 seconds. Based on that, memory usage for both jumped only by a few megs when the jobs started. The dir was around 20M, and the sd around 13M.

I'll try to see if I can capture a failure with debug options at least on the FD cranked up...

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
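The monitoring script itself was not posted. A minimal sketch of the kind of sampling described in this message might look like the following in Python. It assumes a Linux host with the procps version of ps (for the -C option), and the function names are illustrative only.

```python
import subprocess
import time


def parse_rss_kib(ps_output):
    """Sum the RSS values (KiB, one per line) printed by `ps -o rss=`."""
    return sum(int(token) for token in ps_output.split())


def sample_rss_kib(command_name):
    """Total RSS of all processes with this command name (procps `ps -C`)."""
    result = subprocess.run(
        ["ps", "-o", "rss=", "-C", command_name],
        capture_output=True, text=True, check=False,
    )
    # ps prints nothing when no process matches, which sums to 0.
    return parse_rss_kib(result.stdout)


def monitor(names=("bacula-sd", "bacula-dir"), interval=60, samples=None):
    """Print one timestamped line per interval; runs forever if samples is None."""
    taken = 0
    while samples is None or taken < samples:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        readings = " ".join(f"{name}={sample_rss_kib(name)}KiB" for name in names)
        print(f"{stamp} {readings}", flush=True)
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval)
```

Calling monitor() with the defaults samples both daemons every 60 seconds until interrupted; redirecting the output to a file gives a record that can be lined up against the job failure times.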
Re: [Bacula-users] Mysteriously failing jobs
Arno Lehmann wrote:
> If you need a minimal Nagios plugin - I wrote some shell script for that
> purpose once :-)

Oddly enough, nothing actually crashes - a handful of jobs fail, but all subsequent ones go through just fine.

>>> A workaround would be to not start all your jobs at once but run them
>>> in batches. Lowering job concurrency will not work as a job waiting for
>>> an available slot to run will also use memory.
>>>
>>> Also, you could try upgrading to the current development version as I
>>> believe Kern worked on that problem. You should check the change log.
>> I think I might at least wait until Kern releases an official beta before
>> trying that one out =)
>
> 2.1.10 IS kind of a released beta version :-) but the next one is due
> soon...

I'll definitely give that a try if no other solutions pop up...

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
Re: [Bacula-users] Mysteriously failing jobs
Hi,

On 6/4/2007 6:17 PM, Frank Sweetser wrote:
> Arno Lehmann wrote:
>> Well, this one looks difficult.
>
> At least it's not just me, then =)
>
>> I suggest monitoring the memory usage of your server. I experienced
>> problems with (usually) the DIR or (seldom) the SD using up all
>> available memory. Which might affect the kernel so that it can't
>> allocate memory for the network stuff.
>
> That would explain why nothing visibly changed. One or two jobs simply pushed
> some internal resource over the magic threshold, and triggered the memory
> consumption.

Quite possible, in my experience.

>> You should have something in the system's log files then, I suppose.
>
> I didn't find anything that appeared related in the log files. I have a quick
> and dirty system in place to monitor the memory usage of the dir and sd that
> I'll run through tonight's jobs, so we'll see how that looks.

If you need a minimal Nagios plugin - I wrote some shell script for that purpose once :-)

>> A workaround would be to not start all your jobs at once but run them
>> in batches. Lowering job concurrency will not work as a job waiting for
>> an available slot to run will also use memory.
>>
>> Also, you could try upgrading to the current development version as I
>> believe Kern worked on that problem. You should check the change log.
>
> I think I might at least wait until Kern releases an official beta before
> trying that one out =)

2.1.10 IS kind of a released beta version :-) but the next one is due soon...

Arno

--
IT-Service Lehmann [EMAIL PROTECTED]
Arno Lehmann http://www.its-lehmann.de
Re: [Bacula-users] Mysteriously failing jobs
Arno Lehmann wrote:
> Well, this one looks difficult.

At least it's not just me, then =)

> I suggest monitoring the memory usage of your server. I experienced
> problems with (usually) the DIR or (seldom) the SD using up all
> available memory. Which might affect the kernel so that it can't
> allocate memory for the network stuff.

That would explain why nothing visibly changed. One or two jobs simply pushed some internal resource over the magic threshold, and triggered the memory consumption.

> You should have something in the system's log files then, I suppose.

I didn't find anything that appeared related in the log files. I have a quick and dirty system in place to monitor the memory usage of the dir and sd that I'll run through tonight's jobs, so we'll see how that looks.

> A workaround would be to not start all your jobs at once but run them
> in batches. Lowering job concurrency will not work as a job waiting for
> an available slot to run will also use memory.
>
> Also, you could try upgrading to the current development version as I
> believe Kern worked on that problem. You should check the change log.

I think I might at least wait until Kern releases an official beta before trying that one out =)

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
Re: [Bacula-users] Mysteriously failing jobs
Hi,

On 6/2/2007 7:43 AM, Frank Sweetser wrote:
> A couple of weeks ago, a problem started cropping up. Jobs started failing
> with what look like network errors:
>
> 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
> append.c:259 Network error on data channel. ERR=Input/output error
> 02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate =
> 4.157 M bytes/second
> 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read
> expected 65536 got 16384 from client:130.215.39.18:36643
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network
> error with FD during Backup: ERR=No data available
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job
> status returned from FD.
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3
> (06Mar07): 02-Jun-2007 01:10:40
>
>
> However, I can find no evidence of any actual network problem between the
> machine running the fd and the one running both the sd and dir:
>
> - The network monitoring system shows no outages, and none of the switches
> and routers in between show anything out of the ordinary in the logs.
>
> - There is no external firewall between the two systems. Both ends are Linux
> 2.6 with iptables, with non-stateful rules for all Bacula traffic.
>
> - IP flow logs show that both ends of the FD -> SD TCP connection
> ungracefully closed down the stream with a RST after a very short idle period
> of about 10 seconds.
>
> - I've already tried swapping to a different NIC on the server to rule out a
> dying network card.
>
> - The failure occurs on different machines, ruling out something specific to
> one client, though it usually appears to affect the same one. More
> specifically, it always seems to die around the same time - about ten minutes
> after the batch of nightly jobs starts. I have things configured to run four
> concurrent jobs, and the failures will cancel anywhere from one to four jobs.
> When multiple jobs die, they all do so at the same time. I can influence
> which clients get picked on by shuffling around priorities.

Well, this one looks difficult.

I suggest monitoring the memory usage of your server. I experienced problems with (usually) the DIR or (seldom) the SD using up all available memory. Which might affect the kernel so that it can't allocate memory for the network stuff.

You should have something in the system's log files then, I suppose.

A workaround would be to not start all your jobs at once but run them in batches. Lowering job concurrency will not work as a job waiting for an available slot to run will also use memory.

Also, you could try upgrading to the current development version as I believe Kern worked on that problem. You should check the change log.

Hope you get this fixed,

Arno

> - Running the failed job - either by itself or queued up with a bunch of
> other ones - always appears to work as expected.
>
> The part *really* driving me bonkers is that I can find no evidence of any
> changes that coincide with the problem starting. Bacula version, kernel
> version, hardware, network - nothing was changed.
>
> If anyone has any suggestions where I could start looking, I'd love to hear
> them.
>

--
IT-Service Lehmann [EMAIL PROTECTED]
Arno Lehmann http://www.its-lehmann.de
[Bacula-users] Mysteriously failing jobs
A couple of weeks ago, a problem started cropping up. Jobs started failing with what look like network errors:

02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: append.c:259 Network error on data channel. ERR=Input/output error
02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate = 4.157 M bytes/second
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read expected 65536 got 16384 from client:130.215.39.18:36643
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network error with FD during Backup: ERR=No data available
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job status returned from FD.
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3 (06Mar07): 02-Jun-2007 01:10:40

However, I can find no evidence of any actual network problem between the machine running the fd and the one running both the sd and dir:

- The network monitoring system shows no outages, and none of the switches and routers in between show anything out of the ordinary in the logs.

- There is no external firewall between the two systems. Both ends are Linux 2.6 with iptables, with non-stateful rules for all Bacula traffic.

- IP flow logs show that both ends of the FD -> SD TCP connection ungracefully closed down the stream with a RST after a very short idle period of about 10 seconds.

- I've already tried swapping to a different NIC on the server to rule out a dying network card.

- The failure occurs on different machines, ruling out something specific to one client, though it usually appears to affect the same one. More specifically, it always seems to die around the same time - about ten minutes after the batch of nightly jobs starts. I have things configured to run four concurrent jobs, and the failures will cancel anywhere from one to four jobs. When multiple jobs die, they all do so at the same time. I can influence which clients get picked on by shuffling around priorities.

- Running the failed job - either by itself or queued up with a bunch of other ones - always appears to work as expected.

The part *really* driving me bonkers is that I can find no evidence of any changes that coincide with the problem starting. Bacula version, kernel version, hardware, network - nothing was changed.

If anyone has any suggestions where I could start looking, I'd love to hear them.

--
Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
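To check the observation that multiple jobs die at the same time, the job log can be grouped by timestamp. What follows is a sketch in Python, assuming every message line has the same shape as the excerpt in this post (date, time, daemon name, job name, then "Fatal error:" or "Error:"). The helper names are hypothetical and the pattern may need loosening for other Bacula message types.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt in this post:
#   02-Jun 01:10 lorien-sd: <job name> Fatal error: <text>
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}-[A-Za-z]{3} \d{2}:\d{2}) "
    r"(?P<daemon>\S+): "
    r"(?P<job>\S+) "
    r"(?P<level>Fatal error|Error): "
    r"(?P<text>.*)$"
)
SHORT_READ_RE = re.compile(r"Read expected (\d+) got (\d+)")


def failures_by_minute(lines):
    """Map each timestamp to the set of job names that logged an error then."""
    grouped = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            grouped[match.group("stamp")].add(match.group("job"))
    return dict(grouped)


def short_reads(lines):
    """Collect (expected, got) byte counts from the SD's short-read messages."""
    found = []
    for line in lines:
        match = SHORT_READ_RE.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found
```

A timestamp that maps to several job names is a candidate for a shared cause on the server side, and the short-read sizes show how much of each 65536-byte block arrived before the connection dropped.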