Re: [Bacula-users] Mysteriously failing jobs

2007-06-08 Thread Frank Sweetser
Arno Lehmann wrote:

> Or, alternatively, using tcpdump to find if the sequence numbers get out 
> of sync somewhere, which would cause a RST on both ends.

Okay, I got a tcpdump and a logfile from running the fd with -d1000.  I'm a
little rusty at debugging TCP issues by hand, but I couldn't find anything
that looked too out of the ordinary.

In the logfile, the only things that looked strange to me were these messages
(extra linebreaks added for readability):

ivanova-fd: backup.c:876 Send data to SD len=65536

ivanova-fd: message.c:606 Enter dispatch_msg type=4 msg=ivanova-fd: ERROR in
openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL
routines:SSL3_READ_BYTES:sslv3 alert bad record mac

ivanova-fd: message.c:768 DIRECTOR for following msg: ivanova-fd: ERROR in
openssl.c:74 TLS read/write failure.: ERR=error:140943FC:SSL
routines:SSL3_READ_BYTES:sslv3 alert bad record mac

ivanova-fd: heartbeat.c:90 Got BNET_SIG 0 from SD

ivanova-fd: heartbeat.c:95 wait_intr=1 stop=1

ivanova-fd: backup.c:876 Send data to SD len=65536
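
The error code in those messages can be unpacked by hand.  The following is a
minimal sketch, assuming the pre-3.0 OpenSSL packing of error codes (8-bit
library, 12-bit function, 12-bit reason); decode_openssl_error is a
hypothetical helper name, not part of Bacula or OpenSSL.  Under that
assumption, reason codes of 1000 and above correspond to a TLS alert received
from the peer, with the alert number added to 1000.

```python
# Assumption: OpenSSL pre-3.0 layout of a packed error code.
SSL_AD_REASON_OFFSET = 1000  # offset OpenSSL adds to a received alert number


def decode_openssl_error(code_hex):
    """Split a packed error code (hex string) into its numeric fields."""
    code = int(code_hex, 16)
    lib = (code >> 24) & 0xFF      # library number (20 is the SSL library)
    func = (code >> 12) & 0xFFF    # function number inside that library
    reason = code & 0xFFF          # reason code
    # A reason at or above the offset encodes an alert sent by the peer.
    if reason >= SSL_AD_REASON_OFFSET:
        peer_alert = reason - SSL_AD_REASON_OFFSET
    else:
        peer_alert = None
    return {"lib": lib, "func": func, "reason": reason, "peer_alert": peer_alert}


print(decode_openssl_error("140943FC"))
```

For 140943FC this gives library 20, function 148, reason 1020, so alert
number 20, which is bad_record_mac in the TLS alert registry.  If the packing
assumption holds, the alert was sent by the other end of the connection,
which would mean that end failed to verify a record it received from the fd.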

The tcpdump and log files are at http://erwin.wpi.edu/~fs/bacula-crash/ if
anyone wants to take a closer look and see if I've missed anything.  They're
about 14M total.

Anyone have any other ideas, or do I need to file a bug report on this one?

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Mysteriously failing jobs

2007-06-06 Thread Frank Sweetser
Arno Lehmann wrote:
> I'm not a good debugger user, but strace might be the next thing to 
> try... like capturing all socket operations, or something. Perhaps you 
> get to know if the error is caused by the OS on one end.

Knowing how verbose strace can be, I'm a little hesitant to jump right to that.

> Or, alternatively, using tcpdump to find if the sequence numbers get out 
> of sync somewhere, which would cause a RST on both ends.

I'll try getting a headers only tcpdump from both ends.  Hopefully that, along
with -d100 on the FD, will produce something insightful.
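
Once a capture exists, the RST packets can be located without reading the
whole dump by hand.  This is a rough sketch only: it assumes a classic pcap
file (not pcapng), Ethernet link type without VLAN tags, and IPv4.  find_rsts
is a hypothetical helper, not something shipped with tcpdump or Bacula.

```python
import struct

TCP_RST = 0x04  # RST bit in the TCP flags byte


def find_rsts(path):
    """Return (timestamp, "src:port", "dst:port") for every TCP RST found."""
    hits = []
    with open(path, "rb") as f:
        header = f.read(24)  # pcap global header
        if len(header) < 24:
            raise ValueError("not a pcap file")
        if header[:4] == b"\xd4\xc3\xb2\xa1":
            endian = "<"
        elif header[:4] == b"\xa1\xb2\xc3\xd4":
            endian = ">"
        else:
            raise ValueError("unsupported capture format")
        linktype = struct.unpack(endian + "I", header[20:24])[0]
        if linktype != 1:
            raise ValueError("only Ethernet captures are handled")
        while True:
            rec = f.read(16)  # per-packet record header
            if len(rec) < 16:
                break
            ts_sec, ts_usec, incl_len, _orig = struct.unpack(endian + "IIII", rec)
            pkt = f.read(incl_len)
            if len(pkt) < 14 + 20:
                continue  # too short for Ethernet + minimal IPv4 header
            if struct.unpack("!H", pkt[12:14])[0] != 0x0800:
                continue  # not IPv4
            ip = pkt[14:]
            ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
            if ip[9] != 6 or len(ip) < ihl + 14:
                continue  # not TCP, or flags byte not captured
            tcp = ip[ihl:]
            sport, dport = struct.unpack("!HH", tcp[:4])
            if tcp[13] & TCP_RST:
                src = "%s:%d" % (".".join(map(str, ip[12:16])), sport)
                dst = "%s:%d" % (".".join(map(str, ip[16:20])), dport)
                hits.append((ts_sec + ts_usec / 1e6, src, dst))
    return hits
```

Running it on the capture from each end and comparing timestamps should show
which side sent the first RST, assuming the clocks on both machines are
reasonably in sync.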

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC



Re: [Bacula-users] Mysteriously failing jobs

2007-06-06 Thread Arno Lehmann
Hello,

On 6/6/2007 3:38 PM, Frank Sweetser wrote:
> Well, I had a failure last night while I was monitoring memory usage.  I had a
> script snagging the output of  ps -o rss for both bacula-sd and bacula-dir
> every 60 seconds.  Based on that, memory usage for both jumped only by a few
> megs when the jobs started.  The dir was around 20M, and the sd around 13M.

Quite sane numbers. Well, that quite certainly rules out the idea of 
memory consumption causing your problems.

> I'll try to see if I can capture a failure with debug options at least on the
> FD cranked up...

I'm not a good debugger user, but strace might be the next thing to 
try... like capturing all socket operations, or something. Perhaps you 
get to know if the error is caused by the OS on one end.

Or, alternatively, using tcpdump to find if the sequence numbers get out 
of sync somewhere, which would cause a RST on both ends.

Arno
-- 
IT-Service Lehmann[EMAIL PROTECTED]
Arno Lehmann  http://www.its-lehmann.de



Re: [Bacula-users] Mysteriously failing jobs

2007-06-06 Thread Frank Sweetser

Well, I had a failure last night while I was monitoring memory usage.  I had a
script snagging the output of  ps -o rss for both bacula-sd and bacula-dir
every 60 seconds.  Based on that, memory usage for both jumped only by a few
megs when the jobs started.  The dir was around 20M, and the sd around 13M.
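
For anyone wanting to repeat this kind of monitoring, a minimal sketch of
such a sampler follows.  It assumes the Linux procps version of ps (for the
-C option), Python 3.7 or later, and a hypothetical log file name; it is not
the script actually used here.

```python
import subprocess
import time


def parse_rss(ps_output):
    """Sum the RSS values (KiB) in `ps -o rss=` output, one number per line."""
    return sum(int(value) for value in ps_output.split())


def sample_rss(name):
    """Total RSS in KiB of all processes named `name`, or None if none run."""
    result = subprocess.run(
        ["ps", "-C", name, "-o", "rss="],  # "rss=" suppresses the header line
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return parse_rss(result.stdout)


def monitor(names=("bacula-sd", "bacula-dir"), interval=60,
            logfile="bacula-rss.log"):
    """Append one timestamped line of RSS readings every `interval` seconds."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        fields = ["%s=%s" % (name, sample_rss(name)) for name in names]
        with open(logfile, "a") as out:
            out.write("%s %s\n" % (stamp, " ".join(fields)))
        time.sleep(interval)
```

Calling monitor() before the nightly jobs start and stopping it afterwards
leaves one line per minute, which is easy to line up against the job log.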

I'll try to see if I can capture a failure with debug options at least on the
FD cranked up...

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC



Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Frank Sweetser
Arno Lehmann wrote:
> If you need a minimal Nagios plugin - I wrote some shell script for that 
> purpose once :-)

Oddly enough, nothing actually crashes - a handful of jobs fail, but all
subsequent ones go through just fine.

>>> A workaround would be to not start all your jobs at once but run them 
>>> in batches. Lowering job concurrency will not work as a job waiting for 
>>> an available slot to run will also use memory.
>>>
>>> Also, you could try upgrading to the current development version as I 
>>> believe Kern worked on that problem. You should check the change log.
>> I think I might at least wait until Kern releases an official beta before
>> trying that one out =)
> 
> 2.1.10 IS kind of a released beta version :-) but the next one is due 
> soon...

I'll definitely give that a try if no other solutions pop up...

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC



Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Arno Lehmann
Hi,

On 6/4/2007 6:17 PM, Frank Sweetser wrote:
> Arno Lehmann wrote:
>> Well, this one looks difficult.
> 
> At least it's not just me, then =)
> 
>> I suggest monitoring the memory usage of your server. I experienced 
>> problems with (usually) the DIR or (seldom) the SD using up all 
>> available memory, which might affect the kernel so that it can't 
>> allocate memory for the network stuff.
> 
> That would explain why nothing visibly changed.  One or two jobs simply pushed
> some internal resource over the magic threshold, and triggered the memory
> consumption.

Quite possible, in my experience.

>> You should have something in the system log files then, I suppose.
> 
> I didn't find anything that appeared related in the log files.  I have a quick
> and dirty system in place to monitor the memory usage of the dir and sd that
> I'll run through tonight's jobs, so we'll see how that looks.

If you need a minimal Nagios plugin - I wrote some shell script for that 
purpose once :-)

>> A workaround would be to not start all your jobs at once but run them 
>> in batches. Lowering job concurrency will not work as a job waiting for 
>> an available slot to run will also use memory.
>>
>> Also, you could try upgrading to the current development version as I 
>> believe Kern worked on that problem. You should check the change log.
> 
> I think I might at least wait until Kern releases an official beta before
> trying that one out =)

2.1.10 IS kind of a released beta version :-) but the next one is due 
soon...

Arno


-- 
IT-Service Lehmann[EMAIL PROTECTED]
Arno Lehmann  http://www.its-lehmann.de



Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Frank Sweetser
Arno Lehmann wrote:
> Well, this one looks difficult.

At least it's not just me, then =)

> I suggest monitoring the memory usage of your server. I experienced 
> problems with (usually) the DIR or (seldom) the SD using up all 
> available memory, which might affect the kernel so that it can't 
> allocate memory for the network stuff.

That would explain why nothing visibly changed.  One or two jobs simply pushed
some internal resource over the magic threshold, and triggered the memory
consumption.

> You should have something in the system log files then, I suppose.

I didn't find anything that appeared related in the log files.  I have a quick
and dirty system in place to monitor the memory usage of the dir and sd that
I'll run through tonight's jobs, so we'll see how that looks.

> A workaround would be to not start all your jobs at once but run them 
> in batches. Lowering job concurrency will not work as a job waiting for 
> an available slot to run will also use memory.
> 
> Also, you could try upgrading to the current development version as I 
> believe Kern worked on that problem. You should check the change log.

I think I might at least wait until Kern releases an official beta before
trying that one out =)

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC



Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Arno Lehmann
Hi,

On 6/2/2007 7:43 AM, Frank Sweetser wrote:
> A couple of weeks ago, a problem started cropping up.  Jobs started failing
> with what look like network errors:
> 
> 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
> append.c:259 Network error on data channel. ERR=Input/output error
> 02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate =
> 4.157 M bytes/second
> 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read
> expected 65536 got 16384 from client:130.215.39.18:36643
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network
> error with FD during Backup: ERR=No data available
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job
> status returned from FD.
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3
> (06Mar07): 02-Jun-2007 01:10:40
> 
> 
> However, I can find no evidence of any actual network problem between the
> machine running the fd and the one running both the sd and dir:
> 
>  - The network monitoring system shows no outages, and none of the switches
> and routers in between show anything out of the ordinary in the logs.
> 
>  - There is no external firewall between the two systems.  Both ends are linux
> 2.6 with iptables, with non-stateful rules for all bacula traffic.
> 
>  - IP flow logs show that both ends of the FD -> SD TCP connection
> ungracefully closed down the stream with a RST after a very short idle period
> of about 10 seconds.
> 
>  - I've already tried swapping to a different NIC on the server to rule out a
> dying network card.
> 
>  - The failure occurs on different machines, ruling out something specific to
> one client, though it usually appears to affect the same one.  More
> specifically, it always seems to die around the same time - about ten minutes
> after the batch of nightly jobs starts.  I have things configured to run four
> concurrent jobs, and the failures will cancel anywhere from one to four jobs.
>  When multiple jobs die, they all do so at the same time.  I can influence
> which clients get picked on by shuffling around priorities.

Well, this one looks difficult.

I suggest monitoring the memory usage of your server. I experienced 
problems with (usually) the DIR or (seldom) the SD using up all 
available memory, which might affect the kernel so that it can't 
allocate memory for the network stuff.

You should have something in the system log files then, I suppose.

A workaround would be to not start all your jobs at once but run them 
in batches. Lowering job concurrency will not work as a job waiting for 
an available slot to run will also use memory.

Also, you could try upgrading to the current development version as I 
believe Kern worked on that problem. You should check the change log.

Hope you get this fixed,

Arno

>  - Running the failed job - either by itself or queued up with a bunch of
> other ones - always appears to work as expected.
> 
> The part *really* driving me bonkers is that I can find no evidence of any
> changes that coincide with the problem starting.  Bacula version, kernel
> version, hardware, network - nothing was changed.
> 
> If anyone has any suggestions where I could start looking, I'd love to hear 
> them.
> 

-- 
IT-Service Lehmann[EMAIL PROTECTED]
Arno Lehmann  http://www.its-lehmann.de



[Bacula-users] Mysteriously failing jobs

2007-06-01 Thread Frank Sweetser

A couple of weeks ago, a problem started cropping up.  Jobs started failing
with what look like network errors:

02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
append.c:259 Network error on data channel. ERR=Input/output error
02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate =
4.157 M bytes/second
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read
expected 65536 got 16384 from client:130.215.39.18:36643
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network
error with FD during Backup: ERR=No data available
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job
status returned from FD.
02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3
(06Mar07): 02-Jun-2007 01:10:40
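
To check whether several jobs really fail in the same minute, the error lines
can be grouped by timestamp.  This is a small sketch, assuming the messages
are fed in unwrapped (one message per line, unlike the mail-wrapped excerpt
above) and follow the layout shown there; failures_by_minute is a
hypothetical helper name.

```python
import re
from collections import defaultdict

# Layout assumed from the excerpt: "DD-Mon HH:MM daemon: job Level: text"
LINE_RE = re.compile(
    r"^(?P<when>\d{2}-\w{3} \d{2}:\d{2}) "
    r"(?P<daemon>[\w.-]+): "
    r"(?P<job>\S+) "
    r"(?P<level>Fatal error|Error): "
    r"(?P<text>.*)$"
)


def failures_by_minute(lines):
    """Map 'DD-Mon HH:MM' to the set of jobs that logged an error then."""
    grouped = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            grouped[match.group("when")].add(match.group("job"))
    return dict(grouped)
```

Lines without a job name and error level, such as the "Job write elapsed
time" message, do not match the pattern and are skipped.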


However, I can find no evidence of any actual network problem between the
machine running the fd and the one running both the sd and dir:

 - The network monitoring system shows no outages, and none of the switches
and routers in between show anything out of the ordinary in the logs.

 - There is no external firewall between the two systems.  Both ends are linux
2.6 with iptables, with non-stateful rules for all bacula traffic.

 - IP flow logs show that both ends of the FD -> SD TCP connection
ungracefully closed down the stream with a RST after a very short idle period
of about 10 seconds.

 - I've already tried swapping to a different NIC on the server to rule out a
dying network card.

 - The failure occurs on different machines, ruling out something specific to
one client, though it usually appears to affect the same one.  More
specifically, it always seems to die around the same time - about ten minutes
after the batch of nightly jobs starts.  I have things configured to run four
concurrent jobs, and the failures will cancel anywhere from one to four jobs.
 When multiple jobs die, they all do so at the same time.  I can influence
which clients get picked on by shuffling around priorities.

 - Running the failed job - either by itself or queued up with a bunch of
other ones - always appears to work as expected.

The part *really* driving me bonkers is that I can find no evidence of any
changes that coincide with the problem starting.  Bacula version, kernel
version, hardware, network - nothing was changed.

If anyone has any suggestions where I could start looking, I'd love to hear 
them.

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC
