Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??

2010-04-15 Thread Stephen Thompson


Additionally, seems like the SD was possibly reading a new 
freshly-labeled tape when it crashed...  Last items in bacula log 
besides alerts already mentioned:


15-Apr 09:31 server-sd JobId 10: Writing spooled data to Volume. 
Despooling 35,000,185,219 bytes ...
15-Apr 09:51 server-sd JobId 10: End of Volume FB0568 at 888:1414 
on device SL500-Drive-1 (/dev/nst0). Write of 262144 bytes got -1.
15-Apr 09:51 server-sd JobId 10: Re-read of last block succeeded.
15-Apr 09:51 server-sd JobId 10: End of medium on Volume FB0568 
Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
15-Apr 09:51 server-sd JobId 10: 3307 Issuing autochanger unload 
slot 38, drive 1 command.
15-Apr 09:52 server-sd JobId 10: 3301 Issuing autochanger loaded? 
drive 1 command.
15-Apr 09:52 server-sd JobId 10: 3302 Autochanger loaded? drive 1, 
result: nothing loaded.
15-Apr 09:52 server-sd JobId 10: 3304 Issuing autochanger load slot 
39, drive 1 command.
15-Apr 09:52 server-sd JobId 10: 3305 Autochanger load slot 39, 
drive 1, status is OK.
15-Apr 09:52 server-sd JobId 10: Volume FB0569 previously written, 
moving to end of data.

Nothing but thousands of 'repetitive' alerts after that...

thanks again,
Stephen



On 04/15/2010 10:25 AM, Stephen Thompson wrote:

 Hello,

 I have just now experienced a possible new bug with bacula 5.0.1.

 The symptoms are this:

 bacula-sd crashes
 bacula-dir continues to run
 bacula-dir then spews out identical Intervention needed emails until
 manually restarted

 The first time this happened over a weekend and upon returning I found
 my inbox has about 120,000 bacula emails, all the SAME and of this type:

 15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network
 send error to SD. ERR=Broken pipe

 It happened again just now (second time since upgrading from 3.0.3 to
 5.0.1) and I managed to stop the director with only a few thousand
 emails going out.

 So there are really 2 issues here:

 1)
 Why does the director apparently get stuck in an infinite loop of
 sending the same email message?  Is this a known bug?

 2)
 Regarding the SD, I received one alert of this type, the rest like the
 above:

 15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT:
 dev->blocked()

 A traceback like:
 --
 ptrace: Operation not permitted.
 /var/bacula/work/29091: No such file or directory.
 $1 = 0
 /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command file:
 No symbol exename in current context.
 --

 And a bactrace like:
 --
 Attempt to dump current JCRs
 JCR=0x19a24888 JobId=10 name=client_1.2010-04-14_18.02.33_41 JobStatus=l
   use_count=1
   JobType=B JobLevel=F
   sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
   end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
   db=(nil) db_batch=(nil) batch_started=0
 JCR=0x1981b248 JobId=11 name=client_10.2010-04-14_20.00.15_04
 JobStatus=R
   use_count=1
   JobType=B JobLevel=I
   sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
   end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
   db=(nil) db_batch=(nil) batch_started=0
 Attempt to dump plugins. Hook count=0
 --
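
A side note on the JCR dump above: the end_time and wait_time values of 31-Dec-1969 16:00 look like unset (zero) timestamps rather than real times. Zero seconds since the Unix epoch, rendered in a UTC-8 zone, prints as exactly that date. The sketch below assumes the dump was formatted in a fixed UTC-8 zone (Pacific standard time); that offset and the helper name are assumptions, not something stated in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server rendered times in UTC-8 (Pacific standard time).
PACIFIC_STANDARD = timezone(timedelta(hours=-8))


def format_like_dump(epoch_seconds: int) -> str:
    """Render a Unix timestamp in the day-month-year style seen in the dump."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=PACIFIC_STANDARD)
    return moment.strftime("%d-%b-%Y %H:%M")


# A zero timestamp reproduces the value shown for end_time and wait_time.
print(format_like_dump(0))
```

If that reading is right, both jobs were simply still in progress when the dump was taken, so those two fields carry no diagnostic information.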

 Both clients and server seem healthy, except for the SD crash.
 Any ideas?


 thanks!
 Stephen


 -
 Further info:

 My catalog...

   mysql-5.0.77 (64bit) MyISAM
   210GB in size
   1,412,297,215 records in File table
   note: database built with bacula 2x scripts,
   upgraded with 3x scripts, then again with 5x scripts
   (i.e. nothing customized along the way)

 My OS & hardware for bacula DIR+SD server...

   Centos 5.4 (fully patched)
   8GB RAM
   2GB Swap
   1TB EXT3 filesystem on external fiber RAID5 array
   (dedicated to database, incl. temp files)
   2 dual-core [AMD Opteron(tm) Processor 2220] CPUs
   StorageTek SL500 Library with 2 LTO3 Drives





 ___
 Bacula-devel mailing list
 bacula-de...@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/bacula-devel


-- 
Stephen Thompson               Berkeley Seismological Laboratory
step...@seismo.berkeley.edu    215 McCone Hall # 4760
404.538.7077 (phone)           University of California, Berkeley
510.643.5811 (fax)             Berkeley, CA 94720-4760


Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??

2010-04-15 Thread Kern Sibbald
On Thursday 15 April 2010 19:25:34 Stephen Thompson wrote:
 Hello,

 I have just now experienced a possible new bug with bacula 5.0.1.

 The symptoms are this:

 bacula-sd crashes
 bacula-dir continues to run
 bacula-dir then spews out identical Intervention needed emails until
 manually restarted

 The first time this happened over a weekend and upon returning I found
 my inbox has about 120,000 bacula emails, all the SAME and of this type:

 15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network
 send error to SD. ERR=Broken pipe

 It happened again just now (second time since upgrading from 3.0.3 to
 5.0.1) and I managed to stop the director with only a few thousand
 emails going out.

 So there are really 2 issues here:

 1)
 Why does the director apparently get stuck in an infinite loop of
 sending the same email message? 

I have no idea.

 Is this a known bug? 

No, I have never heard or seen this kind of problem before.


 2)
 Regarding the SD, I received one alert of this type, 

What is an alert?  Do you mean an email or a Bacula message?  If so, which 
one?


 the rest like the above:

   15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT:
 dev->blocked()

It is not very clear what you are saying.  Do you mean that you are receiving 
the above message many times?


 A traceback like:
 --
 ptrace: Operation not permitted.
 /var/bacula/work/29091: No such file or directory.
 $1 = 0
 /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command file:
 No symbol exename in current context.
 --

 And a bactrace like:
 --
 Attempt to dump current JCRs
 JCR=0x19a24888 JobId=10 name=client_1.2010-04-14_18.02.33_41
 JobStatus=l use_count=1
  JobType=B JobLevel=F
  sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
  end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
  db=(nil) db_batch=(nil) batch_started=0
 JCR=0x1981b248 JobId=11 name=client_10.2010-04-14_20.00.15_04
 JobStatus=R
  use_count=1
  JobType=B JobLevel=I
  sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
  end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
  db=(nil) db_batch=(nil) batch_started=0
 Attempt to dump plugins. Hook count=0
 --

 Both clients and server seem healthy, except for the SD crash.
 Any ideas?

No.  To understand the problem we will need a traceback that you will probably 
need to produce manually as described in the Kaboom chapter, or you will need 
to fix the automatic traceback scripts so that they can do a ptrace.

Kern



___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??

2010-04-15 Thread Kern Sibbald
On Thursday 15 April 2010 19:36:51 Stephen Thompson wrote:
 Additionally, seems like the SD was possibly reading a new
 freshly-labeled tape when it crashed...  Last items in bacula log
 besides alerts already mentioned:

In Bacula, alerts refer to tape drive information stored concerning tape 
problems, so I am assuming you mean messages.



 15-Apr 09:31 server-sd JobId 10: Writing spooled data to Volume.
 Despooling 35,000,185,219 bytes ...
 15-Apr 09:51 server-sd JobId 10: End of Volume FB0568 at 888:1414
 on device SL500-Drive-1 (/dev/nst0). Write of 262144 bytes got -1.
 15-Apr 09:51 server-sd JobId 10: Re-read of last block succeeded.
 15-Apr 09:51 server-sd JobId 10: End of medium on Volume FB0568
 Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
 15-Apr 09:51 server-sd JobId 10: 3307 Issuing autochanger unload
 slot 38, drive 1 command.
 15-Apr 09:52 server-sd JobId 10: 3301 Issuing autochanger loaded?
 drive 1 command.
 15-Apr 09:52 server-sd JobId 10: 3302 Autochanger loaded? drive 1,
 result: nothing loaded.
 15-Apr 09:52 server-sd JobId 10: 3304 Issuing autochanger load slot
 39, drive 1 command.
 15-Apr 09:52 server-sd JobId 10: 3305 Autochanger load slot 39,
 drive 1, status is OK.
 15-Apr 09:52 server-sd JobId 10: Volume FB0569 previously written,
 moving to end of data.

 Nothing but thousands of 'repetitive' alerts after that...

What exactly is repeated?

There was a Bacula bug (#1480) in message delivery that may be the same one 
you are experiencing. It was triggered by a misconfigured SMTP server or by a 
reference in Bacula to a non-existent SMTP server, and the simple solution 
is to make sure Bacula points to a valid, functional SMTP server.  This 
problem was not particular to version 5.0.1, but I think it was fixed after 
the release of 5.0.1.  Please see the bugs database for more details.

Kern



Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??

2010-04-15 Thread Stephen Thompson

Hello,

Thanks for the response.

No, it's nothing to do with mail configuration; 100% sure of that.
(I know people say that all the time, but, seriously, it's the director).

And by alerts, I do mean Messages in the bacula vernacular.

The first time this crash happened, we received 120,000 Messages in the 
form of emails to our administrative account.  The messages were 
identical both to each other and to the content of the $JOB.mail file in 
our bacula working directory (which is never removed automatically after 
one of these crashes - perhaps that causes the endless cycle).  The same 
Message also appears to be written to our bacula log file each time an 
email is generated (or vice versa).
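
To answer "what exactly is repeated" with numbers, the log can be tallied line by line. This is only a minimal sketch: the function name is made up for illustration, the sample lines are copied from the messages quoted in this thread, and it assumes each Message occupies a single line in the log file and repeats byte-for-byte.

```python
from collections import Counter


def most_repeated(lines, top=3):
    """Return the `top` most frequent non-empty lines with their counts."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(top)


# Sample input taken from this thread; in practice, pass the lines of the
# real bacula log file instead (its path depends on the installation).
sample = [
    "15-Apr 09:52 server-sd JobId 10: Volume FB0569 previously written, "
    "moving to end of data.",
    "15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network "
    "send error to SD. ERR=Broken pipe",
    "15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network "
    "send error to SD. ERR=Broken pipe",
    "15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network "
    "send error to SD. ERR=Broken pipe",
]

for text, count in most_repeated(sample):
    print(count, text)
```

If the repeats carried differing timestamps, the leading date and time would need to be stripped before counting; that is not done here.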

It seems to me like it's possible for the director to get stuck in a 
loop and send the contents of that mail file again and again, 
infinitely.  Both times we've had the SD crash (both have happened since 
upgrading to 5.0.1), the only thing that stopped the Message generation 
was stopping the director itself.
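
A quick way to check whether a stale .mail file is still sitting in the working directory after such a crash is sketched below. The helper name and the one-hour age threshold are assumptions made for illustration; /var/bacula/work is the working directory that appears in the traceback output, and nothing here is confirmed as the cause of the loop.

```python
import time
from pathlib import Path


def leftover_mail_files(working_dir, older_than_seconds=3600.0):
    """List *.mail files in working_dir last modified before the cutoff."""
    directory = Path(working_dir)
    if not directory.is_dir():
        return []
    cutoff = time.time() - older_than_seconds
    return sorted(
        path
        for path in directory.glob("*.mail")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Working directory as shown in the traceback; adjust for other installs.
    for path in leftover_mail_files("/var/bacula/work"):
        print(path)
```

Any file this reports while no job is running would be a candidate for the one being re-sent.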

Of course, that's the annoying symptom.  The more serious problem is the 
crash of our SD.  Any pointers to getting ptrace working with the 
automatic scripts?

thanks!
Stephen






On 04/15/2010 12:40 PM, Kern Sibbald wrote:
 On Thursday 15 April 2010 19:36:51 Stephen Thompson wrote:
 Additionally, seems like the SD was possibly reading a new
 freshly-labeled tape when it crashed...  Last items in bacula log
 besides alerts already mentioned:

 In Bacula alerts refer to tape drive information stored concerning tape
 problems, so I am assuming you mean messages.



 15-Apr 09:31 server-sd JobId 10: Writing spooled data to Volume.
 Despooling 35,000,185,219 bytes ...
 15-Apr 09:51 server-sd JobId 10: End of Volume FB0568 at 888:1414
 on device SL500-Drive-1 (/dev/nst0). Write of 262144 bytes got -1.
 15-Apr 09:51 server-sd JobId 10: Re-read of last block succeeded.
 15-Apr 09:51 server-sd JobId 10: End of medium on Volume FB0568
 Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
 15-Apr 09:51 server-sd JobId 10: 3307 Issuing autochanger unload
 slot 38, drive 1 command.
 15-Apr 09:52 server-sd JobId 10: 3301 Issuing autochanger loaded?
 drive 1 command.
 15-Apr 09:52 server-sd JobId 10: 3302 Autochanger loaded? drive 1,
 result: nothing loaded.
 15-Apr 09:52 server-sd JobId 10: 3304 Issuing autochanger load slot
 39, drive 1 command.
 15-Apr 09:52 server-sd JobId 10: 3305 Autochanger load slot 39,
 drive 1, status is OK.
 15-Apr 09:52 server-sd JobId 10: Volume FB0569 previously written,
 moving to end of data.

 Nothing but thousands of 'repetitive' alerts after that...

 What exactly is repeated?

 There was a Bacula bug #1480 in message delivery that may be the same that you
 are experiencing, it was triggered by a misconfigured SMTP server or by a
 reference in Bacula to a non-existent SMTP server  - and the simple solution
 is to make sure Bacula points to a valid functional SMTP server.  This
 problem was not particular to version 5.0.1, but I think it was fixed after
 the release of 5.0.1.  Please see the bugs database for more details.

 Kern


 thanks again,
 Stephen

 On 04/15/2010 10:25 AM, Stephen Thompson wrote:
 Hello,

 I have just now experienced a possible new bug with bacula 5.0.1.

 The symptoms are this:

 bacula-sd crashes
 bacula-dir continues to run
 bacula-dir then spews out identical Intervention needed emails until
 manually restarted

 The first time this happened over a weekend and upon returning I found
 my inbox has about 120,000 bacula emails, all the SAME and of this type:

 15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network
 send error to SD. ERR=Broken pipe

 It happened again just now (second time since upgrading from 3.0.3 to
 5.0.1) and I managed to stop the director with only a few thousand
 emails going out.

 So there are really 2 issues here:

 1)
 Why does the director apparently get stuck in an infinite loop of
 sending the same email message?  Is this a known bug?

 2)
 Regarding the SD, I received one alert of this type, the rest like the
 above:

 15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT:
 dev->blocked()

 A traceback like:
 --
 ptrace: Operation not permitted.
 /var/bacula/work/29091: No such file or directory.
 $1 = 0
 /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command
 file: No symbol exename in current context.
 --

 And a bactrace like:
 --
 Attempt to dump current JCRs
 JCR=0x19a24888 JobId=10 name=client_1.2010-04-14_18.02.33_41
 JobStatus=l use_count=1
JobType=B JobLevel=F
sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
db=(nil) db_batch=(nil) batch_started=0
 JCR=0x1981b248 JobId=11 name=client_10.2010-04-14_20.00.15_04
 JobStatus=R
use_count=1
JobType=B JobLevel=I