Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??
Additionally, seems like the SD was possibly reading a new freshly-labeled tape when it crashed...

Last items in bacula log besides alerts already mentioned:

15-Apr 09:31 server-sd JobId 10: Writing spooled data to Volume. Despooling 35,000,185,219 bytes ...
15-Apr 09:51 server-sd JobId 10: End of Volume FB0568 at 888:1414 on device SL500-Drive-1 (/dev/nst0). Write of 262144 bytes got -1.
15-Apr 09:51 server-sd JobId 10: Re-read of last block succeeded.
15-Apr 09:51 server-sd JobId 10: End of medium on Volume FB0568 Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
15-Apr 09:51 server-sd JobId 10: 3307 Issuing autochanger unload slot 38, drive 1 command.
15-Apr 09:52 server-sd JobId 10: 3301 Issuing autochanger loaded? drive 1 command.
15-Apr 09:52 server-sd JobId 10: 3302 Autochanger loaded? drive 1, result: nothing loaded.
15-Apr 09:52 server-sd JobId 10: 3304 Issuing autochanger load slot 39, drive 1 command.
15-Apr 09:52 server-sd JobId 10: 3305 Autochanger load slot 39, drive 1, status is OK.
15-Apr 09:52 server-sd JobId 10: Volume FB0569 previously written, moving to end of data.

Nothing but thousands of 'repetitive' alerts after that...

thanks again,
Stephen

On 04/15/2010 10:25 AM, Stephen Thompson wrote:
> Hello,
>
> I have just now experienced a possible new bug with bacula 5.0.1.
>
> The symptoms are this:
>
>   bacula-sd crashes
>   bacula-dir continues to run
>   bacula-dir then spews out identical Intervention needed emails until manually restarted
>
> The first time this happened over a weekend and upon returning I found my inbox has about 120,000 bacula emails, all the SAME and of this type:
>
> 15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network send error to SD. ERR=Broken pipe
>
> It happened again just now (second time since upgrading from 3.0.3 to 5.0.1) and I managed to stop the director with only a few thousand emails going out.
>
> So there are really 2 issues here:
>
> 1) Why does the director apparently get stuck in an infinite loop of sending the same email message? Is this a known bug?
>
> 2) Regarding the SD, I received one alert of this type, the rest like the above:
>
> 15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT: dev->blocked()
>
> A traceback like:
> --
> ptrace: Operation not permitted.
> /var/bacula/work/29091: No such file or directory.
> $1 = 0
> /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command file:
> No symbol exename in current context.
> --
>
> And a bactrace like:
> --
> Attempt to dump current JCRs
> JCR=0x19a24888 JobId=10 name=client_1.2010-04-14_18.02.33_41 JobStatus=l
> use_count=1
> JobType=B JobLevel=F
> sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
> end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
> db=(nil) db_batch=(nil) batch_started=0
> JCR=0x1981b248 JobId=11 name=client_10.2010-04-14_20.00.15_04 JobStatus=R
> use_count=1
> JobType=B JobLevel=I
> sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
> end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
> db=(nil) db_batch=(nil) batch_started=0
> Attempt to dump plugins. Hook count=0
> --
>
> Both clients and server seem healthy, except for the SD crash.
>
> Any ideas?
>
> thanks!
> Stephen
>
> Further info:
>
> My catalog...
>   mysql-5.0.77 (64bit) MyISAM
>   210Gb in size
>   1,412,297,215 records in File table
>   note: database built with bacula 2x scripts, upgraded with 3x scripts, then again with 5x scripts (i.e. nothing customized along the way)
>
> My OS hardware for bacula DIR+SD server...
>   Centos 5.4 (fully patched)
>   8Gb RAM
>   2Gb Swap
>   1Tb EXT3 filesystem on external fiber RAID5 array (dedicated to database, incl. temp files)
>   2 dual-core [AMD Opteron(tm) Processor 2220] CPUs
>   StorageTek SL500 Library with 2 LTO3 Drives
>
> ___
> Bacula-devel mailing list
> bacula-de...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bacula-devel

--
Stephen Thompson               Berkeley Seismological Laboratory
step...@seismo.berkeley.edu    215 McCone Hall # 4760
404.538.7077 (phone)           University of California, Berkeley
510.643.5811 (fax)             Berkeley, CA 94720-4760
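To pin down which message makes up the "thousands of repetitive alerts", one option is to tally identical lines in the Bacula log while ignoring their timestamps. The following is a minimal sketch, not part of Bacula. The line shape it parses is assumed from the excerpts quoted in this thread, and the function name `tally_messages` is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the log excerpts in this thread:
#   "15-Apr 10:02 client-fd JobId 11: Fatal error: ..."
#   "15-Apr 10:02 server-sd: ERROR in lock.c:268 ..."
LINE_RE = re.compile(
    r"^(?P<date>\d{2}-\w{3}) (?P<time>\d{2}:\d{2}) "
    r"(?P<daemon>\S+?):? (?:JobId (?P<jobid>\d+): )?(?P<text>.*)$"
)


def tally_messages(lines):
    """Count log messages, treating lines that differ only in timestamp as equal."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            # Key on daemon, job id and message text; drop date and time.
            counts[(match["daemon"], match["jobid"], match["text"])] += 1
    return counts
```

Feeding the open log file to `tally_messages` and printing `most_common(5)` on the result would show whether a single message accounts for nearly all of the volume.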
Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??
On Thursday 15 April 2010 19:25:34 Stephen Thompson wrote:
> Hello,
>
> I have just now experienced a possible new bug with bacula 5.0.1.
>
> The symptoms are this:
>
>   bacula-sd crashes
>   bacula-dir continues to run
>   bacula-dir then spews out identical Intervention needed emails until manually restarted
>
> The first time this happened over a weekend and upon returning I found my inbox has about 120,000 bacula emails, all the SAME and of this type:
>
> 15-Apr 10:02 client-fd JobId 11: Fatal error: backup.c:1048 Network send error to SD. ERR=Broken pipe
>
> It happened again just now (second time since upgrading from 3.0.3 to 5.0.1) and I managed to stop the director with only a few thousand emails going out.
>
> So there are really 2 issues here:
>
> 1) Why does the director apparently get stuck in an infinite loop of sending the same email message?

I have no idea.

> Is this a known bug?

No, I have never heard or seen this kind of problem before.

> 2) Regarding the SD, I received one alert of this type,

What is an alert? Do you mean an email or a Bacula message? If so, which one?

> the rest like the above:
>
> 15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT: dev->blocked()

It is not very clear what you are saying. Do you mean that you are receiving the above message many times?

> A traceback like:
> --
> ptrace: Operation not permitted.
> /var/bacula/work/29091: No such file or directory.
> $1 = 0
> /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command file:
> No symbol exename in current context.
> --
>
> And a bactrace like:
> --
> Attempt to dump current JCRs
> JCR=0x19a24888 JobId=10 name=client_1.2010-04-14_18.02.33_41 JobStatus=l
> use_count=1
> JobType=B JobLevel=F
> sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
> end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
> db=(nil) db_batch=(nil) batch_started=0
> JCR=0x1981b248 JobId=11 name=client_10.2010-04-14_20.00.15_04 JobStatus=R
> use_count=1
> JobType=B JobLevel=I
> sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
> end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
> db=(nil) db_batch=(nil) batch_started=0
> Attempt to dump plugins. Hook count=0
> --
>
> Both clients and server seem healthy, except for the SD crash.
>
> Any ideas?

No. To understand the problem we will need a traceback that you will probably need to produce manually as described in the Kaboom chapter, or you will need to fix the automatic traceback scripts so that they can do a ptrace.

Kern

> thanks!
> Stephen

___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
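As a rough illustration of producing a traceback by hand, the sketch below builds a gdb batch command that attaches to a running process and prints a backtrace of every thread. It is not the procedure from the Kaboom chapter and not Bacula's btraceback script; the function names are hypothetical. It assumes gdb is installed and that the caller has permission to attach. The "ptrace: Operation not permitted" error above suggests gdb lacked that permission, though the thread does not confirm the cause.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, dumps all thread
    backtraces, and exits (batch mode detaches on exit)."""
    return [
        "gdb", "-batch",
        "-ex", "thread apply all bt",
        "-p", str(pid),
    ]


def run_backtrace(pid, timeout=60):
    """Run gdb against `pid` and return its combined output as text.

    Assumption: run with enough privilege to ptrace the target process.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

Attaching like this only helps while the daemon is still alive, for example when it is hung. For a daemon that aborts on a failed ASSERT, running it under the debugger from the start, as the Kaboom chapter describes, is the more likely route.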
Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??
On Thursday 15 April 2010 19:36:51 Stephen Thompson wrote:
> Additionally, seems like the SD was possibly reading a new freshly-labeled tape when it crashed...
>
> Last items in bacula log besides alerts already mentioned:

In Bacula, alerts refer to stored tape drive information concerning tape problems, so I am assuming you mean messages.

> 15-Apr 09:31 server-sd JobId 10: Writing spooled data to Volume. Despooling 35,000,185,219 bytes ...
> 15-Apr 09:51 server-sd JobId 10: End of Volume FB0568 at 888:1414 on device SL500-Drive-1 (/dev/nst0). Write of 262144 bytes got -1.
> 15-Apr 09:51 server-sd JobId 10: Re-read of last block succeeded.
> 15-Apr 09:51 server-sd JobId 10: End of medium on Volume FB0568 Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
> 15-Apr 09:51 server-sd JobId 10: 3307 Issuing autochanger unload slot 38, drive 1 command.
> 15-Apr 09:52 server-sd JobId 10: 3301 Issuing autochanger loaded? drive 1 command.
> 15-Apr 09:52 server-sd JobId 10: 3302 Autochanger loaded? drive 1, result: nothing loaded.
> 15-Apr 09:52 server-sd JobId 10: 3304 Issuing autochanger load slot 39, drive 1 command.
> 15-Apr 09:52 server-sd JobId 10: 3305 Autochanger load slot 39, drive 1, status is OK.
> 15-Apr 09:52 server-sd JobId 10: Volume FB0569 previously written, moving to end of data.
>
> Nothing but thousands of 'repetitive' alerts after that...

What exactly is repeated?

There was a Bacula bug, #1480, in message delivery that may be the same one you are experiencing. It was triggered by a misconfigured SMTP server or by a reference in Bacula to a non-existent SMTP server, and the simple solution is to make sure Bacula points to a valid, functional SMTP server. This problem was not particular to version 5.0.1, but I think it was fixed after the release of 5.0.1. Please see the bugs database for more details.

Kern

> thanks again,
> Stephen
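One quick way to rule the SMTP side in or out is to check that the mail server Bacula is configured to use accepts a connection and answers a NOOP. This is a minimal sketch using Python's standard `smtplib`, independent of Bacula itself. The function name is hypothetical, and the host and port to test are whatever the local Bacula configuration points at, which this thread does not state.

```python
import smtplib


def smtp_server_responds(host, port=25, timeout=10.0):
    """Return True if an SMTP server at host:port accepts a connection
    and replies 250 to NOOP; return False on any connection or SMTP error."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as server:
            code, _ = server.noop()
            return code == 250
    except (OSError, smtplib.SMTPException):
        return False
```

A `False` result from the director's host would point toward the kind of misconfiguration described for bug #1480. A `True` result only shows the server is reachable, not that delivery succeeds.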
Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??
Hello,

Thanks for the response.

No, it's nothing to do with mail configuration; 100% sure of that. (I know people say that all the time, but, seriously, it's the director.)

And by alerts, I do mean Messages in the bacula vernacular.

The first time this crash happened, we received 120,000 Messages in the form of emails to our administrative account. The messages were identical both to each other and to the content of the $JOB.mail file in our bacula working directory (which is never removed automatically after one of these crashes; perhaps that causes the endless cycle). The same Message also appears to be written to our bacula log file each time an email is generated (or vice versa).

It seems to me that it's possible for the director to get stuck in a loop and send the contents of that mail file again and again, infinitely. Both times we've had the SD crash (both have happened since upgrading to 5.0.1), the only thing that stopped the Message generation was stopping the director itself.

Of course, that's the annoying symptom. The more serious problem is the crash of our SD. Any pointers to getting ptrace working with the automatic scripts?

thanks!
Stephen

On 04/15/2010 12:40 PM, Kern Sibbald wrote:
> What exactly is repeated?
>
> There was a Bacula bug, #1480, in message delivery that may be the same one you are experiencing. It was triggered by a misconfigured SMTP server or by a reference in Bacula to a non-existent SMTP server, and the simple solution is to make sure Bacula points to a valid, functional SMTP server. This problem was not particular to version 5.0.1, but I think it was fixed after the release of 5.0.1. Please see the bugs database for more details.
>
> Kern
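To check the observation that the `$JOB.mail` file stays behind after a crash, a small script can list leftover `*.mail` files in the working directory with their size and modification time. This is only a diagnostic sketch. The function name is hypothetical, and the directory to pass in is an assumption (the traceback above mentions `/var/bacula/work`).

```python
from datetime import datetime
from pathlib import Path


def leftover_mail_files(work_dir):
    """Return (name, size_bytes, modified) for each *.mail file in work_dir,
    oldest first."""
    entries = []
    for path in Path(work_dir).glob("*.mail"):
        stat = path.stat()
        entries.append(
            (path.name, stat.st_size, datetime.fromtimestamp(stat.st_mtime))
        )
    return sorted(entries, key=lambda entry: entry[2])
```

If a file listed here keeps the same size and timestamp while identical emails continue to arrive, that would be consistent with the same contents being re-sent, though it would not by itself prove where the loop is.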