In the message dated: Mon, 06 Feb 2012 12:43:41 GMT,
The pithy ruminations from Martin Simmons on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups unreadable> were:
Martin,

Thanks again for continuing to respond...I appreciate the feedback and
troubleshooting help.

=> >>>>> On Fri, 03 Feb 2012 20:04:44 -0500, mark bergman said:
=> >
=> > I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are
=> > ejected and to timestamp the SCSI release commands.
=> >
=> > Is it possible that bacula flagged tapes 003231 and 000312 as being in
=> > the drives because they were loaded when the server crashed, even though
=> > they were later ejected (outside of bacula's control)? Could this cause
=> > bacula to believe that the tapes were at EOT when they do get loaded, and
=> > bacula then immediately begins writing (corrupting the label)? [Unlikely
=> > that bacula would try to write before reading the label, and would then
=> > read the label after corrupting the tapes.]
=>
=> I don't see how this could happen. Bacula issues a rewind command when it

I don't see how it could happen either...but I'm searching for any
explanation.

=> mounts a tape and should then know that the tape is at the start.

That's what I'd expect too.

=>
=>
=> > When the current backup is finished, I'll extract the beginning data
=> > on each of 003231 and 000312. Is there anything you recommend in terms
=> > of checking the data on tape to determine whether the tape begins with
=> > random garbage (possibly caused by the shutdown, startup, scsi reset,
=> > etc.) or if it begins with valid bacula data that happened to overwrite
=> > the label instead of being appended?
=>
=> Do you have a File device defined in the SD? If so, label a new File volume

No.

=> and then append the data from the start of the tape to the end of the file
=> volume using dd and cat. You can then examine the file volume using bls -v -j
=> (the File label will allow bls to read it).

Can I do this against a tape directly?
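The dd-and-cat step Martin describes above could be sketched roughly as follows. This is only an illustration under assumptions: the function name append_tape_head and every path passed to it are hypothetical, the File volume must already have been labeled by Bacula, and the tape is assumed to be rewound before the function is called.

```shell
#!/bin/sh
# Sketch only: append the first blocks of a tape (or of an existing dump
# file) to an already-labeled Bacula File volume.
# Hypothetical name: append_tape_head. All paths are supplied by the caller.

# append_tape_head SOURCE FILE_VOLUME [BLOCKS]
#   SOURCE       tape device (already rewound) or a dump file
#   FILE_VOLUME  File volume that already carries a Bacula label
#   BLOCKS       number of 64k input blocks to copy (default 1024)
append_tape_head() {
    src=$1
    vol=$2
    blocks=${3:-1024}

    # Refuse to create the volume here: without a File label in front of
    # the appended data, the volume would not be readable as one.
    if [ ! -s "$vol" ]; then
        echo "file volume $vol is missing or empty (not labeled?)" >&2
        return 1
    fi

    tmp=$(mktemp) || return 1

    # Copy the first blocks from the source into a scratch file.
    if ! dd if="$src" of="$tmp" ibs=64k count="$blocks" 2>/dev/null; then
        rm -f "$tmp"
        return 1
    fi

    # Append the raw data after the existing label.
    cat "$tmp" >> "$vol"
    rm -f "$tmp"
}
```

After something like `append_tape_head /path/to/dump /path/to/file_volume`, the File volume could then be examined with bls -v -j as suggested, using whatever configuration and device arguments the local installation requires.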
=>
=>
=> > Does anyone have suggestions of how to troubleshoot this further,
=> > or how to make the daemon startup process more resistant to causing
=> > any corruption?
=>
=> The important information missing is whether 000312 was already corrupted at
=> 01-Feb 20:11. You could add some commands to the startup part of

Hmmm... The only way that I could imagine that happening is if:

	bacula loads the tape as needed
	bacula reads the volume label
	{somehow the tape is rewound, either when the tape is first
	 loaded, or after some backups are written}
	bacula writes to tape

The only thing outside of bacula that touches the tape drive in any way is
the /etc/init.d/bacula-sd script, which unloads any tapes before starting
the daemon & after shutting down the daemon.

=> /etc/init.d/bacula-sd script before it unloads all tapes. E.g. do mt status,
=> mt rewind and grab a copy of the first few blocks on any loaded tapes.

Sure. I'm thinking that I may modify /opt/bacula/scripts/mtx-changer to
replace the "unload" operation with:

	mt -f $TAPE rewind
	dd if=$TAPE of=/opt/bacula/working/dump_$VOLUMEID.`date '+%Y-%m-%d_%T'` ibs=64k count=1024
	mtx -f $ctl unload $slot $drive

Is that a suitable number of blocks to dump? I've got the dumps from 5
corrupted tapes, and I'm trying to see if they have anything in common
(for example, maybe the first 128k is corrupted, followed by valid data
from dumps that should have been appended to the tape).

=>
=> Also, you say that infrastructure1 server crashes. Maybe the crash caused the
=> tape to be rewound and some buffer flushed to start of the tape?

I can't see how...

	If there was unwritten data in a buffer within the memory
	of the server infrastructure1, then when the server crashes
	it wouldn't get written to tape.
	The 'infrastructure' machines are part of an HA cluster...in
	this crash, the other nodes determined that infrastructure1
	had lost communication with the quorum disk, and they powered
	off the node...even if that action reset the fibre loop and
	caused the tape library to rewind both tapes (unlikely), I
	don't know how any buffers on the infrastructure1 server
	could be written when the power was out.

	If there was unwritten data in a buffer within the memory
	of the tape library, then I believe it must be written
	before any rewind command will be honored.

	If infrastructure1 sends data to the tape drive, that data
	is buffered, infrastructure1 then crashes, infrastructure2
	runs /etc/init.d/bacula-sd (which ejects tapes, thereby
	rewinding them)...the data within the buffer in the tape
	drive would still be written before the rewind/eject command
	was executed.

Thanks again for your help,

Mark

=>
=> __Martin
=>
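The comparison of the dumps from the corrupted tapes mentioned above (a possibly corrupted prefix followed by valid data) could be partly automated along these lines. This is a rough sketch, not a tested procedure: the helper names common_prefix_len and first_marker_offset are hypothetical, and the default marker string is an assumption about what a Bacula volume label contains, so it should be checked against a dump from a known-good tape before relying on it.

```shell
#!/bin/sh
# Sketch only: helpers for comparing raw dumps taken from the start of tapes.
# Hypothetical names: common_prefix_len, first_marker_offset.

# common_prefix_len A B
#   Print how many leading bytes are identical in files A and B.
common_prefix_len() {
    # cmp reports the 1-based position of the first differing byte as
    # "... differ: byte N, line M" (or "char N" in older versions).
    n=$(LC_ALL=C cmp "$1" "$2" 2>/dev/null |
        sed -n 's/.* differ: [a-z]* \([0-9][0-9]*\),.*/\1/p')
    if [ -n "$n" ]; then
        echo $((n - 1))
    else
        # No difference reported: the files are identical, or one is a
        # prefix of the other, so the shorter length is the common prefix.
        a=$(($(wc -c < "$1")))
        b=$(($(wc -c < "$2")))
        if [ "$a" -lt "$b" ]; then echo "$a"; else echo "$b"; fi
    fi
}

# first_marker_offset DUMP [MARKER]
#   Print the byte offset of the first occurrence of MARKER in DUMP.
#   Returns non-zero and prints nothing if MARKER never appears.
#   The default MARKER is an assumption; verify it on a good volume.
first_marker_offset() {
    dump=$1
    marker=${2:-Bacula 1.0 immortal}
    # -a: treat binary data as text, -b: print byte offsets,
    # -o: print only the match, -F: fixed string, not a regex.
    off=$(LC_ALL=C grep -a -b -o -F -- "$marker" "$dump" |
          head -n 1 | cut -d: -f1)
    [ -n "$off" ] && echo "$off"
}
```

For example, `common_prefix_len dump_A dump_B` shows whether two dumps start with the same bytes, and a non-zero result from `first_marker_offset dump_A` would suggest that something was written ahead of where the marker string first appears.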