Hello,

Done. After some more tests I found where the problem is (and once again it is not in our hardware, our OS or anything broken on our side). It is where I initially suggested: the concurrent jobs.
After the first (and original) configuration we used (concurrent jobs, with gzip) we tested the following:

1. concurrent jobs, w/o gzip - we got similar errors (1 wrong file size in 4 jobs, but 3 of the 4 jobs restored fewer files than expected; the 4th is usually very small - 100 files - and never had errors, so I would say 100% of the jobs were invalid)
2. no concurrent jobs (Maximum Concurrent Jobs = 1 at dir and sd), w/o gzip - good news, all restores are OK, no errors, Files Expected and Files Restored match!
3. no concurrent jobs WITH gzip - again all restores are OK, no errors, Files Expected and Files Restored match!

So until now we have:

- the problem is not caused by a corrupted file system
- volumes are consistent and bls doesn't show errors
- MySQL is OK (initially 4.1.x, now 5.0.37)
- when running concurrent jobs both 2.0.3 and 2.1.28 say backups are OK but restores fail with one of the 3 kinds of errors listed below
- when concurrent jobs are turned off everything is OK
- gzip on/off doesn't affect the errors

Once again the 3 types of errors are:

1. Some static files (i.e. not log files!) are restored with a wrong (always larger) size: the first N bytes match and the rest is filled with part of another file (I am not sure whether the file just gets a wrong size and some old data from the disk shows up at the end, or Bacula restores part of another file and appends it to the end). The file can be restored correctly if marked alone, but then error 3 below is generated (which seems to be just a bogus error). An example error is:

---
b0: Restore_b0.d6.int.2007-07-23_22.37.34 Error: attribs.c:410 File size of restored file /home/bacula/res/b3.2/usr/src/redhat/RPMS/i686/glibc-2.2.5-44.i686.rpm not correct. Original 3826291, restored 10620921.
---

Whenever this error is present, the second error below (missing files) is present as well, but w/o its additional error messages.

2. A large number of files are missing (while they are present in the catalog and selected) - tens of thousands (not sockets or anything else that Bacula ignores by default). When this happens, an error like this usually appears (if not the first one above):

---
b3: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: Record header file index 42452 not equal record index 0
Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: read.c:124 Error sending to File daemon. ERR=Connection reset by peer
Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Error: bsock.c:306 Write error sending 30 bytes to client:10.2.1.13:36643: ERR=Connection reset by peer
---

3. When a file from error 1 is restored alone it is OK, but another bogus error is generated:

---
Storage: Restore_b0.d6.int.2007-07-23_22.57.42 Error: block.c:275 Volume data error at 0:3999743252! Wanted ID: "BB02", got "Иnлу". Buffer discarded.
---

I found that the above number (3999743252) is not present as a block address for any block in the volumes, but the same number appears as part of a JobMedia record in the database.

This is everything in 2.1.28, summarized, that popped up as a problem or fact. (2.0.3 had another bug with bogus errors about sockets' attributes and 2.1.26 had bogus SQL error messages, but those are fixed in 2.1.28.)

If anyone wants, feel free to reopen the bug in Mantis (903). I'm not going to do so as I am personally disappointed by the attitude "this is not a bug - work it out yourself" and the suggestion to send you our servers as a gift to test with, plus support fees... nice. Now it's up to you to create better test cases to catch more bugs, if any.

We will start our backup again w/o concurrent jobs and we will continue to monitor restores on a daily basis, as the above are just 3 tests and I agree there is a possibility that it was just chance that the latter two tests went OK.
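For anyone who wants to run the same kind of daily restore check, here is a rough sketch of how it could be scripted. It is not part of Bacula and not what we actually ran; it assumes the original tree and the restored tree are both reachable as local directories, and the function names (collect_sizes, compare_trees) are made up for illustration. It only looks for the two symptoms described above: files missing from the restore and files restored with a different size.

```python
import os
import stat
import sys


def collect_sizes(root):
    """Map each regular file under root (path relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode):  # skip sockets, fifos, symlinks, ...
                sizes[os.path.relpath(full, root)] = st.st_size
    return sizes


def compare_trees(original_root, restored_root):
    """Return (missing, wrong_size) for the restored tree measured against the original.

    missing    - sorted relative paths present in the original but not restored
    wrong_size - sorted (path, original_size, restored_size) tuples
    """
    original = collect_sizes(original_root)
    restored = collect_sizes(restored_root)
    missing = sorted(p for p in original if p not in restored)
    wrong_size = sorted(
        (p, original[p], restored[p])
        for p in original
        if p in restored and original[p] != restored[p]
    )
    return missing, wrong_size


# Hypothetical command-line use: script.py <original_root> <restored_root>
if __name__ == "__main__" and len(sys.argv) == 3:
    missing, wrong_size = compare_trees(sys.argv[1], sys.argv[2])
    for path in missing:
        print("MISSING", path)
    for path, want, got in wrong_size:
        print("SIZE", path, "original", want, "restored", got)
    print(len(missing), "missing,", len(wrong_size), "with wrong size")
```

Note that a size comparison only catches the first error type when the size actually differs, and it will also flag files that legitimately changed on the live system after the backup, so it is best run against static files.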
But it was my suggestion from the beginning that the problem is that Bacula damages either database numbers or volume records when concurrent jobs are running, and so far the facts support this.

(!) The workaround for the problem is to switch off concurrent jobs; if you don't, the chance that you have invalid backups is high (some 90% in our own cases, at least with our servers/OS/configuration; not to say that 100% of backups are wrong, since after diff/incremental backups Bacula restores files that were deleted, which is really bad behaviour in many cases/services).

Regards

Tuesday, July 24, 2007, 12:15:43 AM:

DL> On 23 Jul 2007 at 21:57, Doytchin Spiridonov wrote:

>> Hello,
>>
>> I've filed this as a bug, but while Kern couldn't reproduce it he gave
>> up. So let us find here what could be the problem. There are actually
>> two problems, they could be linked.

DL> Please. If anyone can solve the issue given what you supplied, they
DL> would. You were asked to supply a reproducible situation. Hopefully
DL> we can get to that position quickly without further unnecessary
DL> distractions.