Hello,

Done. After some more tests I found where the problem is (and once
again it is not in our hardware, OS, or anything broken on our side).
It is where I initially suggested: the concurrent jobs.

After the first (and native) configuration we used (concurrent jobs,
with gzip), we tested the following:

1. concurrent jobs, w/o gzip
- we got similar errors (1 wrong file size out of 4 jobs, but 3 of 4
jobs restored fewer files than expected; the 4th is usually very small,
about 100 files, and never had errors, so I would say 100% of the jobs
were invalid)

2. no concurrent jobs (Maximum Concurrent Jobs = 1 at dir and sd), w/o
gzip
- good news, all restores are OK, no errors, Files Expected and Files
Restored match!

3. no concurrent jobs WITH gzip
- again OK, all restores are OK, no errors, Files Expected and Files
Restored match!

So far we have:
- the problem is not caused by a corrupted file system
- volumes are consistent and bls doesn't show errors
- MySQL is OK (initially 4.1.x now 5.0.37)
- when running concurrent jobs both 2.0.3 and 2.1.28 say backups are
OK but restores fail with one of the 3 kinds of errors listed below
- when concurrent jobs are turned off everything is OK
- gzip on/off doesn't affect the errors

Once again the 3 types of errors are:

1. some static files (i.e. not log files!) are restored with the wrong
(always larger) size: the first N bytes match, and the rest is filled
with part of another file (I am not sure whether this is just a file
with the wrong size, so that some old data on the disk appears at the
end, or whether Bacula restores part of another file and appends it to
the end). The file can be restored correctly if it is marked alone, but
then error 3 below is generated (which seems to be just a bogus error).
An example error is:
---
b0: Restore_b0.d6.int.2007-07-23_22.37.34 Error: attribs.c:410 File size
of restored file
/home/bacula/res/b3.2/usr/src/redhat/RPMS/i686/glibc-2.2.5-44.i686.rpm
not correct. Original 3826291, restored 10620921.
---
Whenever this error is present, the second error below (missing files)
is always present as well, but without its additional error messages.

2. a large number of files are missing (although they are present in
the catalog and selected): tens of thousands (not sockets or anything
else that Bacula ignores by default). When this happens, an error like
this usually appears (if not the first one above):
---
b3: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: Record header
file index 42452 not equal record index 0
Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Fatal error: read.c:124
Error sending to File daemon. ERR=Connection reset by peer
Storage: Restore_b3.d6.int.2007-07-23_17.31.47 Error: bsock.c:306
Write error sending 30 bytes to client:10.2.1.13:36643: ERR=Connection
reset by peer
---

3. when a file from error 1 is restored alone it is OK, but another
bogus error is generated:
---
Storage: Restore_b0.d6.int.2007-07-23_22.57.42 Error: block.c:275
Volume data error at 0:3999743252! Wanted ID: "BB02", got "Иnлу".
Buffer discarded.
---
I found that the above number (3999743252) is not present as a block
address for any block in the volumes, but the same number appears as
part of a JobMedia record in the database.
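
The pattern described in error 1 (first N bytes match, extra data at
the end) could be checked with a small script along these lines. This
is only a rough sketch in Python, assuming the original and the
restored file are both readable from the same machine; the function
names are made up for illustration and are not part of Bacula.

```python
import os


def common_prefix_length(path_a, path_b, chunk_size=65536):
    """Return how many leading bytes are identical in both files."""
    matched = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a or not b:
                # One of the files ended: nothing more to compare.
                return matched
            if a == b:
                matched += len(a)
                continue
            # Chunks differ in content or length: count byte by byte.
            for x, y in zip(a, b):
                if x != y:
                    return matched
                matched += 1
            return matched


def classify_restore(original, restored):
    """Summarise how a restored file differs from its original."""
    orig_size = os.path.getsize(original)
    rest_size = os.path.getsize(restored)
    prefix = common_prefix_length(original, restored)
    return {
        "original_size": orig_size,
        "restored_size": rest_size,
        "matching_prefix": prefix,
        "extra_bytes": max(rest_size - orig_size, 0),
        # True when the whole original is intact and only a tail was added.
        "tail_appended": prefix == orig_size and rest_size > orig_size,
    }
```

If "tail_appended" comes back true, the restored file is the complete
original plus extra bytes, which is what error 1 looks like here. The
extra bytes would still have to be inspected by hand to see where they
come from.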


This is a summary of everything in 2.1.28 that popped up as a problem
or fact.
(2.0.3 had another bug with bogus errors about sockets' attributes, and
2.1.26 had bogus SQL error messages, but those are fixed in 2.1.28.)

If anyone wants, feel free to reopen the bug in Mantis (903). I'm not
going to do so as I am personally disappointed by the attitude "this
is not a bug - work it out yourself" and the suggestion to send you
our servers as a gift to test with, plus support fees... nice. Now
it's up to you to create better test cases to catch more bugs if any.

We will start our backups again w/o concurrent jobs and we will
continue to monitor restores on a daily basis, as the above are just 3
tests and I agree there is a possibility that it was only by chance
that the latter two went OK. But it was my suggestion from the
beginning that Bacula damages either database numbers or volume records
when concurrent jobs are running, and so far the facts support this.
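
For the daily monitoring, a comparison of the restored tree against the
original could be scripted roughly like this. It is a sketch only, in
Python, with made-up function names; it assumes the original tree is
mounted or copied where the script can read it, and it is only
meaningful for static files that have not changed since the backup.

```python
import os


def file_sizes(root):
    """Map each regular file under root (relative path) to its size."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def compare_restore(original_root, restored_root):
    """Return (missing, wrong_size) for a restored directory tree.

    missing    -- relative paths present in the original but not restored
    wrong_size -- (path, original_size, restored_size) tuples
    """
    expected = file_sizes(original_root)
    restored = file_sizes(restored_root)
    missing = sorted(set(expected) - set(restored))
    wrong_size = sorted(
        (path, expected[path], restored[path])
        for path in expected
        if path in restored and expected[path] != restored[path]
    )
    return missing, wrong_size
```

An empty "missing" list and an empty "wrong_size" list would correspond
to the good case above, where Files Expected and Files Restored match
and no size errors are reported.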


(!) The workaround for the problem is to switch off concurrent jobs;
otherwise the chance of having invalid backups is high (some 90% in our
own cases, at least with our servers/OS/configuration; and that is
without counting 100% of backups as wrong because, after
differential/incremental backups, Bacula restores files that have been
deleted, which is really bad behaviour for many cases/services).


Regards



Tuesday, July 24, 2007, 12:15:43 AM:

DL> On 23 Jul 2007 at 21:57, Doytchin Spiridonov wrote:

>> Hello,
>> 
>> I've filed this as a bug, but while Kern couldn't reproduce it he gave
>> up. So let us find here what could be the problem. There are actually
>> two problems, they could be linked.

DL> Please.  If anyone can solve the issue given what you supplied, they 
DL> would.  You were asked to supply a reproducible situation.  Hopefully 
DL> we can get to that position quickly without further unnecessary 
DL> distractions.



_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
