Hi Kern,

Thanks for taking the time to reply!

2009/11/2 Kern Sibbald <k...@sibbald.com>:
> 1. You didn't mention what Bacula version you are using, which makes analysis
> more difficult.
Still bacula-3.0.2 on Debian Lenny amd64. I'm planning to roll a 3.0.3
.deb but I'd like to work out what these problems are first.

> 2. There is (was) a serious editing problem in the batch insert error
> handling. The code essentially attempted to edit an error message into the
> error message buffer containing the error message -- clobbered output.
Ouch. Well, I guess that explains all the binary spammed into my log
files. Is there a patch I can pull from git to prevent this?

> 3. If you bump up against a limit of maximum concurrent jobs, you will
> definitely see error messages. Not much we can do about it.
I've got "Maximum Concurrent Jobs = 100" set in the director/sd, and
I'm only attempting to run 30 or so jobs at the same time at the
moment; this shouldn't increase to more than 80 or so, I hope.

> 4. There very well could be a low "max_connections" for PostgreSQL causing the
> batch insert open errors (the PostgreSQL error message was clobbered in your
> output due to a Bacula bug).
I've upped this to 256 now (from 16). Everything still broke again
last night, but I think I needed a full restart of PostgreSQL rather
than a reload to make the setting live. Interestingly, some of the
errors were slightly different this time -- the nightly incrementals
were completed correctly from the FD's point of view, but the director
never got an OK from the SD that the data had been written. I'll
update this thread again tomorrow with some re-organised logs.
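
For what it's worth, here is a back-of-the-envelope check of that limit. It is only a sketch: the figure of two catalog connections per running job (one regular, one for batch insert) and the small reserve for other clients are my assumptions, not something confirmed in this thread, and the function names are made up for illustration.

```python
# Rough sanity check: does PostgreSQL's max_connections leave room for the
# number of jobs running at once?
# ASSUMPTIONS (not confirmed in this thread): each running job may hold up to
# 2 catalog connections when batch insert is in use, and a few connections
# are kept in reserve for the console, psql sessions, and so on.

def required_connections(concurrent_jobs, per_job=2, reserved=5):
    """Estimate how many database connections a given job load may need."""
    return concurrent_jobs * per_job + reserved


def has_headroom(max_connections, concurrent_jobs, per_job=2, reserved=5):
    """Return True if max_connections covers the estimated requirement."""
    return max_connections >= required_connections(concurrent_jobs, per_job, reserved)


if __name__ == "__main__":
    # Old setting (16) and new setting (256) against the current (~30)
    # and expected (~80) number of simultaneous jobs.
    for limit in (16, 256):
        for jobs in (30, 80):
            print(limit, jobs, required_connections(jobs), has_headroom(limit, jobs))
```

Under those assumptions the old limit of 16 is far too low for 30 simultaneous jobs, while 256 should cover 80 with room to spare, provided the new value has actually taken effect.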

> 5. We will take a look at how the error condition is propagated back from the
> Director to the SD.  I haven't looked at that yet.  If it is not properly
> reported to the SD, it could cause a bit of bottlenecking there.
If there's anything I can do to help dig up potential problems, just
let me know.

> 6. This weekend, I found (and hopefully corrected) a serious race condition
> with cancelling running Jobs in the SD. From the symptoms you describe, I am
> not sure if it is the same problem.
It's possible that it's part of what I'm seeing, I think. Certainly
when the director thinks that jobs are in a "Fatal Error" state, it is
impossible to cancel them until the storage daemon is restarted.
Again, is there a patch (or patch set) I can pull from git to see if
your changes fix my problems?

> 7. I am not quite sure why you talk about a SD "hang".  It is possible for
> jobs to get stuck for some time, and to possibly prevent new jobs from
> starting, but in most cases, they are ultimately cleaned out after the comm
> line times out.  I am not contesting that you may have found a "hang" in the
> SD, but it is hard to diagnose it with such little info.
The "hang" is the problem I'm seeing whereby jobs that have been
forced into the "Fatal Error" state (*probably* by an error from a
database connection) cannot be cancelled from the console as the
storage daemon does not respond to the director. The storage daemon is
still happily running jobs that are not in the "Fatal Error" state,
though, so perhaps describing it as a "hang" was a misnomer, and
perhaps my initial assumptions about the chain of events causing these
problems were wrong.

> 8. Such error conditions are always very hard to test, since they rarely
> occur.
I seem to have a setup that is prone to exhibiting them here, so if
there's any testing I can do to help I'm more than happy to lend
assistance ;-)

Many Thanks,
--Alex

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
