On Tuesday 03 November 2009 13:00:36 Alex Bramley wrote:
> Hi Kern,
>
> Thanks for taking the time to reply!
Thanks for the "bug report". I believe that we have now finished correcting all the other problems of properly reporting the error (it was reported but not acted on sufficiently quickly). The patches that you probably want to apply in the order listed below from top to bottom (oldest to newest) from the current git repo are: Fix SD DCR race condition that causes seg faults 6058e2a610de8622b259a6feed54f3af83e614d1 Fix buffer clobber when editing SQL error 5700c4a0833d4962c2b1b82cc3cb965552fc30ea Increase width of ls size. Fixes bug #1409 090f5e1c0bd37c776650f2d5388235ef11d39dd0 Cleanup attribute catalog insert errors + cancel SD only once 84aabba7cee82f0c1f6dae8882a2ee0bb26306ca Note, the last patch is important, and it passes all the regression tests, but it was implemented and committed today, so it is a bit fresh and hasn't passed the nightly regression on all the different platforms, which hopefully it will do tonight. I don't guarantee that the patches will go easily or cleanly into 3.0.2 or 3.0.3 as they are in our current git repo. If you use git, you stand a good change of having them go in correctly, since in these kinds of situations git really is like magic especially if you use "cherry-pick" See below ... > > 2009/11/2 Kern Sibbald <k...@sibbald.com>: > > 1. You didn't mention what Bacula version you are using, which makes > > analysis more difficult. > > Still bacula-3.0.2 on Debian Lenny amd64. I'm planning to roll a 3.0.3 > .deb but I'd like to work out what these problems are first. > > > 2. There is (was) a serious editing problem in the batch insert error > > handling. The code essentially attempted to edit an error message into > > the error message buffer containing the error message -- clobbered > > output. > > Ouch. Well, I guess that explains all the binary spammed into my log > files. Is there a patch I can pull from git to prevent this? > > > 3. If you bump up against a limit of maximum concurrent jobs, you will > > definitely see error message. Not much we can do about it. > > I've got "Maximum Concurrent Jobs = 100" set in the director/sd, and > i'm only attempting to run 30 or so jobs at the same time at the > moment; this shouldn't increase to more than 80 or so, I hope. > > > 4. There very well could be a low "max_connections" for PosgreSQL causing > > the batch insert open errors (the PostgreSQL error message was clobbered > > in your output due to a Bacula bug). > > I've upped this to 256 now (from 16). Everything still broke again > last night, but I think I needed a full restart of PostgreSQL rather > than a reload to make the setting live. Interestingly, some of the > errors were slightly different this time -- the nightly incrementals > were completed correctly from the FD's point of view, but the director > never got an OK from the SD that the data had been written. I'll > update this thread again tomorrow with some re-organised logs. > > > 5. We will take a look at how the error condition is propagated back from > > the Director to the SD. I haven't looked at that yet. If it is not > > properly reported to the SD, it could cause a bit of bottlenecking there. > > If there's anything I can do to help dig up potential problems, just > let me know. Apply the patches noted above and force the problem to re-occur (probably a bit painful), and hopefully Bacula will handle it more gracefully. > > > 6. This weekend, I found (and hopefully corrected) a serious race > > condition with cancelling running Jobs in the SD. 
> > 3. If you bump up against a limit of maximum concurrent jobs, you will
> > definitely see error messages. Not much we can do about it.
>
> I've got "Maximum Concurrent Jobs = 100" set in the director/sd, and
> I'm only attempting to run 30 or so jobs at the same time at the
> moment; this shouldn't increase to more than 80 or so, I hope.
>
> > 4. There very well could be a low "max_connections" for PostgreSQL
> > causing the batch insert open errors (the PostgreSQL error message was
> > clobbered in your output due to a Bacula bug).
>
> I've upped this to 256 now (from 16). Everything still broke again
> last night, but I think I needed a full restart of PostgreSQL rather
> than a reload to make the setting live. Interestingly, some of the
> errors were slightly different this time -- the nightly incrementals
> were completed correctly from the FD's point of view, but the director
> never got an OK from the SD that the data had been written. I'll
> update this thread again tomorrow with some re-organised logs.
>
> > 5. We will take a look at how the error condition is propagated back
> > from the Director to the SD. I haven't looked at that yet. If it is
> > not properly reported to the SD, it could cause a bit of bottlenecking
> > there.
>
> If there's anything I can do to help dig up potential problems, just
> let me know.

Apply the patches noted above and force the problem to re-occur (probably a
bit painful), and hopefully Bacula will handle it more gracefully.

> > 6. This weekend, I found (and hopefully corrected) a serious race
> > condition with cancelling running Jobs in the SD. From the symptoms
> > you describe, I am not sure if it is the same problem.
>
> It's possible that it's part of what I'm seeing, I think. Certainly
> when the director thinks that jobs are in a "Fatal Error" state, it is
> impossible to cancel them until the storage daemon is restarted.
> Again, is there a patch (or patch set) I can pull from git to see if
> your changes fix my problems?

The patch is the first one I list above.

> > 7. I am not quite sure why you talk about a SD "hang". It is possible
> > for jobs to get stuck for some time, and to possibly prevent new jobs
> > from starting, but in most cases, they are ultimately cleaned out after
> > the comm line times out. I am not contesting that you may have found a
> > "hang" in the SD, but it is hard to diagnose it with such little info.
>
> The "hang" is the problem I'm seeing whereby jobs that have been
> forced into the "Fatal Error" state (*probably* by an error from a
> database connection) cannot be cancelled from the console as the
> storage daemon does not respond to the director. The storage daemon is
> still happily running jobs that are not in the "Fatal Error" state,
> though, so perhaps describing it as a "hang" was a misnomer, and
> perhaps my initial assumptions about the chain of events causing these
> problems were wrong.

I certainly hope these "hang" problems are resolved by my "Fix SD DCR
race ..." patch.

> > 8. Such error conditions are always very hard to test, since they
> > rarely occur.
>
> I seem to have a setup that is prone to exhibiting them here, so if
> there's any testing I can do to help I'm more than happy to lend
> assistance ;-)

Any feedback on whether or not the patches are effective would be much
appreciated.

Best regards,

Kern
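P.S. For anyone wondering what the "Fix SD DCR race condition" patch is
about: the general shape of this class of bug is two threads sharing a
per-job context, where the cancel path can tear the context down while the
job thread is still using it, producing the kind of seg fault mentioned in
the patch title. Purely as an illustration of the usual cure -- reference
counting under a mutex -- here is a generic sketch, not the actual Bacula
DCR code:

  #include <pthread.h>
  #include <stdlib.h>

  /* Stand-in for a shared per-job structure such as a device control
   * record (hypothetical names, for illustration only). */
  struct job_ctx {
     pthread_mutex_t lock;
     int refcount;               /* threads still holding this context */
     /* ... per-job state ... */
  };

  struct job_ctx *ctx_new(void)
  {
     struct job_ctx *c = (struct job_ctx *)calloc(1, sizeof(*c));
     pthread_mutex_init(&c->lock, NULL);
     c->refcount = 1;            /* creator holds the first reference */
     return c;
  }

  void ctx_acquire(struct job_ctx *c)  /* e.g. taken by the cancel path */
  {
     pthread_mutex_lock(&c->lock);
     c->refcount++;
     pthread_mutex_unlock(&c->lock);
  }

  /* Both the job thread and the cancel path release the context when they
   * are done with it; only the last one out frees it, so neither side can
   * pull it out from under the other. */
  void ctx_release(struct job_ctx *c)
  {
     pthread_mutex_lock(&c->lock);
     int last = (--c->refcount == 0);
     pthread_mutex_unlock(&c->lock);
     if (last) {
        pthread_mutex_destroy(&c->lock);
        free(c);
     }
  }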