Hi Kern,

Thanks for taking the time to reply!

2009/11/2 Kern Sibbald <k...@sibbald.com>:
> 1. You didn't mention what Bacula version you are using, which makes analysis
> more difficult.

Still bacula-3.0.2 on Debian Lenny amd64. I'm planning to roll a 3.0.3
.deb but I'd like to work out what these problems are first.

> 2. There is (was) a serious editing problem in the batch insert error
> handling. The code essentially attempted to edit an error message into the
> error message buffer containing the error message -- clobbered output.

Ouch. Well, I guess that explains all the binary spammed into my log
files. Is there a patch I can pull from git to prevent this?

> 3. If you bump up against a limit of maximum concurrent jobs, you will
> definitely see error message. Not much we can do about it.

I've got "Maximum Concurrent Jobs = 100" set in the director/sd, and
I'm only attempting to run 30 or so jobs at the same time at the
moment; this shouldn't increase to more than 80 or so, I hope.

> 4. There very well could be a low "max_connections" for PostgreSQL causing the
> batch insert open errors (the PostgreSQL error message was clobbered in your
> output due to a Bacula bug).

I've upped this to 256 now (from 16). Everything still broke again
last night, but I think I needed a full restart of PostgreSQL rather
than a reload to make the setting live.

Interestingly, some of the errors were slightly different this time --
the nightly incrementals were completed correctly from the FD's point
of view, but the director never got an OK from the SD that the data
had been written. I'll update this thread again tomorrow with some
re-organised logs.

> 5. We will take a look at how the error condition is propagated back from the
> Director to the SD. I haven't looked at that yet. If it is not properly
> reported to the SD, it could cause a bit of bottlenecking there.

If there's anything I can do to help dig up potential problems, just
let me know.

> 6. This weekend, I found (and hopefully corrected) a serious race condition
> with cancelling running Jobs in the SD. From the symptoms you describe, I am
> not sure if it is the same problem.

It's possible that it's part of what I'm seeing, I think. Certainly
when the director thinks that jobs are in a "Fatal Error" state, it is
impossible to cancel them until the storage daemon is restarted.
Again, is there a patch (or patch set) I can pull from git to see if
your changes fix my problems?

> 7. I am not quite sure why you talk about a SD "hang". It is possible for
> jobs to get stuck for some time, and to possibly prevent new jobs from
> starting, but in most cases, they are ultimately cleaned out after the comm
> line times out. I am not contesting that you may have found a "hang" in the
> SD, but it is hard to diagnose it with such little info.

The "hang" is the problem I'm seeing whereby jobs that have been
forced into the "Fatal Error" state (*probably* by an error from a
database connection) cannot be cancelled from the console, because the
storage daemon does not respond to the director. The storage daemon is
still happily running jobs that are not in the "Fatal Error" state,
though, so perhaps "hang" was a misnomer, and perhaps my initial
assumptions about the chain of events causing these problems were
wrong.

> 8. Such error conditions are always very hard to test, since they rarely
> occur.

I seem to have a setup that is prone to exhibiting them here, so if
there's any testing I can do to help, I'm more than happy to lend
assistance ;-)

Many Thanks,

--Alex

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel