Hi,

On 10/26/2006 6:34 PM, Kern Sibbald wrote:

>>BTW: I would not recommend 1.39 for production use yet as I see some
>>issues I find hard to analyze. For example, the DIR seems to have a
>>serious memory leak in certain situations,
> 
> 
> Can you explain the above as I don't have any credible reports of memory
> leaks in the Director (but *possibly* in the tray monitor) -- i.e. there
> are no open bug reports as of the beginning of my vacation.

Right, I didn't report anything because I haven't really investigated this.
This is what happens: 1.39.26 is running. Jobs are waiting for an appendable 
tape. I get mount request mails in the usual way, i.e. when Bacula needs 
a tape, then after one hour, two hours later, four hours later and so on.

My service monitoring shows me an increasing memory usage on the Bacula 
server.

After a while, services fail - they don't accept connections anymore - 
and the DIR is among them. Monitoring fails, too, because snmp is always 
among the failing services ;-(

When I later log in, I see that some processes are gone - snmpd, 
bacula-dir are the most important ones. Memory is available again. The 
log states that the kernel killed some processes due to memory 
allocation problems.

This is a Linux 2.4.something kernel, so there is no way to tune the 
OOM killer's actions, and it doesn't log as extensively as 2.6 does.

This happened twice.

I switched back to 1.39.24 to see if that changes things, but haven't 
had an out-of-tape situation since then and couldn't find time to create 
one for testing purposes.

> 
>>and the SD sometimes blocks
>>waiting for jobs tapes without any possible way of getting it to
>>continue...
> 
> 
> I'd also like more information on this, unless you are referring to the
> bug report open on this which is triggered by "Always Open=no".

Again, not really investigated. It happened with .24 and .24, though.
The situation is this:
- A job that is set up to retry runs, and uses one pool.
- There are no tapes from that pool loaded, so the device is blocked 
waiting for media.
- Other jobs wait for the tape drive it uses.
- The first job fails due to the client going away, the DIR reports it 
as being rescheduled.
- The job remains active in the SD but obviously doesn't do any work.

This state of things does not seem to time out. I had Bacula sitting like 
this for more than six hours.

I can unmount the drive, swap tapes, run 'update slots scan', mount, etc., 
but that doesn't have any effect.
I can manually cancel the job that's stuck in the SD, but the cancel 
doesn't seem to be propagated to the SD.

The only solution I found was restarting the SD, which, of course, 
makes some of the waiting jobs fail...

> And I agree with you that 1.39, is not yet ready for critical production.

I hope to get some useful debug logs some time...

Arno

-- 
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
