We have recently suffered a problem with the TSM client for Windows running in
an MSCS cluster caused by a memory leak in the non-paged pool.
The depletion of the non-paged memory pool only occurred when disk resources were
distributed in certain ways between the two nodes of the cluster. By enabling
pool tagging and using the Microsoft PoolMon.EXE utility, we were able to
determine that the leaked memory was allocated under the pool tag 'None' and
that the leak only occurred when a TSM process (e.g. DSMC.EXE, SQLDSMC.EXE, or a
scheduled incremental backup) started.
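
For anyone repeating this kind of diagnosis, the comparison amounts to taking per-tag byte counts before and after starting the suspect process and seeing which tags grew. A minimal sketch in Python, assuming the figures have been transcribed by hand from two PoolMon snapshots. The function name, the tags other than 'None' and all byte counts are made up for illustration and are not taken from our systems:

```python
def growing_tags(before, after, min_growth=1):
    """Return (tag, growth_in_bytes) pairs for tags whose usage grew, largest first.

    `before` and `after` map a pool tag to the bytes allocated under it.
    """
    growth = []
    for tag, bytes_after in after.items():
        # A tag missing from the first snapshot is treated as starting at zero.
        delta = bytes_after - before.get(tag, 0)
        if delta >= min_growth:
            growth.append((tag, delta))
    return sorted(growth, key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken either side of one invocation of the suspect process.
before = {"None": 4_200_000, "Ntfs": 1_500_000, "File": 900_000}
after = {"None": 4_950_000, "Ntfs": 1_500_000, "File": 880_000}

print(growing_tags(before, after))  # [('None', 750000)]
```

A tag that grows on every invocation and never shrinks again is the one to chase.
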
Since we perform hourly TDP transaction log backups of a virtual SQL Server in
this cluster, the non-paged pool was quickly exhausted and the cluster nodes
would fail after a period of about 2 weeks.
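
A steady leak like this can be extrapolated from a few readings of non-paged pool usage. The sketch below fits a straight line through (hour, bytes) samples and estimates the hours left before a given ceiling is reached. It is only an illustration: the function name, the sample readings and the ceiling are all hypothetical, chosen so that the answer lands near two weeks of hourly runs, and a real pool will not necessarily grow this linearly.

```python
def hours_until_exhaustion(samples, limit_bytes):
    """Estimate hours from the last sample until usage reaches limit_bytes.

    `samples` is a list of (hour, nonpaged_bytes) readings. Returns None if
    the fitted trend is flat or falling, i.e. no exhaustion is predicted.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope (bytes leaked per hour) and intercept.
    slope = sum((t - mean_t) * (b - mean_b) for t, b in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_b - slope * mean_t
    last_t = max(t for t, _ in samples)
    return (limit_bytes - intercept) / slope - last_t


# Hypothetical readings: roughly 300,000 bytes lost per hourly backup.
readings = [(0, 20_000_000), (1, 20_300_000), (2, 20_600_000)]
remaining = hours_until_exhaustion(readings, limit_bytes=120_800_000)
print(round(remaining / 24, 1), "days")
```
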
The problem was caused by a bug in NT that has not yet been fixed in any
service pack. Specifically, memory is not released if an attempt to read a
partition table returns an error. The MS Knowledge Base article describing the
problem is Q244509, and a hotfix is available from Microsoft.
After we applied the hotfix, the memory leak stopped. However, there are a few
things I don't understand about the problem. Firstly, the TSM process didn't
have to perform a backup to cause a depletion of the non-paged pool. Indeed,
simply running DSMC QUIT would leak memory. Secondly, running a TSM process on
one node of the cluster would cause the memory leak to occur on both nodes.
I would be interested in any comments that any of the developers may have about
these experiences.
For info, the MS KB article relating to diagnosing memory leaks using PoolMon is
Q177415.
Neil Schofield
Yorkshire Water Services Limited
Registered Office 2 The Embankment Sovereign Street Leeds LS1 4BG
Registered in England and Wales No 2366682