Re: INCR backups fail ! TSM 8.1.17 Windows Server and client

David L.A. De Leeuw Wed, 23 Aug 2023 23:09:18 -0700

Good morning, TSM'ers.

My explanation of the problem, as I wrote yesterday, was wrong.
It appears some of our disks on the backup system were defined as "thin 
provisioned".
When these fill up, and cannot expand anymore, Windows will still show empty 
space on them.
I focused my attention on the disks with zero space left. But these were 
working fine.
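
A minimal sketch of one way to check for this, assuming Python is available on the Windows host that holds the containers. Because the free space reported by Windows cannot be trusted on a thin-provisioned disk, the sketch writes a small test file into each directory and reports whether the write succeeded. The function names and the probe size are illustrative assumptions and are not part of TSM. Note that if the hypervisor pauses the VM when the datastore is full, the probe will hang rather than fail.

```python
import os
import tempfile


def can_write(directory, size_bytes=1024 * 1024):
    """Return True if size_bytes of data can be written and flushed in directory."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="probe_") as probe:
            probe.write(os.urandom(size_bytes))
            probe.flush()
            os.fsync(probe.fileno())  # push the data down to the disk
        return True
    except OSError:
        return False


def probe_directories(directories, size_bytes=1024 * 1024):
    """Map each directory to the result of its write probe."""
    return {d: can_write(d, size_bytes) for d in directories}
```

A directory that reports free space but fails the probe is a candidate for being set to read-only.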


Once I set the "thin provisioned" disks to read-only with "update stgpooldir
XXX access=readonly" on TSM, everything worked fine.
I will remove them ASAP.
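
A small sketch of how the command text for several directories could be generated in one go. It only builds strings and does not run anything. The argument order (pool name, then directory) is an assumption to check against the server documentation, the pool name CPOOL is taken from the server messages quoted further down, and the directory names are hypothetical placeholders.

```python
def access_commands(pool, directories, access="readonly"):
    """Build one administrative command string per storage pool directory."""
    if access not in ("readonly", "readwrite"):
        raise ValueError(f"unsupported access mode: {access}")
    return [f"update stgpooldir {pool} {d} access={access}" for d in directories]


# Hypothetical directory names, for illustration only.
for command in access_commands("CPOOL", [r"\\backuphost\dir1", r"\\backuphost\dir2"]):
    print(command)
```

The printed lines can be reviewed and then pasted into an administrative session.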

Conclusion:

1. TSM works fine, but could be more tolerant if one of the stgpooldirs
encounters a problem.
2. Windows Server works fine.
3. VMware works fine, but blocks the service without timely information to the
end user. Only analysis of the logs points to the solution.
4. System manager (me) is too multifunctional and lacks a deeper understanding
of some systems he works with.

Have a nice day

David

-----Original Message-----
From: דוד דה ליאו 
Sent: Wednesday, August 23, 2023 2:35 PM
To: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU>
Cc: סער קליין - Saar Klein <saa...@bgu.ac.il>
Subject: RE: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client

Dear Chavdar, Michall and others,

Just an update on the issue I raised a few days ago concerning the failed 
backups.

Our ISP server uses a separate Windows server with about 100 TB of containers,
divided over 28 disks.
Backups started crashing about a week ago.
At some point the system would get stuck for five minutes or so.

Of course, a lot of different issues happened at the same time:

- The problem started after Windows Server updates, which forced us to reboot
most of our systems.
- One user dumped about a terabyte of mostly small files on our system.
- Our Spectrum Protect system manager was on holiday.
- As always, there are the other usual suspects: antivirus etc.

Our container server runs on VMware ESXi infrastructure. We opened a call with
VMware and sent them the logs of the ESXi server.
They found a very simple cause of the problem: the disks had filled up, and the
system froze.

When checking the logs, I found that the backup containers opened in write mode 
were on disks without any space left, while other disks were less than half 
full.
So here is my solution: set the containerdirs that are full to read-only, move
the containers, and wait until the containers are deleted.
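
To see which directories are affected, the server messages can be tallied per storage pool directory. This is a minimal sketch that assumes the activity log has been saved as plain text and that directory paths contain no spaces. The regular expression is modelled on the ANR2012W lines quoted later in this thread, and the function name is an illustrative assumption.

```python
import re
from collections import Counter

# Modelled on: "ANR2012W Error encountered for storage pool directory:
#  <directory> in storage pool: <pool>. (SESSION: n)"
ANR2012W = re.compile(
    r"ANR2012W Error encountered for storage pool directory:\s*(\S+)"
    r"\s+in storage pool:\s*(\S+?)\.\s"
)


def count_directory_errors(log_text):
    """Count ANR2012W messages per (pool, directory) pair."""
    flat = " ".join(log_text.split()) + " "  # undo the line wrapping of the log
    return Counter((m.group(2), m.group(1)) for m in ANR2012W.finditer(flat))


sample = r"""
18-08-2023 22:57:25 ANR2012W Error encountered for storage pool directory:
 \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool:
 CPOOL. (SESSION: 194578)
18-08-2023 22:57:26 ANR2012W Error encountered for storage pool directory:
 \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool:
 CPOOL. (SESSION: 194578)
"""

for (pool, directory), hits in count_directory_errors(sample).items():
    print(pool, directory, hits)
```

Directories that show up repeatedly are the first candidates to check for available space on the underlying datastore.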

My question is: why is this process not managed automatically by ISP? Why are 
disks with a lot of space not prioritized for writing?

Thanks for your help !

David de Leeuw

-----Original Message-----
From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Chavdar Cholev
Sent: Monday, August 21, 2023 6:11 PM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client

Hi David,
Just make sure that containers are excluded from anti-virus scan.

On Sunday, August 20, 2023, David L.A. De Leeuw <da...@bgu.ac.il> wrote:

> Hi all,
>
> Apparently, this has nothing to do with SP at all !
>
> The (Windows server 2019 on ESXI) system holding the containers just 
> disconnects for 5 minutes !
>
> No pings to the server.
>
> When access is restored, later on, a message appears in the events:
> "The system time has changed to 2023-08-20T19:05:05 from
> 2023-08-20T19:01:04  "
> This is not even a warning, just "information".
>
> I have no idea why this should happen, but we will find it.
> Thanks for your support !
>
> David
>
>
>
> -----Original Message-----
> From: דוד דה ליאו
> Sent: Sunday, August 20, 2023 9:37 PM
> To: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU>
> Subject: RE: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server 
> and client
>
> Hi Michael,
>
> Thanks a lot.
>
> The SP Server is not on a VM, just the storage. I am not the manager of
> the server.
> We just get a lot of backup storage if we provide the space for the
> containers.
>
> Sure we run a lot of sessions in parallel as you said. I will try a 
> run according to your recommendations.
> One other thought I am testing is that over a year ago we also had
> crashes. The 10 Gb optical network had hiccups. Our 1 Gb line worked fine.
> I just switched back to the 1 Gb line and will see what happens.
>
> Will keep you posted !
>
> David
>
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of 
> Michael Prix
> Sent: Sunday, August 20, 2023 9:04 PM
> To: ADSM-L@VM.MARIST.EDU
> Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server 
> and client
>
> Hello David,
>
>   an *SP-Server in a VM is not the best setup, but nevertheless it
> should work - and has proven so in the past.
>
> For the client: Please show the dsm.opt. I suspect you are running
> several sessions from this client in parallel during a backup -> stop
> it for the moment.
> Start with a basic dsm.opt, disable the option "resourceutilization",
> if set, and set "memoryefficient yes" (or "diskcachem" if you like).
> If it still crashes with a plain dsm.opt, you should open a ticket with IBM.
>
> --
> Michael Prix
>
>
>
>
> August 20, 2023 at 7:25 PM, "David L.A. De Leeuw" <da...@bgu.ac.il> wrote:
>
>
> >
> > Hi Chavdar and Michael,
> >
> > Thanks for your thoughts and help.
> >
> > I added "memoryefficientbackup".
> >
> > But still the sessions keep crashing. Once a session crashes, I get a
> > whole bunch of errors for storage pool directories, and in fact the
> > whole pool becomes unavailable.
> > I run "update stgpooldir ... access=readwrite" and all is accessible
> > again.
> > Some of the containers are in an unavailable state and need an audit.
> >
> > Our container storage is on a Dell PowerEdge R730xd with 24 CPUs
> > allocated, 64 GB memory, and 110 TB disk. The disks are declared as VMDKs.
> > Network is on a 10Gb Intel 82588 card.
> > Nothing I can see points to a lack of resources.
> >
> > Everything worked fine till 4 days ago. That is why I thought of a
> > problem with Windows updates, but as I rolled them back, that does not
> > make sense.
> >
> > I am quite at a loss where to look next ...
> >
> > Thanks
> >
> > David
> >
> > [Server side]
> > 20-08-2023 19:47:22 ANR0839I Session 197902 started for node MEDFS2 (WinNT)
> >  (SSL medspice.bgu.ac.il[132.72.73.246]:53184) on
> >  STOREWARE13.auth.ad.bgu.ac.il:1502. (SESSION: 197902)
> > 20-08-2023 19:47:26 ANR8592I Session 197903 connection is using
> >  protocol TLSV13, cipher specification TLS_AES_256_GCM_SHA384,
> >  certificate TSM Self-Signed Certificate. (SESSION: 197903)
> > 20-08-2023 19:47:26 ANR0839I Session 197903 started for node MEDFS2 (WinNT)
> >  (SSL medspice.bgu.ac.il[132.72.73.246]:53185) on
> >  STOREWARE13.auth.ad.bgu.ac.il:1502. (SESSION: 197903)
> > 20-08-2023 19:47:55 ANR2012W Error encountered for storage pool directory:
> >  \\medbackup.med.ad.bgu.ac.il\tsmc20 in storage pool: CPOOL.
> >  (SESSION: 197881)
> > 20-08-2023 19:47:55 ANR1181E sdtxn.c(1404): Data storage transaction
> >  0:83236375 was aborted. (SESSION: 197881)
> > 20-08-2023 19:47:55 ANR0204I The container state for
> >  \\medbackup.med.ad.bgu.ac.il\tsmc17\18\0000000000001853.ncf
> >  is updated from AVAILABLE to UNAVAILABLE. (SESSION: 197883)
> > 20-08-2023 19:47:55 ANR3660E An unexpected error occurred while opening or
> >  writing to the container. Container
> >  \\medbackup.med.ad.bgu.ac.il\tsmc17\18\0000000000001853.ncf
> >  in stgpool CPOOL has been marked as UNAVAILABLE and should be
> >  audited to validate accessibility and content. (SESSION: 197883)
> >
> > [From the client side:]
> >
> > During the incr of a large filespace:
> >
> > Normal File--> 7.132.827 \\medfs2\e$\medusers14\angel\17.8.23 BU -
> > E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder
> > 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of
> > MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx **
> > Unsuccessful **
> > ANS1228E Sending of object '\\medfs2\e$\medusers14\angel\17.8.23 BU -
> > E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder
> > 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of
> > MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' failed.
> > ANS1311E Server out of data storage space
> >
> > [I ran sel of the latest file. It failed because all containerdirs
> > were unavailable.]
> >
> > ANS1804E Selective Backup processing of
> > '\\medfs2\e$\medusers14\angel\17.8.23 BU -
> > E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder
> > 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of
> > MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' finished
> > with failures.
> >
> > Total number of objects inspected: 1
> > Total number of objects backed up: 0
> > Total number of objects updated: 0
> > Total number of objects rebound: 0
> > Total number of objects deleted: 0
> > Total number of objects expired: 0
> > Total number of objects failed: 1
> > ...
> > Network data transfer rate: 148.306,35 KB/sec
> > Aggregate data transfer rate: 211,50 KB/sec
> > Objects compressed by: 0%
> > Total data reduction ratio: 0.23%
> > Subfile objects reduced by: 0%
> > Elapsed processing time: 00:00:32
> > ANS1311E Server out of data storage space
> >
> > [Then I updated the containerdirs to readwrite and ran the selective
> > backup. No problem]
> > ------------------------------------------------------------
> > Protect> sel '\\medfs2\e$\medusers14\angel\17.8.23 BU -
> > E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder
> > 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of
> > MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx'
> > Selective Backup function invoked.
> >
> > Normal File--> 7.132.827 \\medfs2\e$\medusers14\angel\17.8.23 BU -
> > E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder
> > 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of
> > MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx [Sent]
> > Selective Backup processing of '\\medfs2\e$\medusers14\angel\17.8.23
> > BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder
> > 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of
> > MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' finished
> > without failure.
> >
> > -----Original Message-----
> > From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of
> > Chavdar Cholev
> > Sent: Sunday, August 20, 2023 3:43 PM
> > To: ADSM-L@VM.MARIST.EDU
> > Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server
> > and client
> >
> > Just to make sure that we are on the same page...
> > You have TSM installed on a VM running on VMware. This VM has a few LUNs
> > presented, and those LUNs are used for containers?
> >
> > Shot in the dark:
> > 1. Check whether the VM resources match the IBM TSM blueprint.
> > 2. Check the LUN/HDD response time in Performance Monitor. The response
> > time should be around 20-30 ms during the backup operation.
> > 3. Do you know if those HDDs for the LUNs are .vmdk or RDM (raw device mapping)?
> >
> > Thank you!
> > Chavdar
> >
> > On Saturday, August 19, 2023, David L.A. De Leeuw <da...@bgu.ac.il> wrote:
> >
> > >
> > > Hi TSM experts,
> > >
> > >  Our incr backup has failed consistently over the last few days. It
> > >  starts all right, but after a few gigabytes on the client we get the error:
> > >
> > >  ANS1301E This operation cannot continue due to an error on the
> > >  IBM Spectrum Protect server. See your IBM Spectrum Protect server
> > >  administrator for assistance.
> > >
> > >  On the server side we see:
> > >
> > >  18-08-2023 22:57:25 ANR2012W Error encountered for storage pool directory:
> > >  \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool:
> > >  CPOOL. (SESSION: 194578)
> > >  18-08-2023 22:57:25 ANR0530W Transaction failed for session 194578 for
> > >  node MEDFS2 (WinNT) - internal server error detected.
> > >  (SESSION: 194578)
> > >  18-08-2023 22:57:26 ANR2012W Error encountered for storage pool directory:
> > >  \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool:
> > >  CPOOL. (SESSION: 194578)
> > >
> > >  Then we find one or more containers unavailable. We fix the containers
> > >  with "audit container ... action=scanall".
> > >  No errors are found, but the next backup fails again.
> > >
> > >  The server is on 8.1.17, the client as well.
> > >  The containers are on a number of disks on a shared Windows Server 2019.
> > >  There have been some updates on the Windows server recently
> > >  (KB5029247, KB5029647).
> > >
> > >  The audits are fine, data is accessible, but backups fail.
> > >  Any ideas ?
> > >
> > >  David de Leeuw
> > >  Ben-Gurion University of the Negev  Beer Sheva Israel
> > >
> >
>
