Re: Q STG hangs during reclamation / dsmserv process hung

2006-03-07 Thread Josh-Daniel Davis

ENV: AIX 5.2 ML7+, TSM 5.3.2.3, 7026-M80, 3584 12xL2
-
PROBLEM: Does anyone know of a way to kill a migration from inside TSM
 without marking the destination pool read-only? I'm really
 trying to avoid external processes.
-
ACTION TAKEN: Admin scripts previously used RECLAIM STG WAIT=YES
  and MIGRATE STG WAIT=YES.
RESULT: Q STG would hang, the TSM server sometimes crashed, and there
  were various other lock issues.
-
ACTION TAKEN: I rewrote my admin scripts to use UPD STG commands again.

RESULT: The painfully visible lock issues are gone, but a running
migration just keeps going until it reaches the low-migration threshold
(LO) that was in effect when the process started.
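
For anyone who wants to script the UPD STG approach, here is a minimal Python sketch that only builds the command strings. It assumes the usual practice of driving migration with the HIGHMIG/LOWMIG thresholds; the helper name, the threshold values, and the 0-100 / 0-99 ranges are my assumptions rather than something taken from the scripts described above, so check them against your server level.

```python
def build_update_stgpool(pool, highmig, lowmig):
    """Build an UPDATE STGPOOL command string that sets migration thresholds.

    Hypothetical helper: it only formats text, it does not talk to a server.
    """
    if not 0 <= highmig <= 100:
        raise ValueError(f"HIGHMIG must be 0-100, got {highmig}")
    if not 0 <= lowmig <= 99:
        raise ValueError(f"LOWMIG must be 0-99, got {lowmig}")
    if lowmig > highmig:
        raise ValueError("LOWMIG cannot exceed HIGHMIG")
    return f"UPDATE STGPOOL {pool} HIGHMIG={highmig} LOWMIG={lowmig}"


# Assumed values: drop both thresholds to let migration start, raise them
# again so that no NEW migration process is started for the pool.
start_migration = build_update_stgpool("DISKCOL", 0, 0)
stop_new_migrations = build_update_stgpool("DISKCOL", 100, 99)

print(start_migration)
print(stop_new_migrations)
```

Note that raising the thresholds again does not stop a migration that is already running, which is exactly the behaviour described above.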
-
Migration is disabled at 5:55am.
At 6:05am, BA STG started.
I get these sorts of messages at least daily:
   2006-03-06 07:55:58.00 ANR0379W A server database deadlock
      situation has been encountered; the lock request for the af bitfile
      root lock, will be denied to resolve the deadlock.
   2006-03-06 07:55:58.00 ANR1181E aftxn.c(230): Data storage
      transaction 0:595528998 was aborted. (PROCESS: 206)
   2006-03-06 07:55:58.00 ANR2183W dfmigr.c(3018): Transaction
      0:595528998 was aborted. (PROCESS: 206)
   2006-03-06 07:55:58.00 ANR1033W Migration process 206 terminated
      for storage pool DISKCOL - transaction aborted. (PROCESS: 206)
-
I guess that technically, this IS a way to terminate migrations, but it's
a little spooky.
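
If you want to track how often this happens, the wrapped actlog text above can be folded back into one record per message before counting. A rough Python sketch follows; the regular expressions assume the timestamp layout shown in these messages, and the function names are made up for illustration.

```python
import re

# A record starts with a timestamp such as "2006-03-06 07:55:58.00".
STAMP = re.compile(r"^\s*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2}\s")
MSG_ID = re.compile(r"\bANR\d{4}[A-Z]\b")
PROCESS = re.compile(r"\(PROCESS: (\d+)\)")


def join_records(lines):
    """Fold continuation lines into the timestamped line that precedes them."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if STAMP.match(line) or not records:
            records.append(text)
        else:
            records[-1] += " " + text
    return records


def summarize(record):
    """Return (message id, process number); either may be None."""
    msg = MSG_ID.search(record)
    proc = PROCESS.search(record)
    return (msg.group(0) if msg else None,
            int(proc.group(1)) if proc else None)


SAMPLE = """\
   2006-03-06 07:55:58.00 ANR0379W A server database deadlock
  situation has been encountered; the lock request for the af bitfile
  root lock, will be denied to resolve the deadlock.
   2006-03-06 07:55:58.00 ANR1181E aftxn.c(230): Data storage
  transaction 0:595528998 was aborted. (PROCESS: 206)
   2006-03-06 07:55:58.00 ANR2183W dfmigr.c(3018): Transaction
  0:595528998 was aborted. (PROCESS: 206)
   2006-03-06 07:55:58.00 ANR1033W Migration process 206 terminated
  for storage pool DISKCOL - transaction aborted. (PROCESS: 206)
""".splitlines()

records = join_records(SAMPLE)
summary = [summarize(r) for r in records]
for item in summary:
    print(item)
```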

-Josh

Related thread Headers:

Date: Fri, 3 Mar 2006 22:59:02 -0600
From: Josh-Daniel Davis [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: dsmserv process hung.


Date: Fri, 3 Mar 2006 14:51:52 -0800
From: Larry Peifer [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: dsmserv process hung.

Ochs, Duane [EMAIL PROTECTED]
Sent by: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
01/30/2006 12:44 PM
Please respond to
ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU

Date: Wed, 1 Mar 2006 12:10:29 -0500
From: Orville Lantto [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: Q STG hangs during reclamation

From: ADSM: Dist Stor Manager on behalf of Rainer Wolf
Sent: Wed 3/1/2006 3:01 AM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] Q STG hangs during reclamation


From: ADSM: Dist Stor Manager on behalf of Prather, Wanda
Sent: Tue 2/28/2006 4:50 PM
To: ADSM-L@VM.MARIST.EDU
Subject: [ADSM-L] 3584 help


Re: dsmserv process hung.

2006-03-03 Thread Larry Peifer
We too have just started to have this problem in the last 4 days.  In our
case the symptoms and solutions seem to fit what's described in
IBM Document Ref #: PK00196.  However, that was supposed to have been fixed
in the 5.3.1 release, which we are using.  Can anyone shed more light on what
might be triggering this situation?
AIX 5.2 ML5
TSM 5.3.1.0

Here's a series of errors that cropped up this week for the first time.
Any insights would be helpful.

02/27/06   21:59:00  ANRD imgroup.c(1180): ThreadId90 Error 8 retrieving
  Backup Objects row for object 0.101495737 (SESSION: 2838)
02/27/06   21:59:00  ANRD ThreadId90 issued message  from:
  -0x00010001bf74 outDiagf
  -0x0001003fb114 imIsGroupLeader
  -0x000100396b9c SmNodeSession
  -0x00010047f854 HandleNodeSession
  -0x000100485760 smExecuteSession
  -0x00010051c3e4 SessionThread
  -0x0001e958 StartThread
  -0x09286460 _pthread_body (SESSION: 2838)
02/27/06   21:59:00  ANRD smnode.c(7353): ThreadId90 Session 2838:
  Invalid Group Id 0,101495737 for ADD function (SESSION: 2838)
02/27/06   21:59:00  ANRD ThreadId90 issued message  from:
  -0x00010001bf74 outDiagf
  -0x000100396bc4 SmNodeSession
  -0x00010047f854 HandleNodeSession
  -0x000100485760 smExecuteSession
  -0x00010051c3e4 SessionThread
  -0x0001e958 StartThread
  -0x09286460 _pthread_body (SESSION: 2838)
02/28/06   23:24:55  ANRD lmlcaud.c(506): ThreadId75 Error 17 checking
  filespace data for license audit. (PROCESS: 72)
02/28/06   23:24:55  ANRD ThreadId75 issued message  from:
  -0x00010001bf74 outDiagf
  -0x0001006d8e70 LmLcAuditThread
  -0x0001e958 StartThread
  -0x09286460 _pthread_body (PROCESS: 72)
03/01/06   11:20:55  ANRD lmlcaud.c(506): ThreadId43 Error 17 checking
  filespace data for license audit. (PROCESS: 79)
03/01/06   11:20:55  ANRD ThreadId43 issued message  from:
  -0x00010001bf74 outDiagf
  -0x0001006d8e70 LmLcAuditThread
  -0x0001e958 StartThread
  -0x09286460 _pthread_body (PROCESS: 79)
03/03/06   03:41:10  ANRD lmlcaud.c(506): ThreadId51 Error 17 checking
  filespace data for license audit. (PROCESS: 29)
03/03/06   03:41:10  ANRD ThreadId51 issued message  from:
  -0x00010001bf74 outDiagf
  -0x0001006d8e70 LmLcAuditThread
  -0x0001e958 StartThread
  -0x09286460 _pthread_body (PROCESS: 29)

In each case we need to halt and restart the TSM server to free up the
locks.  Finding slack time to do that is not always easy.


Re: dsmserv process hung.

2006-03-03 Thread Josh-Daniel Davis

This happens when 2 threads start to back up the system object, and the
second one starts sending data before the first one is able to create the
group leader, which is the anchor for management and expiration of the
entire system object as a single entity even though it's made of multiple
objects.

As a workaround, you can set resourceutil to 2 on all of your Windows
clients, do another backup of the system objects, and expire the old ones
(through policy changes or just by waiting).

The hang is related to the defect involving RESTORE STGVOL.  We had the
same problem; however, the RESTORE STGVOL process never actually made its
way into the process table.  I would initially be able to get in and HALT
dsmserv.  Officially, the defect indicated that if left to its own
devices, the lock condition would degrade to unreachability.

The fix is in 5.3.2.3.

HOWEVER, we upgraded to 5.3.2.3 and have had SERIOUS lock issues.

SHOW DEADLOCK doesn't show anything.  Actlog will periodically show a
swarm of errors about operations failing due to lock issues, similar to:

2006-02-26 13:00:18.00  ANR2033E UPDATE STGPOOL: Command failed -
lock conflict. (SESSION: 124639)
2006-02-26 13:00:18.00  ANR2033E QUERY STGPOOL: Command failed -
lock conflict. (SESSION: 124664)
2006-02-26 13:00:18.00  ANR2033E QUERY DRMEDIA: Command failed -
lock conflict. (SESSION: 124670)

and similar.

ALSO

MIGRATE STG will lock tables in such a way that Q STG will hang, but Q
PROC and Q SES work.  Client sessions will continue writing to whatever
volume they have; however, most new sessions will also hang.  Once the
offending process is killed, everything resumes.
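
Since Q STG hangs while Q PROC and Q SES still answer, a cheap external canary is to run one query under a deadline and treat a timeout as "hung". Below is a minimal Python sketch of that idea. The two commands are stand-ins so the example runs anywhere; in practice the command would be something like a dsmadmc invocation of "query stgpool" with your own admin ID and password, which I have deliberately left out.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and report 'ok', 'failed', or 'hung' (deadline exceeded)."""
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        return "failed"  # the command could not be started at all
    return "ok" if done.returncode == 0 else "failed"


# Stand-in commands (assumptions for the demo, not TSM commands):
fast_cmd = [sys.executable, "-c", "pass"]
slow_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_deadline(fast_cmd, 10))
print(run_with_deadline(slow_cmd, 0.5))
```

The deadline has to be generous enough that a busy but healthy server is not reported as hung; the 0.5 seconds here is only for the demonstration.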

ALSO

I've found that REPAIR STGVOL has been showing up very often (a
subprocess of RECLAIM STG).

ALSO

Tonight, REPAIR STGVOL, 2 RECLAIM STG and one AUDIT LIC were all running
and had hung.  Unfortunately, I didn't pull dbtxn, txn, lock, etc info
prior to issuing HALT.

ALSO

dsmserv seems to chew up more CPU now than at 5.3.1.6 and 5.3.2.1;
however, I don't have quantitative measurements of the previous levels.

I'm not sure if this progression of locking issues is limited to us or is
a 5.3.2.3 problem; however, I'm very worried about the safety and
stability of TSM.


-Josh


dsmserv process hung.

2006-01-30 Thread Ochs, Duane
AIX 5.3
TSM 5.3.1.2
This weekend one of my three TSM servers had the DSMSERV process hang.
The machine was accessible, the DSMSERV process still existed. It was
still accepting connections but not talking to them. In turn, our
cross-server backups and volume reconciliation from the other 2 TSM
servers hung. One server ended up crashing due to a full recovery log. The
other was near that same point. Looks like the root cause was a full
recovery log on the hung server. 
 
I monitor to see if DSMSERV exists, I monitor for backup and archive
failures. I use operational reporting to give me additional information
for clients. I even monitor to make sure the client scheduler is running
and communicating.   
 
Does anybody have a method in place or an idea to monitor if the TSM
server is actually capable of communication?
 
 

Duane Ochs
Information Systems - Enterprise Computing
Quad/Graphics Inc.
Sussex, Wisconsin
414-566-2375 phone
414-566-4010 pin# 2375 beeper
[EMAIL PROTECTED]
www.QG.com


Re: dsmserv process hung.

2006-01-30 Thread Richard Sims

On Jan 30, 2006, at 3:44 PM, Ochs, Duane wrote:


AIX 5.3
TSM 5.3.1.2
This weekend one of my three TSM servers had the DSMSERV process hang.
The machine was accessible, the DSMSERV process still existed. It was
still accepting connections but not talking to them. ...


Duane - One cause of a problem of this type is a thread failure; some
key thread fails, while the rest of the process lives on, but
rather crippled. There should in any case be evidence in your Activity
Log, typically an ANR message. Where a thread failure has occurred,
there will likely be a dsmserv.err file in the server directory giving
details.


Does anybody have a method in place or an idea to monitor if the TSM
server is actually capable of communication ?


The most standardized method is to test the responsiveness of the TSM
server's Web admin port (usually, 1580). Various HTTP-based packages
can be used to do this. Here is a fragment from execution of an HTTP
prober which I wrote, to illustrate:

 http_check: Connected to HTTP server.  Now sending data...
 http_check: Request 'GET / HTTP/1.1^M^JHost: ourhost.bu.edu^M^J^M^J'
 has been sent to HTTP server '.222.333.444'.  Now awaiting reply...
 http_check: Response took 0.009691 seconds to arrive.
 http_check: Received 2907 bytes of data from HTTP server:
 'HTTP/1.0 200 OK
 Server: ADSM_HTTP/0.1
 Content-type: text/html

 HEAD
 TITLE
 Server Administration
 /TITLE
 ...
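
For those without such a prober, roughly the same check can be sketched in a few lines of Python. This is not the http_check program quoted above, just an illustration with the standard library; the function name and the 5-second default timeout are assumptions, and 1580 is simply the usual Web admin port mentioned earlier.

```python
import http.client
import time


def probe_http(host, port=1580, timeout_s=5.0):
    """Send 'GET /' and return (status, seconds).

    status is None when the server refused the connection, timed out,
    or answered with something that is not valid HTTP.
    """
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        status = None
    finally:
        conn.close()
    return status, time.monotonic() - start
```

A monitor would call probe_http() against the server's hostname on a schedule and alert when the status is None or the elapsed time exceeds whatever limit suits the site. A server that accepts connections but never talks to them shows up as a timeout.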

Or you could run a TSM consolemode perl command, for example, to follow
the Activity Log and call out any irregularities.
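
The same idea works in other scripting languages. As a rough Python sketch, the filter below reads lines (for example, the output of an administrative client running in console mode, piped in by whatever means you prefer) and calls out messages by the severity letter at the end of the ANR message number. Which severities count as irregular is an assumption to tune per site, and the informational sample line is made up for the example.

```python
import re

# ANRnnnnX, where the trailing letter is the severity
# (I = information, W = warning, E = error, S = severe, D = diagnostic).
MSG = re.compile(r"\b(ANR\d{4}([IWESD]))\b")
ALERT_SEVERITIES = {"W", "E", "S"}  # assumed alerting policy


def irregularities(lines, alert=ALERT_SEVERITIES):
    """Yield (message id, line) for every line whose severity is in alert."""
    for line in lines:
        match = MSG.search(line)
        if match and match.group(2) in alert:
            yield match.group(1), line.strip()


SAMPLE = [
    "ANR0000I placeholder informational line (made up for this example)",
    "2006-02-26 13:00:18.00  ANR2033E QUERY STGPOOL: Command failed - "
    "lock conflict. (SESSION: 124664)",
    "2006-03-06 07:55:58.00 ANR0379W A server database deadlock "
    "situation has been encountered",
]

hits = list(irregularities(SAMPLE))
for msg_id, text in hits:
    print(msg_id, "->", text)
```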

   Richard Sims