Re: Q STG hangs during reclamation / dsmserv process hung
ENV: AIX 5.2 ML7+, TSM 5.3.2.3, 7026-M80, 3584 12xL2

PROBLEM: Does anyone know of a way to kill a migration from inside TSM without marking the destination pool read-only? I'm really trying to avoid external processes.

ACTION TAKEN: admin scripts previously used RECLAIM STG WAIT=YES and MIGRATE STG WAIT=YES.
RESULT: Q STG would hang, some TSM server crashes, and various other lock issues.

ACTION TAKEN: I rewrote my admin scripts to use UPD STG commands again.
RESULT: The painfully visible lock issues are gone. Migration just keeps going until it reaches the LO in effect at the time the process started.

Migration is disabled at 5:55am. At 6:05am, BA STG started. I get these sorts of messages at least daily:

2006-03-06 07:55:58.00 ANR0379W A server database deadlock situation has been encountered; the lock request for the af bitfile root lock, will be denied to resolve the deadlock.
2006-03-06 07:55:58.00 ANR1181E aftxn.c(230): Data storage transaction 0:595528998 was aborted. (PROCESS: 206)
2006-03-06 07:55:58.00 ANR2183W dfmigr.c(3018): Transaction 0:595528998 was aborted. (PROCESS: 206)
2006-03-06 07:55:58.00 ANR1033W Migration process 206 terminated for storage pool DISKCOL - transaction aborted. (PROCESS: 206)

I guess that technically, this IS a way to terminate migrations, but it's a little spooky.

-Josh

Related thread headers:

Date: Fri, 3 Mar 2006 22:59:02 -0600
From: Josh-Daniel Davis [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: dsmserv process hung.

Date: Fri, 3 Mar 2006 14:51:52 -0800
From: Larry Peifer [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: dsmserv process hung.

Ochs, Duane [EMAIL PROTECTED]
Sent by: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
01/30/2006 12:44 PM
Please respond to ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU

Date: Wed, 1 Mar 2006 12:10:29 -0500
From: Orville Lantto [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: Q STG hangs during reclamation

From: ADSM: Dist Stor Manager on behalf of Rainer Wolf
Sent: Wed 3/1/2006 3:01 AM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] Q STG hangs during reclamation

From: ADSM: Dist Stor Manager on behalf of Prather, Wanda
Sent: Tue 2/28/2006 4:50 PM
To: ADSM-L@VM.MARIST.EDU
Subject: [ADSM-L] 3584 help
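For anyone who wants to count how often migrations die this way, here is a minimal Python sketch that picks the ANR1033W "migration terminated" entries out of activity log text. It assumes the line layout seen in the four messages quoted in Josh's post (timestamp, message number, text, optional PROCESS or SESSION tag); the names LINE_RE, parse_line and aborted_migrations are made up for illustration, and other server levels or date formats may need a different pattern.

```python
import re
from collections import Counter

# Layout assumed from the quoted lines (an assumption, not a TSM spec):
#   <date> <time> <ANRnnnnX> <text> [(PROCESS: n) | (SESSION: n)]
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2}) "
    r"(?P<msg>ANR\d{4}(?P<sev>[A-Z])) "
    r"(?P<text>.*?)"
    r"(?: \((?P<kind>PROCESS|SESSION): (?P<num>\d+)\))?$"
)


def parse_line(line):
    """Return the fields of one activity log line as a dict, or None."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def aborted_migrations(lines):
    """Yield (timestamp, process number, pool) for each ANR1033W line."""
    for line in lines:
        rec = parse_line(line)
        if rec and rec["msg"] == "ANR1033W":
            pool = re.search(r"storage pool (\S+)", rec["text"])
            yield rec["ts"], rec["num"], pool.group(1) if pool else None


# The four lines from the post, used as sample input.
SAMPLE = [
    "2006-03-06 07:55:58.00 ANR0379W A server database deadlock situation "
    "has been encountered; the lock request for the af bitfile root lock, "
    "will be denied to resolve the deadlock.",
    "2006-03-06 07:55:58.00 ANR1181E aftxn.c(230): Data storage transaction "
    "0:595528998 was aborted. (PROCESS: 206)",
    "2006-03-06 07:55:58.00 ANR2183W dfmigr.c(3018): Transaction "
    "0:595528998 was aborted. (PROCESS: 206)",
    "2006-03-06 07:55:58.00 ANR1033W Migration process 206 terminated for "
    "storage pool DISKCOL - transaction aborted. (PROCESS: 206)",
]

# Tally of message severities (the trailing letter of the message number).
SEVERITIES = Counter(parse_line(line)["sev"] for line in SAMPLE)
```

Calling list(aborted_migrations(SAMPLE)) gives one entry for process 206 on pool DISKCOL. Feeding it a day of saved activity log output would show whether the deadlock aborts cluster around the 6:05am BA STG start.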
Re: dsmserv process hung.
We too have just started to have this problem in the last 4 days. In our case the symptoms and solutions seem to fit in with what's described in IBM Document Ref #: PK00196. However, that was to have been fixed in the 5.3.1 release, which we are using. Can anyone shed more light on what might be triggering this situation?

AIX 5.2 ML5
TSM 5.3.1.0

Here's a series of errors that cropped up this week for the first time. Any insights would be helpful.

02/27/06 21:59:00 ANRD imgroup.c(1180): ThreadId90 Error 8 retrieving Backup Objects row for object 0.101495737 (SESSION: 2838)
02/27/06 21:59:00 ANRD ThreadId90 issued message from: -0x00010001bf74 outDiagf -0x0001003fb114 imIsGroupLeader -0x000100396b9c SmNodeSession -0x00010047f854 HandleNodeSession -0x000100485760 smExecuteSession -0x00010051c3e4 SessionThread -0x0001e958 StartThread -0x09286460 _pthread_body (SESSION: 2838)
02/27/06 21:59:00 ANRD smnode.c(7353): ThreadId90 Session 2838: Invalid Group Id 0,101495737 for ADD function (SESSION: 2838)
02/27/06 21:59:00 ANRD ThreadId90 issued message from: -0x00010001bf74 outDiagf -0x000100396bc4 SmNodeSession -0x00010047f854 HandleNodeSession -0x000100485760 smExecuteSession -0x00010051c3e4 SessionThread -0x0001e958 StartThread -0x09286460 _pthread_body (SESSION: 2838)
02/28/06 23:24:55 ANRD lmlcaud.c(506): ThreadId75 Error 17 checking filespace data for license audit. (PROCESS: 72)
02/28/06 23:24:55 ANRD ThreadId75 issued message from: -0x00010001bf74 outDiagf -0x0001006d8e70 LmLcAuditThread -0x0001e958 StartThread -0x09286460 _pthread_body (PROCESS: 72)
03/01/06 11:20:55 ANRD lmlcaud.c(506): ThreadId43 Error 17 checking filespace data for license audit. (PROCESS: 79)
03/01/06 11:20:55 ANRD ThreadId43 issued message from: -0x00010001bf74 outDiagf -0x0001006d8e70 LmLcAuditThread -0x0001e958 StartThread -0x09286460 _pthread_body (PROCESS: 79)
03/03/06 03:41:10 ANRD lmlcaud.c(506): ThreadId51 Error 17 checking filespace data for license audit. (PROCESS: 29)
03/03/06 03:41:10 ANRD ThreadId51 issued message from: -0x00010001bf74 outDiagf -0x0001006d8e70 LmLcAuditThread -0x0001e958 StartThread -0x09286460 _pthread_body (PROCESS: 29)

In each case we need to halt and restart the TSM server to free up the locks. Finding slack time to do that is not always easy.

Ochs, Duane [EMAIL PROTECTED]
Sent by: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
01/30/2006 12:44 PM
Please respond to ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
cc:
Subject: [ADSM-L] dsmserv process hung.

AIX 5.3 TSM 5.3.1.2 This weekend one of my three TSM servers had the DSMSERV process hang. The machine was accessible, the DSMSERV process still existed. It was still accepting connections but not talking to them. ...
Re: dsmserv process hung.
This happens when 2 threads start to back up the system object, and the second one starts sending data before the first one is able to create the group leader, which is the anchor for management and expiration of the entire system object as a single entity even though it's made of multiple objects.

As a workaround, you can set resourceutil to 2 on all of your Windows clients, do another backup of the system objects, and expire the old ones (through policy changes or just by waiting).

The hang is related to the defect involving RESTORE STGVOL. We had the same problem; however, the RESTORE STGVOL process never actually made its way into the process table. I would initially be able to get in and HALT dsmserv. Officially, the defect indicated that if left to its own devices, the lock condition would degrade to unreachability. The fix is in 5.3.2.3.

HOWEVER, we upgraded to 5.3.2.3 and have had SERIOUS lock issues. SHOW DEADLOCK doesn't show anything. Actlog will periodically show a swarm of errors about operations failing due to lock issues, similar to:

2006-02-26 13:00:18.00 ANR2033E UPDATE STGPOOL: Command failed - lock conflict. (SESSION: 124639)
2006-02-26 13:00:18.00 ANR2033E QUERY STGPOOL: Command failed - lock conflict. (SESSION: 124664)
2006-02-26 13:00:18.00 ANR2033E QUERY DRMEDIA: Command failed - lock conflict. (SESSION: 124670)

and similar.

ALSO MIGRATE STG will lock tables in such a way that Q STG will hang, but Q PROC and Q SES work. Client sessions will continue writing to whatever volume they have; however, most new sessions will also hang. Once the offending process is killed, everything resumes.

ALSO I've found that REPAIR STGVOL has been showing up very often (a subprocess of RECLAIM STG).

ALSO Tonight, REPAIR STGVOL, 2 RECLAIM STG and one AUDIT LIC were all running and had hung. Unfortunately, I didn't pull dbtxn, txn, lock, etc. info prior to issuing HALT.
ALSO dsmserv seems to chew up more CPU now than at 5.3.1.6 and 5.3.2.1; however, I don't have quantitative measurements of the previous levels.

I'm not sure if this progression of locking issues is limited to us or is a 5.3.2.3 problem; however, I'm very worried about the safety and stability of TSM.

-Josh

On 06.03.03 at 14:51 [EMAIL PROTECTED] wrote:

Date: Fri, 3 Mar 2006 14:51:52 -0800
From: Larry Peifer [EMAIL PROTECTED]
Reply-To: ADSM: Dist Stor Manager ADSM-L@VM.MARIST.EDU
To: ADSM-L@VM.MARIST.EDU
Subject: Re: dsmserv process hung.

We too have just started to have this problem in the last 4 days. In our case the symptoms and solutions seem to fit in with what's described in IBM Document Ref #: PK00196. However that was to have been fixed with 5.3.1 release which we are using. Can anyone shed more light on what might be triggering this situation? AIX 5.2 ML5 TSM 5.3.1.0 ...
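The "swarm" of ANR2033E lock conflicts Josh describes could be spotted mechanically rather than by eye. The sketch below groups ANR2033E lines by the second they were logged and reports any second with several failures. It assumes the activity log layout shown in his three sample lines; lock_conflict_swarms and the threshold of 3 are illustrative choices, not anything TSM defines.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   2006-02-26 13:00:18.00 ANR2033E QUERY STGPOOL: Command failed - lock conflict. (SESSION: 124664)
# The layout is taken from the lines quoted in the post (an assumption).
LOCK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d{2} "
    r"ANR2033E (?P<cmd>[A-Z ]+): Command failed - lock conflict\."
)


def lock_conflict_swarms(lines, threshold=3):
    """Map each second with at least `threshold` lock conflicts to the commands hit."""
    by_second = defaultdict(list)
    for line in lines:
        m = LOCK_RE.match(line.strip())
        if m:
            by_second[m.group("ts")].append(m.group("cmd"))
    return {ts: cmds for ts, cmds in by_second.items() if len(cmds) >= threshold}


# The three lines from the post, used as sample input.
SAMPLE = [
    "2006-02-26 13:00:18.00 ANR2033E UPDATE STGPOOL: Command failed - lock conflict. (SESSION: 124639)",
    "2006-02-26 13:00:18.00 ANR2033E QUERY STGPOOL: Command failed - lock conflict. (SESSION: 124664)",
    "2006-02-26 13:00:18.00 ANR2033E QUERY DRMEDIA: Command failed - lock conflict. (SESSION: 124670)",
]
```

With the sample above, lock_conflict_swarms(SAMPLE) reports the single second 13:00:18 together with the three commands that failed. Lining those seconds up against Q PROC output saved at the same time might help show which process was holding the lock.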
dsmserv process hung.
AIX 5.3
TSM 5.3.1.2

This weekend one of my three TSM servers had the DSMSERV process hang. The machine was accessible, and the DSMSERV process still existed. It was still accepting connections but not talking to them. In turn, our cross-server backups and volume reconciliation from the other 2 TSM servers hung. One server ended up crashing due to a full recovery log. The other was near that same point. It looks like the root cause was a full recovery log on the hung server.

I monitor to see if DSMSERV exists, and I monitor for backup and archive failures. I use operational reporting to give me additional information for clients. I even monitor to make sure the client scheduler is running and communicating.

Does anybody have a method in place or an idea to monitor if the TSM server is actually capable of communication?

Duane Ochs
Information Systems - Enterprise Computing
Quad/Graphics Inc.
Sussex, Wisconsin
414-566-2375 phone
414-566-4010 pin# 2375 beeper
[EMAIL PROTECTED]
www.QG.com
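Since the hung server described above still existed as a process and still accepted connections, a check has to ask it a question and put a deadline on the answer. Here is a minimal Python sketch of that idea: run some command that queries the server and treat a timeout as "not communicating". The function name responds_within is made up for this example. The command to run is left to the caller; an administrative client call that issues a cheap query is one assumption about what would go there, and the admin ID and timeout would be site choices.

```python
import subprocess


def responds_within(cmd, timeout):
    """Return True if `cmd` exits with status 0 before `timeout` seconds elapse.

    `cmd` is an argument list for whatever probe command is used. A hung
    server shows up as a timeout rather than as an error, so a timeout is
    reported as a failure.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return False  # command started but never finished: treat as hung
    except OSError:
        return False  # command could not be started at all
    return done.returncode == 0
```

A cron job could call this with something like ["dsmadmc", "-id=monitor", "-password=...", "query status"] and a 60 second timeout, then page someone when it returns False. The ID "monitor" is hypothetical, and the password handling would need more care than this in practice.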
Re: dsmserv process hung.
On Jan 30, 2006, at 3:44 PM, Ochs, Duane wrote:

AIX 5.3 TSM 5.3.1.2 This weekend one of my three TSM servers had the DSMSERV process hang. The machine was accessible, the DSMSERV process still existed. It was still accepting connections but not talking to them. ...

Duane - One cause of a problem of this type is a thread failure: some key thread fails, while the rest of the process lives on, but rather crippled. There should in any case be evidence in your Activity Log, typically an ANR message. Where a thread failure has occurred, there will likely be a dsmserv.err file in the server directory giving details.

Does anybody have a method in place or an idea to monitor if the TSM server is actually capable of communication ?

The most standardized method is to test the responsiveness of the TSM server's Web admin port (usually, 1580). Various HTTP-based packages can be used to do this. Here is a fragment from execution of an HTTP prober which I wrote, to illustrate:

http_check: Connected to HTTP server. Now sending data...
http_check: Request 'GET / HTTP/1.1^M^JHost: ourhost.bu.edu^M^J^M^J' has been sent to HTTP server '.222.333.444'. Now awaiting reply...
http_check: Response took 0.009691 seconds to arrive.
http_check: Received 2907 bytes of data from HTTP server: 'HTTP/1.0 200 OK
Server: ADSM_HTTP/0.1
Content-type: text/html
HEAD TITLE Server Administration /TITLE ...

Or you could run a TSM consolemode perl command, for example, to follow the Activity Log and call out any irregularities.

Richard Sims
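The Web admin port test Richard describes can be approximated with the Python standard library. This is not his http_check program, only a rough sketch of the same idea: send GET / to the admin port, wait a bounded time, and report whether any HTTP reply came back and how long it took. The function name probe_admin_port and the 10 second default are made up here; port 1580 is the usual value he mentions.

```python
import http.client
import time


def probe_admin_port(host, port=1580, timeout=10.0):
    """Send GET / and return (responsive, seconds, detail).

    Any HTTP reply counts as responsive, since the point is only to show
    that the server is talking. A refused connection, a timeout or a
    malformed reply counts as unresponsive.
    """
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()  # drain the body so a stall mid-reply is caught too
        return True, time.monotonic() - start, f"HTTP {resp.status}"
    except (OSError, http.client.HTTPException) as exc:
        return False, time.monotonic() - start, repr(exc)
    finally:
        conn.close()
```

For example, ok, secs, detail = probe_admin_port("ourhost.bu.edu") would check the host named in the fragment above. A server that accepts the connection but never answers, as in Duane's case, should come back as unresponsive once the timeout expires.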