Thomas,

I haven't forgotten about this... I did find one place in our code where, if a memory allocation fails, a message is logged only to a trace file (not the dsmerror.log file). A fix for this is currently targeted for a future release.
I do not know if this covers all the various scenarios you encountered, but it sounds closest to the issue you initially reported (the RC 12 with no other error message).

Regards,

- Andy

____________________________________________________________________________

Andrew Raibeck | Tivoli Storage Manager Level 3 Technical Lead | [email protected]

IBM Tivoli Storage Manager links:
Product support: http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager
Online documentation: https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/Tivoli+Documentation+Central/page/Tivoli+Storage+Manager
Product Wiki: https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/Tivoli+Storage+Manager/page/Home

"ADSM: Dist Stor Manager" <[email protected]> wrote on 2014-07-22 10:36:01:

> From: Thomas Denier <[email protected]>
> To: [email protected]
> Date: 2014-07-22 10:39
> Subject: Re: Backup fails with no error message
> Sent by: "ADSM: Dist Stor Manager" <[email protected]>
>
> Andy,
>
> It looks like the problem was in fact a shortage of memory. The
> problem started occurring again this past Saturday. A backup with
> tracing after an earlier occurrence of the problem ran out of stack
> space. I attempted to raise the stack size limit from its default of
> about 32 MB to its hard limit of about 4 GB before running another
> backup with tracing. For some reason I only got half the requested
> limit. The backup failed, but produced a useful error message for
> the first time in the history of this problem: an ANS1225E message
> indicating that the client software was unable to obtain memory
> needed for file compression. I was able to rerun the backup
> successfully after using the 'ulimit' command to allow unlimited
> memory size and data segment size. The default soft limits are in
> fact much smaller than the corresponding values on most of our other
> systems.
> The default data segment size is about 128 MB and the
> default memory size is about 1 GB. I am currently trying to get the
> system vendor to sign off on a request to allow unlimited memory and
> data segment sizes for backups of the resource group disks.
>
> Thomas Denier
> Thomas Jefferson University Hospital
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager [mailto:[email protected]] On Behalf Of Andrew Raibeck
> Sent: Wednesday, July 09, 2014 10:44 PM
> To: [email protected]
> Subject: Re: [ADSM-L] Backup fails with no error message
>
> It is a puzzler.
>
> Just to verify: you have checked dsmerror.log as well for error
> messages and found nothing? Another thought is to check the TSM
> server activity log for any telltale error or warning messages that
> might provide a hint.
>
> The TSM client return codes are derived directly from the severity
> of the messages issued during whatever operation is running.
> ANSnnnnI messages are RC 0; ANSnnnnW are RC 8; and ANSnnnnE or
> ANSnnnnS are RC 12. The exceptions are related to skipped files:
> these "exception" messages are ANSnnnnE but the return code handling
> sets the RC to 4. The highest severity prevails, so if, for example,
> an ANSnnnnW (RC 8) and ANSnnnnE (RC 12) are issued, then the RC will
> be 12. We have had the odd "skipped file" message that is not
> setting the RC to 4, but those have been fixed via APARs, and in any
> case I would still expect some error message in the log. If you
> inspect the error log, let me
>
> The "GlobalRC" trace example I showed you illustrates when a
> non-zero producing message sets the return code. Thus when whatever
> message is processed that trips the RC 12, I would expect to see it
> in the trace. If you have trace files from when the problem did not
> occur, and the RC was 0, then I would not expect to see any of the
> "GlobalRC" messages in the trace.
>
> I am a little surprised if no such error message appears in the
> dsmerror.log file.
> I have recently seen one case where the client
> experiences an "out of memory" error but no message was written to
> the console, schedule log, or error log. However, the SERVICE trace
> is still sufficient to reveal the problem. What are the ulimits set
> to for this client, and are there an unusually large number of files in
> any of these file systems? Are we talking about millions of files,
> and maybe the file system is on the cusp of running out of memory
> during backup? It's a long shot, but figured I'd mention it.
>
> If you are willing to continue to run the tracing, it would be a good idea.
> If the problem persists but you are unable to obtain a trace, open a
> PMR and we'll have to come up with an alternative way to figure out
> what is going on.
>
> Regards,
>
> - Andy
>
> "ADSM: Dist Stor Manager" <[email protected]> wrote on 2014-07-09 11:41:51:
>
> > From: Thomas Denier <[email protected]>
> > To: [email protected],
> > Date: 2014-07-09 11:42
> > Subject: Re: Backup fails with no error message
> > Sent by: "ADSM: Dist Stor Manager" <[email protected]>
> >
> > The regularly scheduled backup ran successfully on Tuesday morning.
> > The scheduled backup this morning failed with exit status 12 and no
> > error message. The backup start and end times indicated that the
> > failure occurred while processing a different file system in the same
> > resource group.
> >
> > I ran a backup of the file system with service tracing enabled. The
> > TSM client eventually crashed with a segmentation fault. I found two
> > trace files, neither of which contained 'GlobalRC'. The core file from
> > the crash consumed nearly all of the remaining space in the root file
> > system. As far as I can tell, a system administrator responding to an
> > automated alert removed the core file without consulting me.
> >
> > I ran a backup of the entire resource group without tracing. This was
> > successful.
> >
> > I am thinking of upgrading the client software, even though none of
> > the bug fixes listed has any obvious connection to the behavior I am
> > seeing.
> >
> > Should I just keep trying the tracing every time a backup fails and
> > hope I eventually get lucky and obtain a useful trace?
> >
> > Thomas Denier
> > Thomas Jefferson University Hospital
> >
> > -----Original Message-----
> > From: ADSM: Dist Stor Manager [mailto:[email protected]] On Behalf Of Andrew Raibeck
> > Sent: Monday, July 07, 2014 4:19 PM
> > To: [email protected]
> > Subject: Re: [ADSM-L] Backup fails with no error message
> >
> > Thomas,
> >
> > Run the failing backup command and this time add these parameters:
> >
> > -traceflags=service -tracefile=/sometracefilename
> >
> > For example:
> >
> > dsmc inc /main/UT -servername=DC1P1_MAIN -traceflags=service -tracefile=/tsmtrace.out
> >
> > Name the trace file whatever you want, just make sure to put it in a
> > file system with room for a potentially large trace file.
> >
> > Note: If you anticipate GB and GB of output, you can add the option
> > -tracemax=1024 to wrap the trace file at 1 GB. The risk is, if
> > whatever happens is not immediately causing the backup to stop, the
> > needed trace lines could be written over due to wrapping. But based on
> > your description, off-hand I'd say the backup stops when the problem
> > occurs so the risk due to wrapping should be low.
> >
> > After the backup finishes with the RC 12, scan the trace for "GlobalRC"
> > (without the quotes) and you should find lines like these:
> >
> > 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 428): msgNum = 1076 changed the Global RC.
> > 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 429): Old values: rc = 0, rcMacroMax = 0, rcMax = 0.
> > 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 443): New values: rc = 12, rcMacroMax = 12, rcMax = 12.
> >
> > This will show you which message is driving the RC change. In my
> > example, "msgNum = 1076" corresponds to ANS1076E.
> >
> > Based on the message, you might be able to figure out the rest; but at
> > the least you have a trace file you can send in to support.
> >
> > Regards,
> >
> > - Andy
> >
> > "ADSM: Dist Stor Manager" <[email protected]> wrote on 2014-07-07 15:59:57:
> >
> > > From: Thomas Denier <[email protected]>
> > > To: [email protected],
> > > Date: 2014-07-07 16:00
> > > Subject: Backup fails with no error message
> > > Sent by: "ADSM: Dist Stor Manager" <[email protected]>
> > >
> > > We have an AIX system on which backups of a specific file system
> > > terminate with exit status 12 but with no error message indicating a
> > > reason for this exit status.
> > > If I execute the command
> > >
> > > dsmc inc /main/UT -servername=DC1P1_MAIN
> > >
> > > as root, I will see typical messages about the number of files
> > > processed and about specific files being backed up, followed by the
> > > usual summary messages. The exit status will be 12. The summary
> > > statistics will show a number of files examined equal to about half
> > > the number of files present in the file system. There will not be any
> > > error message explaining the exit status or the failure to examine the
> > > entire file system.
> > >
> > > The DC1P1_MAIN stanza in dsm.sys has some unusual features because it
> > > is used to back up one of the resource groups for a clustered
> > > environment. The stanza includes three 'domain' statements listing the
> > > file systems in the resource group. The stanza includes a 'nodename'
> > > option specifying the node name that owns the backup files from the
> > > resource group. The stanza includes an 'asnode' option specifying the
> > > node name used to authenticate sessions from the cluster node involved
> > > (we and the system vendor were not able to agree on an acceptable
> > > arrangement for storing a TSM password within the resource group).
> > > This stanza works fine for the other file systems in the same resource
> > > group, and worked fine for /main/UT up until June 26.
> > >
> > > I have found two ways to circumvent the problem. One circumvention is
> > > to run the command
> > >
> > > dsmc inc /main/UT/ -subdir=y -servername=DC1P1_MAIN
> > >
> > > to back up the top level directory of the file system rather than the
> > > file system as such. An 'lsfs' command shows nothing unusual about the
> > > file system; it is a jfs2 file system, like all the other file
> > > systems, and uses the same mount options as the other file systems.
> > > The other circumvention is to add an 'exclude.dir' line for a specific
> > > subdirectory of /main/UT to the include/exclude file. The subdirectory
> > > came under suspicion because it was last updated a few hours after the
> > > last fully successful backup.
> > >
> > > The client code is TSM 6.4.1.0. The client OS is AIX 7.1. The TSM
> > > server is TSM 6.2.5.0 running under zSeries Linux.
> > >
> > > Does anyone recognize this as a known problem? If not, does anyone
> > > have suggestions for presenting the problem to TSM support? I am
> > > having difficulty imagining any kind of productive interaction if I
> > > don't have a message identifier to report.
> > >
> > > Thomas Denier
> > > Thomas Jefferson University Hospital
> > >
> > > The information contained in this transmission contains privileged
> > > and confidential information. It is intended only for the use of the
> > > person named above. If you are not the intended recipient, you are
> > > hereby notified that any review, dissemination, distribution or
> > > duplication of this communication is strictly prohibited. If you are
> > > not the intended recipient, please contact the sender by reply email
> > > and destroy all copies of the original message.
> > >
> > > CAUTION: Intended recipients should NOT use email communication for
> > > emergent or urgent health care matters.
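
The return code rule Andy describes in this thread (ANSnnnnI is RC 0, ANSnnnnW is RC 8, ANSnnnnE or ANSnnnnS is RC 12, "skipped file" exceptions are RC 4, and the highest value prevails) can be sketched in Python as follows. This is only an illustration of the rule as stated in the thread, not the client's actual implementation. The function name, the caller-supplied set of skipped-file message IDs, and the placeholder IDs ANS9998W and ANS9999E are hypothetical; ANS1225E is the message reported in the thread.

```python
import re

# Severity letter -> return code, as described in the thread.
SEVERITY_RC = {"I": 0, "W": 8, "E": 12, "S": 12}

# Message IDs of the form ANSnnnnX: four digits, then one severity letter.
MESSAGE_ID = re.compile(r"^ANS(\d{4})([IWES])$")


def expected_rc(message_ids, skipped_file_ids=frozenset()):
    """Return the highest RC implied by the given message IDs.

    skipped_file_ids is a hypothetical, caller-supplied set of message IDs
    that are treated as "skipped file" exceptions (RC 4 instead of RC 12).
    """
    rc = 0
    for msg in message_ids:
        match = MESSAGE_ID.match(msg)
        if match is None:
            raise ValueError(f"not an ANSnnnnX message ID: {msg!r}")
        if msg in skipped_file_ids:
            msg_rc = 4
        else:
            msg_rc = SEVERITY_RC[match.group(2)]
        # The highest severity seen so far prevails.
        rc = max(rc, msg_rc)
    return rc


# ANS9998W and ANS9999E are placeholders, not real message IDs.
print(expected_rc(["ANS9998W", "ANS1225E"]))                       # warning + error
print(expected_rc(["ANS9999E"], skipped_file_ids={"ANS9999E"}))    # skipped-file exception
```

Under this rule, an RC of 12 with nothing in dsmerror.log means some E or S message drove the return code without being logged, which is the situation the trace was meant to expose.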
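
For a large SERVICE trace, the scan for "GlobalRC" described in this thread could be automated along the following lines. This is a rough sketch that assumes the trace lines look like the three sample lines Andy posted; the real trace layout may differ between client versions and platforms. The function name and the regular expressions are illustrative assumptions, not part of any TSM tooling.

```python
import re

# Patterns based only on the sample trace lines quoted in the thread.
MSG_NUM = re.compile(r"GlobalRC\.cpp.*msgNum = (\d+) changed the Global RC")
NEW_RC = re.compile(r"GlobalRC\.cpp.*New values: rc = (\d+)")


def find_rc_changes(lines):
    """Return (message_number, new_rc) pairs found in an iterable of trace lines."""
    changes = []
    pending = None  # message number still waiting for its "New values" line
    for line in lines:
        match = MSG_NUM.search(line)
        if match:
            pending = int(match.group(1))
            continue
        match = NEW_RC.search(line)
        if match and pending is not None:
            changes.append((pending, int(match.group(1))))
            pending = None
    return changes


# The sample lines from the thread, used here as test input.
SAMPLE = [
    r"07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 428): msgNum = 1076 changed the Global RC.",
    r"07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 429): Old values: rc = 0, rcMacroMax = 0, rcMax = 0.",
    r"07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 443): New values: rc = 12, rcMacroMax = 12, rcMax = 12.",
]

for msg_num, new_rc in find_rc_changes(SAMPLE):
    print(f"message number {msg_num} set rc to {new_rc}")
```

To scan a real trace, pass an open file instead of SAMPLE, for example `with open("/tsmtrace.out", errors="replace") as f: find_rc_changes(f)`. An empty result would match what Thomas saw when the client crashed before any 'GlobalRC' line was written.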
