Don V Nielsen asks:
| Where might one find good instruction on how to read a dump?
For the most part, everybody I know who's any good at it got that way all on their own. They may have taken a class from IBM or Amdahl back when they were young and green, but nobody ever told me they learned it in a classroom setting. In the past, mind you, I have managed to teach a few how to do it (who later got good). Nevertheless, it was really the students themselves and their own personal effort that brought their skill level up to "great."

There are different types of dumps, and therefore different distributions of the dump reading skills that are needed and that exist. The focus of this list is Assembler, and if you're an Assembler or even a high-level language (application) programmer, dealing with code that, for the most part, runs in problem state, key 8, then (except when your code branches out into the ozone) the skills you need to read any dump resulting from something you did that turned out to be ill-advised are very, very different from the skills needed to read the dumps that my group reads.

When our stuff goes South, we may or may not even be in our own code, and we may or may not even be in one of our address spaces. Much of the storage we peek at is not ours, and even when the bug is in our code it can frequently seem like it's not, and nothing close at hand in the address space to the current PSW or registers seems to have anything to do with our playpen. The skills needed to figure all THAT out are completely different from "why did my code ABEND S0C4 because I loaded a bad pointer to something in R3?"

I think some of the best training can be gleaned when you are in a position to cause your own problems, which you then have to debug by reading the resulting dump. If you have to read a dump caused by something you didn't do, and which you can't even fix, I think one of the essential motivations is simply absent.
On the other hand, even if you didn't do it, but you are in a position to fix it (potentially), then there is at least some incentive to dig and understand.

All that said, for what we do, figuring out what happened, and what sequence of events led up to the failure, requires in many, many cases an understanding of what's going on inside the operating system at the component and interrupt level. Sure, there are the tools (like IPCS) that one needs to learn how to use, but if one does not understand what all those things in the Assembler Services manuals do, and what control blocks they cause to be created, and what events they cause to happen when called, and what all that stuff looks like when a system is running normally, then it's all complete gobbledygook.

I can well imagine that for someone who isn't versed in MVS services and internals, looking at a system dump using IPCS could well be like me looking at a Windows storage dump. I've looked over the shoulder of a genuine guru doing that, using tools and facilities that I had never before been aware of. That made it clear to me that if I didn't have an intimate understanding of Win32S (or whatever API was being used), even the few things that could be formatted by some dump reading/viewing tool would forever remain obscure. That is a completely different world, and although we have folks who are immersed in it, that's not me. But more important to this discussion, I think that only the simplest of the "dump reading" skills that I do have would apply to that environment.

I know how to debug code, but to debug it from just a dump (which is frequently all you have) requires a set of skills that depends on the type of bug it is and the environment in which it manifests itself.
If your own code did it, and no other code is involved (meaning no operating system services), it just requires "debugging skills," and the only "dump reading" skills needed are how to use the tool to move around in the dump and find stuff. But if your code didn't do it, or your code did do it but interacted with other code, potentially messing it up, much more than just "debugging skills" are needed.

For example, when things go South because you stomped on somebody else's storage (or they on yours), the bug may not even be in "your code" that you are responsible for maintaining. The bug may be in some other product or some other component or even in some other vendor's product (including IBM's stuff or z/OS itself). Sometimes we have only a dump with storage (of ours) that [we later determined] got overlaid literally weeks or months before something failed and the event came to human attention. Finding the footprints of something like that in a dump usually requires luck, of course, but also a completely different set of skills from those needed to navigate your way around something more straightforward.

For folks who want to learn how to read system dumps that are the consequence of failures in the operating system itself, or in products like ours that behave like they are part of the operating system, I usually answer the question this way:

First, learn how to use every service and function that IBM documents. Write a program to use it. Then write three more. Lather, rinse, repeat. Take or cause dumps in the middle of all that stuff going on and see what things look like in storage when you do so. Learn to recognize the several dozen major task, storage, data management, and contents supervision control blocks on sight, in hex, and learn what the things you did (called) in your program then look like in the system trace, et al.
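As one small illustration of what "on sight, in hex" means in practice, here is a minimal sketch in Python that picks apart a standard 72-byte (18-fullword) save area. It assumes the classic layout only (back chain at +4, forward chain at +8, R14 at +12, R15 at +16, R0 through R12 from +20), not the larger 64-bit formats or the linkage stack. The function names and the sample storage contents are made up for the example.

```python
import struct

# Classic 72-byte save area, 18 big-endian fullwords:
#   +0  reserved
#   +4  back chain (caller's save area)
#   +8  forward chain (callee's save area)
#   +12 R14, +16 R15, +20 R0, +24 R1, ... +68 R12
SAVE_AREA_LEN = 72


def save_area_offset(reg):
    """Byte offset of general register `reg` in a standard save area."""
    if reg in (14, 15):
        return 12 + 4 * (reg - 14)
    if 0 <= reg <= 12:
        return 20 + 4 * reg
    # R13 has no slot of its own; the caller's R13 is the back chain at +4.
    raise ValueError(f"register {reg} has no slot in a standard save area")


def parse_save_area(raw):
    """Split 72 bytes of dump storage into chain pointers and registers."""
    if len(raw) < SAVE_AREA_LEN:
        raise ValueError("need at least 72 bytes")
    words = struct.unpack(">18I", raw[:SAVE_AREA_LEN])
    regs = {r: words[save_area_offset(r) // 4] for r in (14, 15, *range(13))}
    return {"back_chain": words[1], "forward_chain": words[2], "regs": regs}


# Invented sample contents, purely for illustration (not from a real dump).
sample_words = [0, 0x00006F60, 0x00007F10, 0x8000A1C2, 0x0000B000]
sample_words += [0x1000 + r for r in range(13)]  # fake R0..R12 values
sample = struct.pack(">18I", *sample_words)

area = parse_save_area(sample)
print(f"R11 is at +{save_area_offset(11):#04x}: {area['regs'][11]:08X}")
```

The arithmetic is the part worth memorizing: R14 and R15 come first, so R0 starts at +20 and R11 sits at +64 (X'40'). A formatter can do this for you, but knowing it cold is what lets you follow a save area chain by eye.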
Learn the hex codes for at least the 150 most frequently appearing instructions on sight (and if you don't know how they work, write code that uses them until you do). In other words, learn how to disassemble instructions in your mind, on the fly, without having to look anything up, from the hex in a dump into Assembler source.

Learn the offsets of pointers to significant operating system task and data management control blocks in other control blocks. Where is the pointer to the top RB in a TCB? Where is R11 in a savearea? Where is the pointer to the UCB in a DEB, and where is the pointer to the DEB in the DCB? Is that a DCB or an ACB? Is that a DSCB or a JFCB? Where is the TIOT, and is that entry in use or not? Where are the registers in an LSE? You don't have time to go look this stuff up or go back to the formatted dump to find it. You have to recognize things and be able to follow them to get somewhere else literally on sight, or else it will take you days to read a dump that someone else can polish off in 10 minutes.

I think you first have to learn what a well-running system looks like internally. Only then can you recognize what hit the fan when something goes awry. I don't know any way to do that other than getting years of experience under your belt, doing and learning things, step by step.

But how does one learn what 20 years' experience yields doing stuff like that? Well, first, you need to spend 20 years doing stuff like that. But you have to DO it. Nobody is going to teach you everything you need to know. You have to teach yourself.

That's the way it was nearly 50 years ago when I started working with System/360 and OS/360. There was nobody around to teach any of us. Of course, at that time, dumps were printed and required no more than 27 pages (standalone or not). Later, however, a standalone dump for a program check in NIP at IPL on a 512 KB 360/75 took 274 pages and 15 minutes to print on a 1403-N1 printer. I remember we thought we were hot stuff, then.
Nothing changes. -- WB
