Don V Nielsen asks:

| Where might one find good instruction on how to read a dump?

For the most part, everybody I know who's any good at it got that way all
on their own (they may have taken a class from IBM or Amdahl back when they
were young and green, but nobody ever told me they learned it in a
classroom setting). But (in the past, mind you) I have managed to teach a
few how to do it (who later got good). Nevertheless, it was really the
students themselves and their own personal effort that brought their skill
level up to "great."

There are different types of dumps, and therefore different distributions of
dump reading skills that are needed and exist.

The focus of this list is Assembler, and if you're an Assembler or even a
high-level language (application) programmer, dealing with code that, for
the most part, runs in problem state, key 8, then (except when your code
branches out into the ozone) the skills you need to read any dump resulting
from something you did that turned out to be ill-advised are very, very
different from the skills needed to read the dumps that my group reads. When
our stuff goes South, we may or may not even be in our own code, and we may
or may not even be in one of our address spaces. Much of the storage we peek
at is not ours, and even when the bug is in our code it can frequently seem
like it's not and nothing that is close at hand in the address space to the
current PSW or registers seems to have anything to do with our playpen. The
skills needed to figure all THAT out are completely different from "why did
my code ABEND S0C4 because I loaded a bad pointer to something in R3?"

I think some of the best training can be gleaned when you are in a position
to cause your own problems, which you then have to debug by reading the
resulting dump. If you have to read a dump caused by something you didn't
do, and which you can't even fix, I think one of the essential motivations
is simply absent. On the other hand, even if you didn't do it, but you are
in a position to fix it (potentially), then there is at least some incentive
to dig and understand.

All that said, for what we do, figuring out what happened, what the sequence
of events was that led up to the failure, requires an understanding in
many, many cases of what's going on inside the operating system at the
component and interrupt level. Sure, there are the tools (like IPCS) that
one needs to learn how to use, but if one does not understand what all those
things in the Assembler Services manuals do, and what control blocks they
cause to be created, and what events they cause to happen when called, and
what all that stuff looks like when a system is running normally, then it's
all complete gobbledygook.

I can well imagine that for someone who isn't versed in MVS services and
internals, looking at a system dump using IPCS could well be like me looking
at a Windows storage dump. I've looked over the shoulders of a genuine guru
doing that, using tools and facilities that I had never before been aware
of. That made it clear to me that if I didn't have an intimate understanding
of Win32S (or whatever API was being used), even the few things that could
be formatted by some dump reading/viewing tool would forever remain obscure.
That is a completely different world, and although we have folks that are
immersed in it, that's not me. But more important to this discussion, I
think that only the most simple of the "dump reading" skills that I do have
would apply to that environment. I know how to debug code, but to debug it
from just a dump (which is frequently all you have) requires a set of skills
that are dependent on the type of bug it is and the environment in which it
manifests itself. If your own code did it, and no other code is involved
(meaning no operating system services), it just requires "debugging skills"
and the only "dump reading" skills needed are how to use the tool to move
around in the dump and find stuff. But if your code didn't do it, or your
code did do it but interacted with other code, potentially messing it up,
much more than just "debugging skills" are needed. For example, when things
go South because you stomped on somebody else's storage (or they on yours),
the bug may not even be in "your code" that you are responsible for
maintaining. The bug may be in some other product or some other component or
even in some other vendor's product (including IBM's stuff or in z/OS
itself). Sometimes we have only a dump with storage (of ours) that [we later
determined] got overlaid literally weeks or months before something failed
and the event came to human attention. Finding the footprints of something
like that in a dump usually requires luck, of course, but also a completely
different set of skills from those needed to navigate your way around
something more straightforward.

For folks that want to learn how to read system dumps that are the
consequence of failures in the operating system itself, or products like
ours that behave like they are part of the operating system, I usually
answer the question this way: First, learn how to use every service and
function that IBM documents. Write a program to use it. Then write three
more. Lather, rinse, repeat. Take or cause dumps in the middle of all that
stuff going on and see what things look like in storage when you do so.
Learn to recognize the several dozen major task, storage, data management,
and contents supervision control blocks on sight, in hex, and what things
you did (called) in your program then look like in the system trace, et al.
Learn the hex codes for at least the most frequently appearing 150
instructions on sight (and if you don't know how they work, write code that
uses them until you do). In other words, learn how to disassemble
instructions in your mind, on the fly, without having to look anything up,
from the hex in a dump into Assembler source. Learn the offsets of pointers
to significant operating system task and data management control blocks in
other control blocks.
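
To make "disassembling in your mind" a bit more concrete, here is a minimal
sketch (in Python, only so that it can be run anywhere) of the lookup you end
up doing in your head. The helper names are made up for illustration and the
opcode tables are a tiny subset covering only the RR, RX, and RS formats. The
one rule that is worth memorizing first is that the two high-order bits of
the first opcode byte give the instruction length: 00 is 2 bytes, 01 and 10
are 4 bytes, 11 is 6 bytes.

```python
# Illustrative subset only; a real dump reader knows far more opcodes than this.
RR = {0x05: "BALR", 0x07: "BCR", 0x18: "LR", 0x1A: "AR"}
RX = {0x41: "LA", 0x47: "BC", 0x50: "ST", 0x58: "L"}
RS = {0x90: "STM", 0x98: "LM"}


def instruction_length(opcode: int) -> int:
    """Length in bytes, taken from the two high-order bits of the opcode."""
    return (2, 4, 4, 6)[opcode >> 6]


def decode(hex_text: str) -> str:
    """Turn the hex of one RR, RX, or RS instruction into Assembler source."""
    code = bytes.fromhex(hex_text)
    if not code:
        raise ValueError("no instruction bytes given")
    op = code[0]
    if len(code) < instruction_length(op):
        raise ValueError("not enough bytes for this instruction")
    hi, lo = code[1] >> 4, code[1] & 0x0F  # the two 4-bit fields in byte 2
    if op in RR:
        return f"{RR[op]} {hi},{lo}"
    if op in RX or op in RS:
        base = code[2] >> 4                         # B2
        disp = ((code[2] & 0x0F) << 8) | code[3]    # D2, 12 bits
        if op in RX:
            return f"{RX[op]} {hi},{disp}({lo},{base})"   # R1,D2(X2,B2)
        return f"{RS[op]} {hi},{lo},{disp}({base})"       # R1,R3,D2(B2)
    raise ValueError(f"opcode X'{op:02X}' is not in this sketch's tables")


print(decode("90ECD00C"))  # STM 14,12,12(13)
print(decode("18CF"))      # LR 12,15
print(decode("58301004"))  # L 3,4(0,1)
```

The point is not the program. The point is that X'90ECD00C' should say
"STM 14,12,12(13)" to you before you have consciously thought about it.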

Where is the pointer to the top RB in a TCB? Where is R11 in a savearea?
Where is the pointer to the UCB in a DEB, and where is the pointer to the
DEB in the DCB? Is that a DCB or an ACB? Is that a DSCB or a JFCB? Where is
the TIOT and is that entry in use or not? Where are the registers in an LSE?
You don't have time to go look this stuff up or go back to the formatted
dump to find them. You have to recognize and be able to follow things to get
somewhere else literally on sight, else it will take days to read a dump
that someone else can polish off in 10 minutes.
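
As one small worked answer to the savearea question, here is a sketch (again
in Python, with made-up function names) that assumes the standard 72-byte,
18-word save area and not the 64-bit F4SA/F5SA formats. In that layout the
back chain is at +4, the forward chain at +8, R14 at +12 (X'0C'), R15 at +16,
and R0 through R12 follow starting at +20, which puts R11 at +64 (X'40').

```python
import struct

SAVE_AREA_LENGTH = 72  # standard 18-word save area (assumption: not F4SA/F5SA)


def register_offset(reg: int) -> int:
    """Offset of a saved register in the standard save area.

    The slots run R14, R15, R0, R1, ... R12 starting at +12. R13 has no slot
    of its own; the caller's R13 is the back chain pointer at +4.
    """
    if not 0 <= reg <= 15 or reg == 13:
        raise ValueError("R13 is not saved in a register slot")
    return 12 + 4 * ((reg - 14) % 16)


def saved_register(save_area: bytes, reg: int) -> int:
    """Pull one saved register out of the raw bytes of a save area."""
    if len(save_area) < SAVE_AREA_LENGTH:
        raise ValueError("need at least 72 bytes of save area")
    # Big-endian unsigned fullword, as it appears in the dump.
    return struct.unpack_from(">I", save_area, register_offset(reg))[0]


# Fabricated save area for illustration: word n simply contains the value n.
fake_area = b"".join(struct.pack(">I", n) for n in range(18))

print(hex(register_offset(11)))        # 0x40
print(saved_register(fake_area, 14))   # 3, i.e. the fourth word, at +12
```

Somebody who reads dumps for a living does not compute any of this. They
just look X'40' into the savearea and there is R11.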

I think you first have to learn what a well-running system looks like
internally. Only then can you recognize what hit the fan when something goes
awry. I don't know any way to do that other than getting years of experience
under your belt, doing and learning things, step by step. But how does one
learn what 20 years' experience yields doing stuff like that? Well, first,
you need to spend 20 years doing stuff like that.

But, you have to DO it. Nobody is going to teach you everything you need to
know. You have to teach yourself. That's the way it was nearly 50 years ago
when I started working with System/360 and OS/360. There was nobody around
to teach any of us. Of course, at that time, dumps were printed and required
no more than 27 pages (standalone or not). Later, a standalone dump for a
program check in NIP at IPL on a 512 KB 360/75, however, took 274 pages and
15 minutes to print on a 1403-N1 printer. I remember we thought we were hot
stuff, then. Nothing changes.

--
WB
