Not a question, just some interesting stuff for the curious people :-)
I have recently been asked to help analyze why a job was running for a very
long time and using a lot of CPU. Since all DB2 access paths were very
efficient, we started
an MA-Tune trace to see where the job was spending its time. MA-Tune reported
that the job was spending almost 100% of its time at the following statement
in module XYZ:
MOVE ARG-CNT TO DISPLAY-ARG-CNT
This did not make much sense, because that MOVE statement is not part of any
loop construct. One would have expected that MA-Tune would see the job in other
places as well. Since MA-Tune could not help, I took an SVC dump of the job
the next time I was made aware it was not ending.
Analysis of the dump showed that the job was spending almost all time
recovering from a program check 00A (decimal overflow exception). It was not
obvious whether this was a recovery loop caused by some error in the
environment, i.e. the Cobol runtime, Language Environment, or some restart
product we're using, although none of these seemed plausible.
Further investigation finally showed that the problem arises because the
target field in the above MOVE statement is not large enough to receive the
source value once it grows above 9'999'999: The source and target fields are
declared as follows:
01  ARG-CNT          PIC 9(10) BINARY.
01  DISPLAY-ARG-CNT  PIC 9(07).
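To see what the MOVE silently does once the counter outgrows the target field:
Cobol drops the excess high-order digits when moving to a smaller unsigned
numeric field. A rough Python sketch of that left-truncation (an illustration
of the effect, not of real Cobol internals):

```python
def cobol_move_truncate(value, target_digits):
    """Simulate a Cobol MOVE of an unsigned numeric value into a
    smaller unsigned numeric field: excess high-order digits are
    silently dropped, i.e. the result is value modulo 10**digits."""
    return value % 10 ** target_digits

# ARG-CNT had reached 10'181'270 when the dump was taken;
# DISPLAY-ARG-CNT is PIC 9(07), i.e. seven digits.
print(cobol_move_truncate(10_181_270, 7))  # -> 181270
```

The program keeps running with the truncated value; the only visible trace of
the overflow is the performance cost described below.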
Ignoring any overflow when the source count is greater than 9'999'999 might
be intended from a programming point of view, but it may cause a dramatic
performance degradation. The reason for the performance degradation lies in
the following facts:
· Firstly, Cobol silently ignores a decimal overflow when it occurs as
part of a MOVE (and certain other) statements.
· Secondly, the processor (hardware) supports two modes of operation
regarding decimal overflow: It can either indicate an overflow by setting the
condition code, or it can raise a program check hardware interruption of type
00A. While coding techniques such as the one in the example above rarely hurt
performance in Cobol up to version 4, they are much more likely to cause
problems with Cobol V5.2 and above. The reason is that the Cobol compiler now
makes use of modern, better-performing instructions on one hand, and that the
processor is very likely running in the mode where an overflow causes a
hardware interruption on the other hand. Every hardware interruption has to be
handled by Language Environment's error recovery routines. This recovery code
interprets the interrupt, concludes that in the case above the decimal
overflow is expected and is to be silently ignored, and thus gives control
back to the interrupted program code.
Why is this so bad from a performance perspective? The path length of the code
involved from the time the operating system recognizes the program check
hardware interruption until Language Environment returns control to the
interrupted program is quite long by itself. In addition, it depends on the
system setup; in particular, the size of the system trace table has a great
influence on the elapsed time. This system has 4 (logical) CPs, 3 (logical)
zIIPs, and a system trace table size of 15 MiB per processor. It took between
0.05 and 0.15 seconds (yes, seconds!) of elapsed time to handle a single
overflow. That time is mostly spent creating a snapshot of the 105 MiB
(7*15 MiB) of system trace tables in anticipation of a dump request. At the time
the dump was taken (I had been involved late), the source counter value was
10'181'270, i.e. an overflow exception had occurred during each of the last
181'271 executions of the above statement. Ignoring any other delays, this
alone took some 181'271 * 0.1s, i.e. roughly 5 hours of elapsed time.
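The back-of-envelope arithmetic, using the figures above (0.1s is the midpoint
of the observed 0.05-0.15s range per overflow):

```python
# Executions past the 7-digit maximum of DISPLAY-ARG-CNT:
overflows = 10_181_270 - 9_999_999   # counter value at dump time minus max
cost_per_overflow = 0.1              # seconds, midpoint of 0.05..0.15
hours = overflows * cost_per_overflow / 3600
print(overflows, round(hours, 2))    # -> 181271 5.04
```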
Regards Peter
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN