Not a question, just some interesting stuff for the curious people :-)
I have recently been asked to help analyze why a job was running for a very
long time and using a lot of CPU. Since all DB2 access paths were very
efficient, we started
an MA-Tune trace to see where the job was spending its time. MA-Tune reported
that the job was spending almost 100% of its time at the following statement
in module XYZ:
MOVE ARG-CNT TO DISPLAY-ARG-CNT
This did not make much sense, because that MOVE statement is not part of any
loop construct. One would have expected that MA-Tune would see the job in other
places as well. Since MA-Tune could not help, I took an SVC dump of the job
the next time I was made aware it was not ending.
Analysis of the dump showed that the job was spending almost all time
recovering from a program check 00A (decimal overflow exception). It was not
obvious whether this was a recovery loop caused by some error in the
environment, i.e. the Cobol runtime, Language Environment, or some restart
product we're using, although none of these seemed plausible.
Further investigation finally showed that the problem arises because the
target field in the above MOVE statement is not large enough to receive the
source value once it grows above 9'999'999: The source and target fields are
declared as follows:
01  ARG-CNT          PIC 9(10) BINARY.
01  DISPLAY-ARG-CNT  PIC 9(07).
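To see what the MOVE silently does once the counter outgrows the target field:
Cobol drops the excess high-order digits when moving to a smaller unsigned
numeric field. A rough Python sketch of that left-truncation (an illustration
of the effect, not of real Cobol internals):

```python
def cobol_move_truncate(value, target_digits):
    """Simulate a Cobol MOVE of an unsigned numeric value into a
    smaller unsigned numeric field: excess high-order digits are
    silently dropped, i.e. the result is value modulo 10**digits."""
    return value % 10 ** target_digits

# ARG-CNT had reached 10'181'270 when the dump was taken;
# DISPLAY-ARG-CNT is PIC 9(07), i.e. seven digits.
print(cobol_move_truncate(10_181_270, 7))  # -> 181270
```

The program keeps running with the truncated value; the only visible trace of
the overflow is the performance cost described below.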
Ignoring any overflow when the source count is greater than 9'999'999 might
be intended from a programming point of view, but it may cause a dramatic
performance degradation. The reason for the performance degradation lies in
the following facts:
· Firstly, Cobol silently ignores a decimal overflow when it occurs as
part of a MOVE (and certain other) statements.
· Secondly, the processor (hardware) supports two modes of operation
regarding decimal overflow: It can either indicate an overflow by setting the
condition code, or it can raise a program check hardware interruption of type
00A. While coding techniques such as the one in the example above rarely hurt
performance in Cobol up to version 4, they are much more likely to cause
problems with Cobol V5.2 and above. The reason is that the Cobol compiler now
makes use of modern, better-performing instructions on one hand, and that the
processor is very likely running in the mode where an overflow causes a
hardware interruption on the other hand. Every hardware interruption has to be
handled by Language Environment's error recovery routines. This recovery code
interprets the interrupt, concludes that in the case above the decimal
overflow is expected and is to be silently ignored, and thus gives control
back to the interrupted program code.
Why is this so bad from a performance perspective? The path length of the code
involved from the time the operating system recognizes the program check
hardware interruption until Language Environment returns control to the
interrupted program is quite long by itself. In addition, it depends on the
system setup; in particular, the size of the system trace table has a great
influence on the elapsed time. This system has 4 (logical) CPs, 3 (logical)
zIIPs, and a system trace table size of 15 MiB per processor. It took between
0.05 and 0.15 seconds (yes, seconds!) of elapsed time to handle a single
overflow. That time is mostly spent creating a snapshot of the 105 MiB
(7*15 MiB) of system trace tables in anticipation of a dump request. At the time
the dump was taken (I had been involved late), the source counter value was
10'181'270, i.e. an overflow exception had occurred during each of the last
181'271 executions of the above statement. Ignoring any other delays, this
alone took some 181'271 * 0.1s, i.e. roughly 5 hours of elapsed time.
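The back-of-envelope arithmetic, using the figures above (0.1s is the midpoint
of the observed 0.05-0.15s range per overflow):

```python
# Executions past the 7-digit maximum of DISPLAY-ARG-CNT:
overflows = 10_181_270 - 9_999_999   # counter value at dump time minus max
cost_per_overflow = 0.1              # seconds, midpoint of 0.05..0.15
hours = overflows * cost_per_overflow / 3600
print(overflows, round(hours, 2))    # -> 181271 5.04
```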
Regards Peter
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN