Very interesting analysis, Peter. We are beginning our conversion to COBOL V5.2 and this will be a great help to us.
Peter

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[email protected]] On Behalf Of Peter Hunkeler
Sent: Tuesday, May 16, 2017 7:49 AM
To: [email protected]
Subject: Long execution & high CPU usage due to decimal overflow (PGM 00A) and large system trace tables

Not a question, just some interesting stuff for the curious people :-)

I was recently asked to help analyze why a job was running very long and using a lot of CPU. Since all DB2 access paths were very efficient, we started an MA-Tune trace to see where the job was spending its time. MA-Tune reported that the job was spending almost 100% of its time at the following statement in module XYZ:

    MOVE ARG-CNT TO DISPLAY-ARG-CNT

This did not make much sense, because that MOVE statement is not part of any loop construct. One would have expected MA-Tune to see the job in other places as well.

Since MA-Tune could not help, I took an SVC dump of the job the next time I was made aware that it was not ending again. Analysis of the dump showed that the job was spending almost all of its time recovering from a program check 00A (decimal overflow exception). It was not obvious whether this was a recovery loop caused by some error in the environment, i.e. the Cobol runtime, Language Environment, or some restart product we're using, although none of these seemed plausible.

Further investigation finally showed that the problem arises because the target field in the above MOVE statement is not large enough to receive the source value once it grows above 9'999'999. The source and target fields are declared as follows:

    01  ARG-CNT          PIC 9(10) BINARY.
    01  DISPLAY-ARG-CNT  PIC 9(07).

Ignoring any overflow when the source count is greater than 9'999'999 might be intended from a programming point of view, but it may cause a dramatic performance degradation.
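To make the overflow condition concrete, here is a minimal Python sketch of the arithmetic behind that MOVE. It only models the documented COBOL rule that a numeric MOVE into a shorter field drops the high-order digits. It does not model the hardware exception or anything the compiler generates. The function names are made up for illustration.

```python
# Model of MOVE ARG-CNT TO DISPLAY-ARG-CNT, arithmetic only.
TARGET_DIGITS = 7                       # DISPLAY-ARG-CNT PIC 9(07)
TARGET_MAX = 10 ** TARGET_DIGITS - 1    # 9'999'999


def fits(value: int) -> bool:
    """True if the source value fits the 7-digit target without loss."""
    return 0 <= value <= TARGET_MAX


def truncated(value: int) -> int:
    """Value left in the target once the high-order digits are dropped."""
    return value % 10 ** TARGET_DIGITS


if __name__ == "__main__":
    for v in (9_999_999, 10_000_000):
        print(v, fits(v), truncated(v))
```

Every execution of the MOVE with a source value for which `fits` is false is a candidate for the decimal overflow described here.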
The reason for the performance degradation lies in the following facts:

* Firstly, Cobol wants to silently ignore a decimal overflow when it occurs as part of a MOVE (or certain other) statement.
* Secondly, the processor (hardware) supports two modes of operation regarding decimal overflow: it can either indicate an overflow by setting the condition code, or it can raise a program check hardware interruption of type 00A.

While coding techniques such as the one in the example above rarely hurt performance in Cobol up to version 4, they are more likely to cause problems with Cobol V5.2 and above. On the one hand, the Cobol compiler now makes use of modern, better performing instructions; on the other hand, there is a high likelihood that the processor is running in the mode where an overflow causes a hardware interruption.

Every hardware interruption has to be handled by Language Environment's error recovery routines. This recovery code interprets the interrupt and, in the case above, comes to the conclusion that the decimal overflow is expected and is to be silently ignored. It then gives control back to the interrupted program code.

Why is this so bad from a performance perspective? The path length of the code involved, from the time the operating system recognizes the program check hardware interruption until Language Environment returns control to the interrupted program, is quite long by itself. In addition, it depends on the system setup. The size of the system trace table in particular has a great influence on the elapsed time.

This system has 4 (logical) CPs, 3 (logical) zIIPs, and a system trace table size of 15 MiB per processor. It took between 0.05 and 0.15 seconds (yes, this is in seconds!) of elapsed time to handle a single overflow. That time is spent mostly on creating a snapshot of the 105 MiB (7 * 15 MiB) of system trace tables in anticipation of a dump request.
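As a rough cost model for the figures above, the following Python sketch recomputes the snapshot size and turns a per-overflow time into a throughput. The 0.1 s midpoint used in the usage note and the function names are assumptions made for illustration. The processor counts, the 15 MiB trace size, and the 0.05 to 0.15 s range come from the measurements described above.

```python
# Back-of-the-envelope model of the cost of one ignored decimal overflow.
LOGICAL_CPS = 4                 # general purpose processors on this system
LOGICAL_ZIIPS = 3               # zIIPs on this system
TRACE_MIB_PER_PROCESSOR = 15    # system trace table size per processor


def snapshot_mib() -> int:
    """Total system trace table storage snapshotted per program check."""
    return (LOGICAL_CPS + LOGICAL_ZIIPS) * TRACE_MIB_PER_PROCESSOR


def overflows_per_hour(seconds_per_overflow: float) -> float:
    """How many overflows can be handled in one hour of elapsed time."""
    return 3600.0 / seconds_per_overflow


def elapsed_hours(overflows: int, seconds_per_overflow: float) -> float:
    """Elapsed time spent purely on overflow recovery, in hours."""
    return overflows * seconds_per_overflow / 3600.0


if __name__ == "__main__":
    print(snapshot_mib(), "MiB per snapshot")
    for cost in (0.05, 0.15):   # measured range, seconds per overflow
        print(cost, "s each:", round(overflows_per_hour(cost)), "per hour")
```

At an assumed 0.1 s per overflow, the job gets through only about 36,000 executions of the MOVE per hour once the counter has passed the target's capacity.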
At the time the dump was taken (I got involved late), the source counter value was 10'181'270, i.e. an overflow exception had occurred on each of the last 181'271 executions of the above statement. Ignoring any other delay, this alone took some 181'271 * 0.1 s, i.e. some 5 hours of elapsed time.

Regards
Peter
