Well, the affected LPAR is capped at 6% CPU. As it turns out, a SMAPI process is active at 5% for one or two minutes, and in just that minute the MONWRITE machine may have had a fraction too little CPU time. Oddly enough, we also have a reporting machine processing MONITOR records. Where MONWRITE simply writes selected records, the reporting machine has some serious plumbing, yet that machine processed all records in time. (As for CPU usage, MONWRITE uses 0.07% CPU and MONREPT 0.20%; both have a share of ABS 3%.)
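To put those percentages next to the one-minute sample interval, here is a rough back-of-the-envelope sketch. It assumes every percentage is a fraction of one CPU and that the interval is 60 seconds; neither is confirmed by the monitor data itself, and the helper name `cpu_ms` is made up for illustration.

```python
# Assumption: percentages are of ONE CPU, and one monitor sample
# interval lasts 60 seconds. Both are illustrative simplifications.
INTERVAL_MS = 60 * 1000


def cpu_ms(percent, interval_ms=INTERVAL_MS):
    """CPU time in milliseconds that `percent` of one CPU yields per interval."""
    return percent / 100 * interval_ms


lpar_cap = 6.0    # LPAR cap mentioned in this message
smapi_peak = 5.0  # SMAPI process during its one or two busy minutes
headroom = lpar_cap - smapi_peak  # what is left for everything else

budget = {
    "headroom while SMAPI is busy": cpu_ms(headroom),
    "MONWRITE (0.07%)": cpu_ms(0.07),
    "MONREPT (0.20%)": cpu_ms(0.20),
    "rough guess from the quoted reply (0.5%)": cpu_ms(0.5),
}

for name, ms in budget.items():
    print(f"{name}: {ms:.0f} ms per minute")
```

Under these assumptions, MONWRITE needs about 42 ms of CPU per minute and MONREPT about 120 ms, while roughly 600 ms per minute remains for the whole LPAR when SMAPI is at its peak. That suggests CPU demand alone is small, which fits the guess that I/O time is the critical part.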
I was testing a few records in the MONITOR, so last week I added a few records to write to disk. This has probably pushed MONWRITE over the available limit, because it added just a few more milliseconds to write the data. I guess the I/O time is the critical part here. The tests haven't yet given the results I was looking for, so I guess I can remove those records from writing to disk. But it highlighted a problem in MONWRITE, so let's solve that first.

Indeed, an IPL CMS is a bit rude. But OTOH, the sole purpose of the MONWRITE machine is to write records; it doesn't process them other than to prepare them for a PUTFILES stage. So it's easier to just restart CMS. (Granted, one could also queue "PROFILE" or "MONCOLCT" to restart the process instead of an IPL.)

Meanwhile, I am testing a new version of the pipeline. The STARMON records now travel through the secondary input/output of GATE. The primary input of GATE is used to stop the pipeline, for instance upon the STARMSG STOP. As I understand it, GATE will now stop the PIPE 1) when a message arrives at the primary input or 2) when STARMON is no longer writing records into the secondary input stream.

No, nothing else updates CP MONITOR. But it's a good point: when PERSMAPI is activated it executes a profile that does update the MONITOR domains. I found that out when I first activated SMAPI and the entire reporting stopped because PERSMAPI had messed with CP MONITOR. I have copied the PERSMAPI EXEC to its A disk and removed the MONITOR statements from that exec.
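The two stop conditions described for GATE in this message can be illustrated with a small toy model. This is a Python sketch of the behaviour as I understand it, not CMS Pipelines code; the `gate` function, the event tuples and the record strings are all made up for illustration.

```python
def gate(events):
    """Toy model: pass secondary records until a primary record arrives.

    `events` is an iterable of (stream, record) pairs in arrival order,
    where stream is "primary" (the stop trigger) or "secondary" (data).
    """
    for stream, record in events:
        if stream == "primary":
            return  # condition 1: a stop message arrived, cut the line
        yield record
    # condition 2: falling out of the loop models end-of-file on the data stream


# Hypothetical arrival order: two monitor records, a STOP, then one more record.
events = [
    ("secondary", "mon record 1"),
    ("secondary", "mon record 2"),
    ("primary", "STOP"),
    ("secondary", "mon record 3"),
]

passed = list(gate(events))
print(passed)
```

In this model only the two records that arrived before the STOP are passed on. If the data source simply runs dry, the loop ends and downstream sees end-of-file in the same way.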
Met vriendelijke groet/With kind regards/Mit freundlichen Grüßen,
Berry van Sleeuwen
Flight Forum 3000
5657 EW Eindhoven

-----Original Message-----
From: CMSTSO Pipelines Discussion List <[email protected]> On Behalf Of Rob van der Heij
Sent: Friday, November 08, 2019 11:35 AM
To: [email protected]
Subject: Re: [CMS-PIPELINES] Trap error in stage STARMON

On Fri, 8 Nov 2019 at 11:09, van Sleeuwen, Berry <[email protected]> wrote:

> Hi Rob,
>
> Indeed I have rexx logic after the PIPE to restart when required (IF
> rc=313 then CP IPL CMS PARM AUTOCR). The PIPELINE has two parts. The
> first part is STARMSG to send commands to the machine. The second part
> is the STARMON processing. It selects domains/records based on a
> parameter file (such as 04 0003 User activity records) and then writes
> them to disk. The output stream for STARMON is never severed except
> when the disk is full.
>
> Looking at the performance data I can see the LPAR was at 100% CPU at
> this time so probably the machine didn't get enough CPU to process
> data in time.
> Indeed I did get the HCPMOV6274I message but I didn't copy that line.
> In fact, that's the reason I have coded the restart when the PIPE ends
> with 313 in the past. As mentioned, I have seen a couple of times in
> the past that this would have stopped the entire PIPE but now it looks
> like only the STARMON was stopped.
>
> In this case I'm looking for a way to stop the PIPELINE when the
> STARMON stage stops collecting data. Indeed maybe I can rewrite the
> logic using the GATE stage.
>

IPL CMS is a bit rude. The 313 error terminates the STARMON stage, right? That means the following business logic sees end-of-file on the input stream and should wrap up things because there's nothing more to come. You might use JEREMY to figure out which part of the pipeline is in a catch-22 (something like a CONS stage would already).
If nothing else, you could FANOUT and TAKE LAST after STARMON and use that to fire a GATE that cuts the line further down.

But your (4,3) records are sample data. You have the entire minute to finish your work. If that's not enough for this sample, it probably will not be for the next sample either. My rough guess is that a significant piece of plumbing might consume 0.5% of a CPU to keep up. You can also tell STARMON to skip some domains or just get the SAMPLE records. If you really can't get 300 ms in a minute, then your systems are a lot worse than they used to be :-)

Could it be something else is doing a MONITOR STOP and MONITOR START behind your back?

When your plumbing takes a lot more resources, you might want to talk to RITA and hear where to rework the code. Feel free to post the challenges.

Sir Rob the Plumber
