Well, the affected LPAR is capped at 6% CPU. As it turns out, a SMAPI process
is active at 5% for one or two minutes, and in just that minute the MONWRITE
machine may have had a fraction too little CPU time. Oddly enough, we also
have a reporting machine processing MONITOR records. Where MONWRITE simply
writes selected records, the reporting machine has some serious plumbing, yet
that machine processed all records in time. (As for CPU usage, MONWRITE uses
0.07% CPU and MONREPT 0.20%; both have a share ABS 3%.)
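
For scale, here is a rough sketch that converts those percentages into CPU time per sample interval, with Rob's rough guess of 0.5% for a significant piece of plumbing added for comparison. It assumes the percentages are of a single CPU and that the sample interval is 60 seconds; the function and variable names are made up for illustration.

```python
SAMPLE_INTERVAL_S = 60  # assumption: one-minute MONITOR sample interval


def cpu_ms_per_interval(percent_of_cpu: float,
                        interval_s: float = SAMPLE_INTERVAL_S) -> float:
    """Convert a CPU percentage into milliseconds of CPU time per interval."""
    return round(percent_of_cpu / 100 * interval_s * 1000, 3)


# Percentages quoted in this thread (assumed to be of one CPU).
usage = {
    "MONWRITE": 0.07,
    "MONREPT": 0.20,
    "rough guess for heavy plumbing": 0.5,
}

for name, pct in usage.items():
    print(f"{name}: {pct}% -> {cpu_ms_per_interval(pct)} ms per minute")
```

Under those assumptions MONWRITE needs about 42 ms of CPU per minute and MONREPT about 120 ms, both well below the 300 ms that the 0.5% guess works out to.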

I was testing a few MONITOR records, so last week I added them to the set that
is written to disk. That probably pushed MONWRITE over the available limit,
because it added just a few more milliseconds to write the data. I guess the
I/O time is the critical part here. The tests haven't yet given the results I
was looking for, so I guess I can stop writing those records to disk. But it
did highlight a problem in MONWRITE, so let's solve that first.

Indeed, an IPL CMS is a bit rude. But OTOH, the sole purpose of the MONWRITE
machine is to write records; it doesn't process them other than to prepare
them for a PUTFILES stage. So it's easier to just restart CMS. (Granted, one
could also queue "PROFILE" or "MONCOLCT" to restart the process instead of an
IPL.)

Meanwhile, I am testing a new version of the pipeline. The STARMON records now
travel through the secondary input/output of GATE. The primary input of GATE is
used to stop the pipeline, for instance upon the STARMSG STOP. As I understand
it, GATE will now stop the PIPE if 1) a message arrives at the primary input or
2) STARMON is no longer writing records to the secondary input stream.
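
To make that understanding concrete, here is a toy model in Python. It is not CMS Pipelines and it is not how GATE is implemented; it only mimics the two stop conditions as I described them. The stream numbers, the event list and the name gate_model are all made up for the illustration.

```python
from typing import Iterable, Iterator, Tuple

PRIMARY, SECONDARY = 0, 1  # hypothetical stream numbers for this model


def gate_model(events: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Toy model of the behaviour described above, not the real GATE stage.

    `events` is a time-ordered sequence of (stream, record) pairs.
    Secondary records are passed on until a record arrives on the
    primary stream, or until the events run out (standing in for
    STARMON no longer writing records).
    """
    for stream, record in events:
        if stream == PRIMARY:
            return        # stop request arrived: pass nothing further
        yield record      # secondary record: pass it downstream


# A STOP on the primary stream cuts off everything that follows it.
events = [(SECONDARY, "rec1"), (SECONDARY, "rec2"),
          (PRIMARY, "STOP"), (SECONDARY, "rec3")]
print(list(gate_model(events)))
```

In this model, rec1 and rec2 are passed and rec3 is not; with no STOP at all, the model simply ends when the records run out.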

No, nothing else updates CP MONITOR. But it is a good point: when PERSMAPI is
activated it executes a profile that indeed updates the MONITOR domains. I
found that out when I first activated SMAPI and the entire reporting stopped
because PERSMAPI had messed with CP MONITOR. I have copied the PERSMAPI EXEC
to its A disk and removed the MONITOR statements from that exec.

With kind regards,
Berry van Sleeuwen
Flight Forum 3000 5657 EW Eindhoven

-----Original Message-----
From: CMSTSO Pipelines Discussion List <[email protected]> On Behalf 
Of Rob van der Heij
Sent: Friday, November 08, 2019 11:35 AM
To: [email protected]
Subject: Re: [CMS-PIPELINES] Trap error in stage STARMON

On Fri, 8 Nov 2019 at 11:09, van Sleeuwen, Berry <[email protected]>
wrote:

> Hi Rob,
>
> Indeed I have REXX logic after the PIPE to restart when required (IF
> rc=313 then CP IPL CMS PARM AUTOCR). The PIPELINE has two parts. The
> first part is STARMSG to send commands to the machine. The second part
> is the STARMON processing. It selects domains/records based on a
> parameter file (such as 04 0003 User activity records) and then writes
> them to disk. The output stream for STARMON is never severed except
> when the disk is full.
>
> Looking at the performance data I can see the LPAR was at 100% CPU at
> this time so probably the machine didn't get enough CPU to process data
> in time. Indeed I did get the HCPMOV6274I message but I didn't copy that
> line. In fact, that's the reason I have coded the restart when the PIPE
> ends with 313 in the past. As mentioned, I have seen a couple of times
> in the past that this would have stopped the entire PIPE but now it
> looks like only the STARMON was stopped.
>
> In this case I'm looking for a way to stop the PIPELINE when the
> STARMON stage stops collecting data. Indeed maybe I can rewrite the
> logic using the GATE stage.
>

IPL CMS is a bit rude. The 313 error terminates the STARMON stage, right?
That means the following business logic sees end-of-file on the input stream 
and should wrap up things because there's nothing more to come. You might use 
JEREMY to figure out which part of the pipeline is in a catch-22 (something 
like a CONS stage would already). If nothing else, you could FANOUT and TAKE 
LAST after STARMON and use that to fire a GATE that cuts the line further down.

But your (4,3) records are sample data. You have the entire minute to finish
your work. If that's not enough for this sample, it probably will not be for
the next sample either. My rough guess is that a significant piece of plumbing
might consume 0.5% of a CPU to keep up. You can also tell STARMON to skip some
domains or just get the SAMPLE records. If you really can't get 300 ms in a
minute, then your systems are a lot worse than they used to be :-)  Could it be
something else is doing a MONITOR STOP and MONITOR START behind your back?

When your plumbing takes a lot more resources, you might want to talk to RITA 
and hear where to rework the code. Feel free to post the challenges.

Sir Rob the Plumber