Working with IBM Level 2 (or maybe 3), we now understand what was causing the
excessive CPU time used by the distributed component of Tivoli Workload
Scheduler. Let me review the scenario:
We run an IBM 2096-O02 (36 MSU) and a 2096-T03 (95 MSU) in a
Parallel Sysplex, where TWS runs on the "O02" system (the smaller of the two).
TWS schedules work in the Parallel Sysplex, and there is also a distributed
component scheduling for 3-4 Windows servers. The history here is
interesting: TWS had its roots in an IBM product called OPC (Operations
Planning and Control), which did z/OS and distributed scheduling using
"Trackers". It worked very well using little CPU time. OPC morphed into Tivoli
and became TWS for z/OS, and IBM bought a company whose Maestro product did
distributed scheduling. The two products were merged and Trackers went away;
it took IBM a few years to fully integrate the two. That brings us to the
present and the performance issues we encountered.
TWS for z/OS runs in one started task; the distributed component runs in a
separate started task called TWSE2E. TWSE2E was seen taking about 3 MSUs
worth of the O02, on a system that used to run around 28-29 MSUs max in a
month. IBM researched the issue and came back with an explanation that, as
far as we can find, is not highlighted in any of the Tivoli manuals. TWSE2E
executes its programs in the O02's USS environment and has files defined in a
zFS file system. If that zFS file system is not owned by the LPAR where
TWS is running, all the I/O must go through XCF in the Parallel Sysplex,
generating the extraordinary amount of CPU time we saw charged to
TWSE2E in that LPAR. The recommendation now is to always have the zFS file
system owned by the LPAR where TWS is operating (otherwise TWSE2E will
eat your lunch, dinner, etc.). When we moved ownership of TWS's zFS file
system back to the TWS LPAR, the CPU consumption dropped to almost nothing.
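For anyone chasing the same thing, checking and fixing this is a matter of
ordinary console commands. A minimal sketch, assuming a shared file system
configuration; the data set name OMVS.TWS.ZFS and system name SYSO02 are
made-up placeholders, not our actual names:

   D OMVS,F
      (displays mounted file systems and which system owns each one)

   SETOMVS FILESYS,FILESYSTEM='OMVS.TWS.ZFS',SYSNAME=SYSO02
      (moves ownership of that zFS to SYSO02, the LPAR where TWS runs)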
I can understand the recommendation, and it raises some considerations
to ponder:
1. When a TWS LPAR is taken down, ownership of its zFS file system is
automagically transferred to some other LPAR, and it is not your choice which
one (another interesting discussion could follow this line). So when the TWS
LPAR is IPL'ed, operationally one must ensure the proper commands are issued
to bring back ownership of TWS's zFS file system (the BPXPRMxx sketch after
this list shows one way to influence where ownership lands).
2. One can implement all of #1 in "Automation" if one is running some sort
of automation package; if not, this is a good case for getting one.
3. Keep in mind this is not a Parallel Sysplex problem but a zFS challenge.
4. I just have to wonder: if all this is caused by TWSE2E's I/O having to
go through XCF to reach the other LPAR where the zFS is owned, why don't we
see the WAIT normally associated with I/O instead of the heavy, heavy CPU
load this I/O causes (for 3-4 Windows servers that get about 30-40 jobs per
day)?
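On #1 and #2: in a shared file system configuration, the BPXPRMxx MOUNT
statement itself gives some control over where ownership lands. Another
sketch with made-up names (SYSO02, SYST03, and the mount point are
hypothetical); SYSNAME sets the preferred owning system at mount time, and an
AUTOMOVE system list restricts where ownership may move if that system goes
down:

   MOUNT FILESYSTEM('OMVS.TWS.ZFS')
         TYPE(ZFS)
         MOUNTPOINT('/tws')
         SYSNAME(SYSO02)          /* preferred owning system            */
         AUTOMOVE(INCLUDE,SYST03) /* system(s) allowed to take over     */
                                  /* ownership if SYSO02 goes down      */

Even so, nothing moves ownership back automatically when the preferred system
re-IPLs, which is where the automation package in #2 (or an operator issuing
the SETOMVS command above) comes in.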
Note: I just have to believe there is more to the story; it may not be a TWS
problem but rather TWS exploiting something in USS and zFS that is a bad
design.
POSTSCRIPT: Things are back to using an acceptable amount of CPU, and
everyone is older and wiser.
Jim
P.S. Wonder how many other z/OS USS implementations are using excessive
CPU because of the ownership of some zFS file system. Will be on the watch
for something like it in the future.