Working with IBM Level 2 (or maybe 3), we now understand what was causing the
excessive CPU time used by the distributed component of Tivoli Workload
Scheduler. Let me review the scenario:
We run an IBM 2096-O02 (36 MSU) and a 2096-T03 (95 MSU) in a
Parallel Sysplex, where TWS runs on the "O02" system (the smaller of the two).
TWS schedules work in the Parallel Sysplex, and there is also a distributed
component scheduling for 3-4 Windows servers. The history here is
interesting: TWS had its roots in an IBM product called OPC (Operations
Planning and Control), which did z/OS and distributed scheduling using
"Trackers". It worked very well using little CPU time. OPC morphed into Tivoli
and became TWS for z/OS, and IBM bought a company whose Maestro product did
distributed scheduling. The two products were merged and Trackers went away;
it took IBM a few years to fully integrate the two. That brings us to the
present and the performance issues we encountered.
TWS for z/OS runs in one started task; the distributed component runs in a
separate started task called TWSE2E. TWSE2E was seen taking about 3 MSUs
worth of the O02, on a system that used to run around 28-29 MSUs max in a
month. IBM researched the issue and came back with an explanation that, as
far as we can find, is not highlighted in any of the Tivoli manuals. TWSE2E
executes its programs in the O02's USS environment and has files defined in a
zFS file system. If that zFS file system is not owned by the LPAR where
TWS is running, all the I/O must go through XCF in the Parallel Sysplex,
generating the extraordinary amount of CPU time we saw charged to
TWSE2E in that LPAR. The recommendation now is to always have the zFS file
system owned by the LPAR where TWS is operating (otherwise TWSE2E will
eat your lunch, dinner, etc.). When we moved ownership of TWS's zFS file
system back to the TWS LPAR, the CPU consumption dropped to almost nothing.
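For anyone chasing the same thing, checking and fixing this is a matter of
ordinary console commands. A minimal sketch, assuming a shared file system
configuration; the data set name OMVS.TWS.ZFS and system name SYSO02 are
made-up placeholders, not our actual names:

   D OMVS,F
      (displays mounted file systems and which system owns each one)

   SETOMVS FILESYS,FILESYSTEM='OMVS.TWS.ZFS',SYSNAME=SYSO02
      (moves ownership of that zFS to SYSO02, the LPAR where TWS runs)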
I can understand the recommendation, and it raises some considerations
to ponder:
1. When a TWS LPAR is taken down, ownership of its zFS file system is
automagically transferred to some other LPAR, and it is not your choice which
one (another interesting discussion could follow this line). So when the TWS
LPAR is IPL'ed, operationally one must ensure the proper commands are issued
to bring back ownership of TWS's zFS file system (the BPXPRMxx sketch after
this list shows one way to influence where ownership lands).
2. One can implement all of #1 in "Automation" if one is running some sort
of automation package; if not, this is a good case for getting one.
3. Keep in mind this is not a Parallel Sysplex problem but a zFS challenge.
4. I just have to wonder: if all this is caused by TWSE2E's I/O having to
go through XCF to reach the other LPAR where the zFS is owned, why don't we
see the WAIT normally associated with I/O instead of the heavy, heavy CPU
load this I/O causes (for 3-4 Windows servers that get about 30-40 jobs per
day)?
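On #1 and #2: in a shared file system configuration, the BPXPRMxx MOUNT
statement itself gives some control over where ownership lands. Another
sketch with made-up names (SYSO02, SYST03, and the mount point are
hypothetical); SYSNAME sets the preferred owning system at mount time, and an
AUTOMOVE system list restricts where ownership may move if that system goes
down:

   MOUNT FILESYSTEM('OMVS.TWS.ZFS')
         TYPE(ZFS)
         MOUNTPOINT('/tws')
         SYSNAME(SYSO02)          /* preferred owning system            */
         AUTOMOVE(INCLUDE,SYST03) /* system(s) allowed to take over     */
                                  /* ownership if SYSO02 goes down      */

Even so, nothing moves ownership back automatically when the preferred system
re-IPLs, which is where the automation package in #2 (or an operator issuing
the SETOMVS command above) comes in.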
Note: I just have to believe there is more to the story; it may not be a TWS
problem but rather TWS exploiting something in USS and zFS that is a bad
design.
POSTSCRIPT: Things are back to using an acceptable amount of CPU, and
everyone is older and wiser.
Jim
P.S. Wonder how many other z/OS USS implementations are using excessive
CPU because of the ownership of some zFS file system. Will be on the watch
for something like it in the future.