Search390.com Developer Tip
September 5, 2001

========================================================
SPONSORED BY: Storage Decisions 2001
========================================================
TIME IS RUNNING OUT to qualify to attend the FREE Storage Decisions 2001 conference in Chicago September 26-28. Learn to set strategy and make the smartest decisions to most effectively manage your storage initiatives for 2002. Pre-qualified audience of your peers and NO vendor sales pitches. Don't get shut out of this critical event. Visit
http://ad.doubleclick.net/clk;3230190;5058249;c?http://www.StorageDecisions2001.com today!
========================================================

============================================
search390 TIPS CONTEST
============================================
Have a great tip of your own that you'd like to share with the rest of us? Then send it in! September's tip of the month contest has started. This month's prize is an Olympus Stylus Epic Zoom camera. To submit your tip, or to check out the rules and the new prize, go to
http://search390.techtarget.com/tipsSubmit/1,289485,sid10,00.html

Congratulations to Steven Pritikin, our winner for July/August!
============================================

TODAY'S DEVELOPER TIP:
Code rage - Getting blamed for traffic (CPU) slowdown
Jim Keohane

You are incensed that the Strobe cpu performance monitor flags your simple and elegant code as a performance bottleneck. But it is. You can't believe that someone else's horribly obtuse and convoluted revisions to your code, though much faster, are an acceptable answer. You positively cringe at the future maintenance issues after seeing the assembler routine version, though fastest by far, that was lovingly handcrafted by the resident bit-twiddler in a non-intuitive use of esoteric machine instructions.

Calm down. Take a chill pill. In some cases, you have to make a trade-off. It's like the Unix programmer's joke definition: MVS: Operating system, some assembly required.

The following code examples start with a straightforward implementation in a HLL (I picked SAS/C but could have used COBOL or PL/I or ...) of a simple algorithm that could have a cpu performance impact depending on the application. The algorithm simply calculates a checksum for a buffer by starting with X'00' and XOR'ing it with all the bytes. An XOR (Exclusive OR) operates bit by bit, producing a 1 bit only where the XORee bit and the XORer bit differ. Put another way, each XORee bit is either left unchanged or toggled in the result, depending on whether the corresponding XORer bit is 0 or 1.

e.g. XOR'ing X'F0F0' with X'FF00' gives X'0FF0'
     XOR'ing B'11001100' with B'11110000' gives B'00111100'

Here's the simple, elegant C implementation:

  char *bfr, c = 0;
  int len;
  // insert code here to set len and set bfr to point to data needing a checksum
  while (len--)
    c ^= bfr[len];   // note XOR from right to left is OK. Order not important

You can't get much simpler than the one line of executable code above (a complete, compilable sketch of this version appears below). But let's assume the buffers are huge and/or the checksum is done so frequently that a significant cpu overhead is detected. In my case the simple code took 1/4 cpu second.

Consider three very broad categories of S/390 machine instructions: RR (register to register), RS (register to storage) and SS (storage to storage). RR involves no memory access, and its cpu time is often not even measurable. SS accesses memory for both operands and takes a big cpu hit, especially as the size of the memory accessed increases. RS is somewhere in between.
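For readers who want to compile and play with the baseline, here is the byte-by-byte version as a self-contained sketch in portable, standard C rather than the SAS/C fragment above. The function name and test buffer are mine, purely for illustration; they are not part of the original tip.

  #include <stddef.h>

  /* Minimal sketch of the byte-by-byte XOR checksum.
     Name and test data are illustrative only. */
  unsigned char xor_checksum(const unsigned char *bfr, size_t len)
  {
      unsigned char c = 0;
      while (len--)
          c ^= bfr[len];   /* right to left is fine; XOR order doesn't matter */
      return c;
  }

  int main(void)
  {
      unsigned char data[] = { 0xF0, 0xF0, 0xFF, 0x00 };
      /* 0xF0 ^ 0xF0 ^ 0xFF ^ 0x00 = 0xFF */
      return xor_checksum(data, sizeof data) == 0xFF ? 0 : 1;
  }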
Assuming len is 1 million, the simple example above, even with the best compiler optimization*, involves a million-plus storage accesses (RS?). A peek at the generated assembler may be informative.

Let's try some tweaking of the C code first. Hmmmm. Instead of a million byte accesses, how about 250,000 word accesses? For protection we can either force 'len' to always be a multiple of 4 or, better still, just make sure the buffer always has at least 3 extra unused bytes at the end. Those extra bytes can be set to X'00' so their inclusion in the algorithm doesn't affect the checksum. Remember, an XOR of a value by ZERO leaves the value unchanged. For the nit-pickers we can also ensure the buffer starts on a word boundary.

Here's the revised word (not byte) C implementation:

  char *bfr, c;
  int *wbfr, len, j, k;
  // insert code here to set len and pad buffer with zeros
  k = 0;
  wbfr = (int *) bfr;
  j = (len + 3) / 4;         // number of words
  while (j--)
    k ^= wbfr[j];            // k is word containing XOR of all words
  c = (char) k;              // set c to XOR of 4 bytes in k
  c ^= (char) (k >> 8);
  c ^= (char) (k >> 16);
  c ^= (char) (k >> 24);

The above took 1/24 cpu second. So, doing fewer storage accesses that each handle more bytes is a boost. Do you see a trend? Different languages and different compilers produce different generated machine instructions. Still, we can assume the above generates at least one RS per word in the buffer.

We could try quadwords! There's a Load Multiple (LM) instruction that can load 16 bytes into 4 registers in one instruction. It can actually do up to 16 registers, but let's start with 4. Here's the revised quadword C implementation:

  char *bfr, c;
  int len, j, k = 0;
  // insert code here to set len and pad buffer with zeros
  j = (len + 15) / 16;
  while (j--) {
    _ldregs(R1,bfr);
    LM(0,3,0+b(1));
    XR(0,1);
    XR(0,2);
    XR(0,3);
    k ^= _stregs(R0);
    bfr += 16;
  }
  c = (char) k;
  c ^= (char) (k >> 8);
  c ^= (char) (k >> 16);
  c ^= (char) (k >> 24);

It's now down to 1/45 cpu second. The funny-looking code in the WHILE loop is SAS/C inline assembler. There are trade-offs between inline coding and calling a separately coded assembler subroutine. SAS/C allows the inline code to muck with registers R14, R15, R0, R1, R2 and R3. That also means those registers are not available to the SAS/C optimization logic in that area of the code, which may or may not be significant. A separate assembler routine could use all the registers at the cost of saving and restoring them. I have a maintenance preference for inline when the code is simple and small and where the cpu benefits are close enough to, or better than, calling an assembler subroutine.

So doing even fewer storage accesses that each deal with bigger chunks of storage is a further benefit. Are we stuck now? Nope. Here comes the non-intuitive part. How about using the horrible SS instructions? There's an XC instruction that XOR's one storage location with another for up to 256 bytes. Here's the revised 256-byte chunk C implementation. It accumulates a 256-byte checksum in the array C, takes a word checksum (k) from the words of C, and finally folds that word down to the byte checksum:

  char *bfr, c;
  int C[64], len, j, k = 0;
  // insert code here to set len and pad buffer with zeros
  memset(C,0,256);           // start with zero checksum in 256 bytes or 64 words
  j = (len + 255) / 256;
  while (j--) {
    _ldregs(R1+R2,C,bfr);
    XC(0+b(1),256,0+b(2));
    bfr += 256;
  }
  for (j = 0; j < 64; j++)
    k ^= C[j];               // accumulate word checksum in k
  c = (char) k;
  c ^= (char) (k >> 8);
  c ^= (char) (k >> 16);
  c ^= (char) (k >> 24);

The XC version took a little over 1/100 cpu second!
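The fold from a word checksum down to a byte checksum, and the question of what to do when the length is not a multiple of the chunk size, come up in every one of the wider variants above. As a purely portable illustration of the same "XOR wider chunks, then fold" idea, here is a hypothetical standard-C sketch of my own; it handles the leftover bytes directly instead of padding (an option the note near the end of this tip also mentions), and it obviously cannot reproduce the LM or XC instruction savings of the SAS/C versions.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Portable sketch of "XOR wider chunks, then fold down to a byte".
     Name and structure are illustrative, not from the original tip. */
  unsigned char xor_checksum_wide(const unsigned char *bfr, size_t len)
  {
      uint32_t acc = 0, word;
      size_t i = 0;

      /* whole words first: each storage access covers 4 bytes */
      for (; i + sizeof word <= len; i += sizeof word) {
          memcpy(&word, bfr + i, sizeof word);   /* sidesteps alignment issues */
          acc ^= word;
      }

      /* leftover bytes, instead of padding the buffer with zeros */
      for (; i < len; i++)
          acc ^= (uint32_t) bfr[i];

      /* fold the word accumulator down to a single byte */
      acc ^= acc >> 16;
      acc ^= acc >> 8;
      return (unsigned char) acc;
  }

Because XOR is associative and commutative, the fold produces the same byte no matter how the bytes were grouped along the way.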
We earlier noted that the XOR'ing order was unimportant. It is also intuitively obvious that the width of the XOR operation (bit, byte, word, etc.) is irrelevant as long as we take care to end up with the desired width AND as long as every source bit/byte/word/whatever participates just once. What is non-intuitive is that a reduced number of the most offending SS instructions may actually be the performance answer in some cases. The XC (Exclusive Or Character) instruction's results are even more surprising in that its storage access involves both a READ and a WRITE, and WRITEs have an even bigger cpu impact. Welcome to the world of S/390 cpu performance! {smile}

I tried variations doing multiple LM's or XC's within the inline code. I also moved the loop count (j) within the inline, like:

  _ldregs(R1+R2+R3,C,bfr,j);  // pass counter j to the inline loop
  BALR(14,0);                 // set R14 to the address of the XC (the loop target)
  XC(0+b(1),256,0+b(2));
  LA(2,0,256+b(2));           // bump to next 256-byte chunk
  BCTR(3,14);                 // decrement reg 3 (was j) and loop back to the XC if any chunks remain

Now it's better than 1/100 cpu second.

You probably don't fight to keep the original pristine code. Neither do you go for the most convoluted example of obfuscated code just because it's marginally the fastest (unless you're a consultant). You trade off performance vs. maintainability while weighing the cpu improvement as a percentage of the total. The XC loop example above was 25 times faster than the original source. Still, I might very well leave the original code (or the C-only word version) if we were only talking about a small overall cpu improvement.

Let's get cute. Suppose the checksum logic is needed both to verify received data and to generate a new checksum if the data is modified. You may have some nifty shortcuts depending on the nature of the changes made. For example, if data is appended or inserted, you can use the prior checksum with the new bytes XOR'd into it. If a chunk of data is removed, you can take the old checksum and XOR it with the data being removed! Remember, XOR with 0 has no effect and XOR with 1 does a toggle. That means an additional XOR with a byte that was already a component of the checksum effectively undoes its toggles; the resulting checksum is as if those bytes were never included in the first place. It's faster to do a few XOR's when 10 or 20 bytes are changed, added or removed rather than redo the entire buffer, no?
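As a concrete illustration of that shortcut, here is a hypothetical helper in plain C (the function name is mine, not from the tip). XOR'ing bytes into an existing checksum adds their contribution, and XOR'ing in bytes that were already counted toggles that contribution back out, so the very same routine serves for both appends and removals.

  #include <stddef.h>

  /* Hypothetical sketch of incremental checksum maintenance. */
  unsigned char xor_checksum_update(unsigned char old_sum,
                                    const unsigned char *bytes, size_t len)
  {
      while (len--)
          old_sum ^= bytes[len];   /* the same XOR adds or removes bytes */
      return old_sum;
  }

  /* Usage sketch:
     new_sum = xor_checksum_update(old_sum, appended_bytes, n);  // bytes added
     new_sum = xor_checksum_update(old_sum, removed_bytes, n);   // bytes removed
     For a changed byte, XOR out the old value and XOR in the new one. */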
The bottom line is that non-intuitive approaches are, unfortunately, worth taking on occasion, and they may require assembler coding. Dropping to assembler does not always help performance. After all, that's what the compiler is doing for you anyway. {smile} But dropping to assembler can be the answer where the need is sufficient and there's no other way to effectively implement a non-intuitive approach.

Note: If your application deals with very large buffers, the padding overhead from "extra" XOR'ing won't be worth worrying about. However, you can do a little tweaking here and there. It might be worthwhile to drop the padding idea and just do the last partial chunk using the earlier byte-by-byte or word-by-word approach.

*best compiler optimization: I've worked on compiler optimizers enough not to completely rule out their ability to recognize the required side effects of the simple while loop and, with an appreciation of the underlying machine instruction set, produce the result by a radically different method. I just haven't seen it yet. The optimizer would also have to recognize that the code was significant enough to be worth changing. Perhaps it could if 'len' were hardcoded with a large value and/or the optimizer could somehow determine (at compile time?) how frequently the algorithm would be invoked at runtime. More likely, future optimizers will incorporate statistics gathering into the generated code and have those runtime stats fed back into recompile iterations.

------------------------------------------------------
About the author: Jim Keohane ([EMAIL PROTECTED]) is president of New York consulting company Multi-Platforms, Inc. His company specializes in commercial software development and consulting, with an emphasis on cross-platform and performance issues.

Did you like this tip? Send us an email ([EMAIL PROTECTED]) to let us know your thoughts, or rate this tip by scrolling to the bottom of
http://www.search390.com/tip/1,289483,sid10_gci765019,00.html

=======================================================
search390 Featured Topic: Database management
=======================================================
Maintaining corporate data is a daunting task. Getting that data to the Web is another challenge many DBAs face. When it comes to database-Web connectivity, which system do you think meets the challenge best: IMS or DB2?
http://search390.techtarget.com/featuredTopic/0,290042,sid10_gci764289,00.html
=======================================================

=======================================================
Share the knowledge!
=======================================================
Have COBOL-related questions? Then be sure to join us tomorrow in the Developer discussion forum. Tom Ross of IBM will be live in the Developer forum tomorrow from 1-2 pm EDT and again from 4-5 pm EDT to answer your programming questions live! Don't miss the chance to get instant answers to your code questions from a leading COBOL expert! If you'd like to start posting questions for Tom now, go to:
http://search390.discussions.techtarget.com/WebX?[EMAIL PROTECTED]^[email protected]

See you there! Also check out our other search390 forums on operating systems, e-business, and sound off (aka miscellaneous!). There's sure to be something there to spark your interest.
http://search390.discussions.techtarget.com/WebX?[EMAIL PROTECTED]^0@/search390

And don't be left out of the loop when it comes to the latest S/390 news. Whether you're a Developer, Systems Manager, or busy Web Enabling the mainframe, you need to stay abreast of the latest S/390 news and technologies. As a member of search390, you can get the News delivered right to your inbox. To try out this free e-mail service for yourself, just update your user profile at
http://search390.techtarget.com/register/1,,sid10,00.html and be sure you select the "News" checkbox.

========================================================
Disclaimer: Our tips exchange is a forum for you to share technical advice and expertise with your peers and to learn from other IT professionals. TechTarget.com provides the infrastructure to facilitate this sharing of information. However, we can't guarantee the accuracy and validity of the material submitted. You agree that your use of the ask the expert services and your reliance on any questions, answers, information or other materials received through the web site will be at your own risk.
========================================================

========================================================
If you would like to sponsor this or any TechTarget newsletter, please contact Tim Keough at [EMAIL PROTECTED]
========================================================

If you no longer wish to receive this newsletter simply reply to this message with "REMOVE" in the subject line. Or, visit
http://search390.techtarget.com/register
and adjust your subscriptions accordingly. If you choose to unsubscribe using our automated processing, you must send the "REMOVE" request from the email account to which this newsletter was delivered. Please allow 24 hours for your "REMOVE" request to be processed.
