Search390.com Developer Tip
September 5, 2001

========================================================
SPONSORED BY: Storage Decisions 2001
========================================================
TIME IS RUNNING OUT to qualify to attend the FREE Storage Decisions 2001 conference in Chicago September 26-28. Learn to set strategy and make the smartest decisions to most effectively manage your storage initiatives for 2002. Pre-qualified audience of your peers and NO vendor sales pitches. Don't get shut out of this critical event. Visit
http://ad.doubleclick.net/clk;3230190;5058249;c?http://www.StorageDecisions2001.com today!
========================================================

============================================
search390 TIPS CONTEST
============================================
Have a great tip of your own that you'd like to share with the rest of us? Then send it in! September's tip of the month contest has started. This month's prize is an Olympus Stylus Epic Zoom camera. To submit your tip, or to check out the rules and the new prize, go to
http://search390.techtarget.com/tipsSubmit/1,289485,sid10,00.html

Congratulations to Steven Pritikin, our winner for July/August!
============================================

TODAY'S DEVELOPER TIP:
Code rage - Getting blamed for traffic (CPU) slowdown
Jim Keohane

You are incensed that the Strobe cpu performance monitor flags your simple and elegant code as a performance bottleneck. But it is. You can't believe that someone else's horribly obtuse and convoluted revisions to your code, though much faster, are an acceptable answer. You positively cringe at the future maintenance issues after seeing the assembler routine version, though fastest by far, that was lovingly handcrafted by the resident bit-twiddler in a non-intuitive use of esoteric machine instructions.

Calm down. Take a chill pill. In some cases, you have to make a trade-off. It's like the Unix programmer's joke definition: MVS: Operating system, some assembly required.

The following code examples start with a straightforward implementation in a HLL (I picked SAS/C but could have used COBOL or PL/I or ...) of a simple algorithm that could have a cpu performance impact depending on the application. The algorithm simply calculates a checksum for a buffer by starting with X'00' and XOR'ing it with all the bytes. An XOR (Exclusive OR) operates bit by bit, producing a 1 bit only where the XORee bit and the XORer bit differ. Put another way, each XORee bit is either left unchanged or toggled in the result, depending on whether the corresponding XORer bit is 0 or 1.

e.g. XOR'ing X'F0F0' with X'FF00' gives X'0FF0'
     XOR'ing B'11001100' with B'11110000' gives B'00111100'

Here's the simple, elegant C implementation:

  char *bfr, c = 0;
  int len;
  // insert code here to set len and set bfr to point to data needing a checksum
  while (len--)
    c ^= bfr[len];   // note XOR from right to left is OK. Order not important

You can't get much simpler than the one line of executable code above (a complete, compilable sketch of this version appears below). But let's assume the buffers are huge and/or the checksum is done so frequently that a significant cpu overhead is detected. In my case the simple code took 1/4 cpu second.

Consider three very broad categories of S/390 machine instructions: RR (register to register), RS (register to storage) and SS (storage to storage). RR involves no memory access, and its cpu time is often not even measurable. SS accesses memory for both operands and takes a big cpu hit, especially as the size of the memory accessed increases. RS is somewhere in between.
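For readers who want to compile and play with the baseline, here is the byte-by-byte version as a self-contained sketch in portable, standard C rather than the SAS/C fragment above. The function name and test buffer are mine, purely for illustration; they are not part of the original tip.

  #include <stddef.h>

  /* Minimal sketch of the byte-by-byte XOR checksum.
     Name and test data are illustrative only. */
  unsigned char xor_checksum(const unsigned char *bfr, size_t len)
  {
      unsigned char c = 0;
      while (len--)
          c ^= bfr[len];   /* right to left is fine; XOR order doesn't matter */
      return c;
  }

  int main(void)
  {
      unsigned char data[] = { 0xF0, 0xF0, 0xFF, 0x00 };
      /* 0xF0 ^ 0xF0 ^ 0xFF ^ 0x00 = 0xFF */
      return xor_checksum(data, sizeof data) == 0xFF ? 0 : 1;
  }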
Assuming len is 1 million, the simple example above, even with the best compiler optimization*, involves a million-plus storage accesses (RS?). A peek at the generated assembler may be informative.

Let's try some tweaking of the C code first. Hmmmm. Instead of a million byte accesses, how about 250,000 word accesses? For protection we can either force 'len' to always be a multiple of 4 or, better still, just make sure the buffer always has at least 3 extra unused bytes at the end. Those extra bytes can be set to X'00' so their inclusion in the algorithm doesn't affect the checksum. Remember, an XOR of a value by ZERO leaves the value unchanged. For the nit-pickers we can also ensure the buffer starts on a word boundary.

Here's the revised word (not byte) C implementation:

  char *bfr, c;
  int *wbfr, len, j, k;
  // insert code here to set len and pad buffer with zeros
  k = 0;
  wbfr = (int *) bfr;
  j = (len + 3) / 4;         // number of words
  while (j--)
    k ^= wbfr[j];            // k is word containing XOR of all words
  c = (char) k;              // set c to XOR of 4 bytes in k
  c ^= (char) (k >> 8);
  c ^= (char) (k >> 16);
  c ^= (char) (k >> 24);

The above took 1/24 cpu second. So, doing fewer storage accesses that each handle more bytes is a boost. Do you see a trend? Different languages and different compilers produce different generated machine instructions. Still, we can assume the above generates at least one RS per word in the buffer.

We could try quadwords! There's a Load Multiple (LM) instruction that can load 16 bytes into 4 registers in one instruction. It can actually do up to 16 registers, but let's start with 4. Here's the revised quadword C implementation:

  char *bfr, c;
  int len, j, k = 0;
  // insert code here to set len and pad buffer with zeros
  j = (len + 15) / 16;
  while (j--) {
    _ldregs(R1,bfr);
    LM(0,3,0+b(1));
    XR(0,1);
    XR(0,2);
    XR(0,3);
    k ^= _stregs(R0);
    bfr += 16;
  }
  c = (char) k;
  c ^= (char) (k >> 8);
  c ^= (char) (k >> 16);
  c ^= (char) (k >> 24);

It's now down to 1/45 cpu second. The funny-looking code in the WHILE loop is SAS/C inline assembler. There are trade-offs between inline coding and calling a separately coded assembler subroutine. SAS/C allows the inline code to muck with registers R14, R15, R0, R1, R2 and R3. That also means those registers are not available to the SAS/C optimization logic in that area of the code, which may or may not be significant. A separate assembler routine could use all the registers at the cost of saving and restoring them. I have a maintenance preference for inline when the code is simple and small and where the cpu benefits are close enough to, or better than, calling an assembler subroutine.

So doing even fewer storage accesses that each deal with bigger chunks of storage is a further benefit. Are we stuck now? Nope. Here comes the non-intuitive part. How about using the horrible SS instructions? There's an XC instruction that XOR's one storage location with another for up to 256 bytes. Here's the revised 256-byte chunk C implementation. It accumulates a 256-byte checksum in the array C, takes a word checksum (k) from the words of C, and finally folds that word down to the byte checksum:

  char *bfr, c;
  int C[64], len, j, k = 0;
  // insert code here to set len and pad buffer with zeros
  memset(C,0,256);           // start with zero checksum in 256 bytes or 64 words
  j = (len + 255) / 256;
  while (j--) {
    _ldregs(R1+R2,C,bfr);
    XC(0+b(1),256,0+b(2));
    bfr += 256;
  }
  for (j = 0; j < 64; j++)
    k ^= C[j];               // accumulate word checksum in k
  c = (char) k;
  c ^= (char) (k >> 8);
  c ^= (char) (k >> 16);
  c ^= (char) (k >> 24);

The XC version took a little over 1/100 cpu second!
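The fold from a word checksum down to a byte checksum, and the question of what to do when the length is not a multiple of the chunk size, come up in every one of the wider variants above. As a purely portable illustration of the same "XOR wider chunks, then fold" idea, here is a hypothetical standard-C sketch of my own; it handles the leftover bytes directly instead of padding (an option the note near the end of this tip also mentions), and it obviously cannot reproduce the LM or XC instruction savings of the SAS/C versions.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Portable sketch of "XOR wider chunks, then fold down to a byte".
     Name and structure are illustrative, not from the original tip. */
  unsigned char xor_checksum_wide(const unsigned char *bfr, size_t len)
  {
      uint32_t acc = 0, word;
      size_t i = 0;

      /* whole words first: each storage access covers 4 bytes */
      for (; i + sizeof word <= len; i += sizeof word) {
          memcpy(&word, bfr + i, sizeof word);   /* sidesteps alignment issues */
          acc ^= word;
      }

      /* leftover bytes, instead of padding the buffer with zeros */
      for (; i < len; i++)
          acc ^= (uint32_t) bfr[i];

      /* fold the word accumulator down to a single byte */
      acc ^= acc >> 16;
      acc ^= acc >> 8;
      return (unsigned char) acc;
  }

Because XOR is associative and commutative, the fold produces the same byte no matter how the bytes were grouped along the way.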
We earlier noted that the XOR'ing order was unimportant. It is also intuitively obvious that the width of the XOR operation (bit, byte, word, etc.) is irrelevant as long as we take care to end up with the desired width AND as long as every source bit/byte/word/whatever participates just once. What is non-intuitive is that a reduced number of the most offending SS instructions may actually be the performance answer in some cases. The XC (Exclusive Or Character) instruction's results are even more surprising in that its storage access involves both a READ and a WRITE, and WRITEs have an even bigger cpu impact. Welcome to the world of S/390 cpu performance! {smile}

I tried variations doing multiple LM's or XC's within the inline code. I also moved the loop count (j) within the inline, like:

  _ldregs(R1+R2+R3,C,bfr,j);  // pass counter j to the inline loop
  BALR(14,0);                 // set R14 to the address of the XC (the loop target)
  XC(0+b(1),256,0+b(2));
  LA(2,0,256+b(2));           // bump to next 256-byte chunk
  BCTR(3,14);                 // decrement reg 3 (was j) and loop back to the XC if any chunks remain

Now it's better than 1/100 cpu second.

You probably don't fight to keep the original pristine code. Neither do you go for the most convoluted example of obfuscated code just because it's marginally the fastest (unless you're a consultant). You trade off performance vs. maintainability while weighing the cpu improvement as a percentage of the total. The XC loop example above was 25 times faster than the original source. Still, I might very well leave the original code (or the C-only word version) if we were only talking about a small overall cpu improvement.

Let's get cute. Suppose the checksum logic is needed both to verify received data and to generate a new checksum if the data is modified. You may have some nifty shortcuts depending on the nature of the changes made. For example, if data is appended or inserted, you can use the prior checksum with the new bytes XOR'd into it. If a chunk of data is removed, you can take the old checksum and XOR it with the data being removed! Remember, XOR with 0 has no effect and XOR with 1 does a toggle. That means an additional XOR with a byte that was already a component of the checksum effectively undoes its toggles; the resulting checksum is as if those bytes were never included in the first place. It's faster to do a few XOR's when 10 or 20 bytes are changed, added or removed rather than redo the entire buffer, no?
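As a concrete illustration of that shortcut, here is a hypothetical helper in plain C (the function name is mine, not from the tip). XOR'ing bytes into an existing checksum adds their contribution, and XOR'ing in bytes that were already counted toggles that contribution back out, so the very same routine serves for both appends and removals.

  #include <stddef.h>

  /* Hypothetical sketch of incremental checksum maintenance. */
  unsigned char xor_checksum_update(unsigned char old_sum,
                                    const unsigned char *bytes, size_t len)
  {
      while (len--)
          old_sum ^= bytes[len];   /* the same XOR adds or removes bytes */
      return old_sum;
  }

  /* Usage sketch:
     new_sum = xor_checksum_update(old_sum, appended_bytes, n);  // bytes added
     new_sum = xor_checksum_update(old_sum, removed_bytes, n);   // bytes removed
     For a changed byte, XOR out the old value and XOR in the new one. */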
The bottom line is that non-intuitive approaches are, unfortunately, worth taking on occasion, and they may require assembler coding. Dropping to assembler does not always help performance. After all, that's what the compiler is doing for you anyway. {smile} But dropping to assembler can be the answer where the need is sufficient and there's no other way to effectively implement a non-intuitive approach.

Note: If your application deals with very large buffers, the padding overhead from "extra" XOR'ing won't be worth worrying about. However, you can do a little tweaking here and there. It might be worthwhile to drop the padding idea and just do the last partial chunk using the earlier byte-by-byte or word-by-word approach.

*best compiler optimization: I've worked on compiler optimizers enough not to completely rule out their ability to recognize the required side effects of the simple while loop and, with an appreciation of the underlying machine instruction set, produce the result by a radically different method. I just haven't seen it yet. The optimizer would also have to recognize that the code was significant enough to be worth changing. Perhaps it could if 'len' were hardcoded with a large value and/or the optimizer could somehow determine (at compile time?) how frequently the algorithm would be invoked at runtime. More likely, future optimizers will incorporate statistics gathering into the generated code and have those runtime stats fed back into recompile iterations.

------------------------------------------------------
About the author: Jim Keohane ([EMAIL PROTECTED]) is president of New York consulting company Multi-Platforms, Inc. His company specializes in commercial software development and consulting, with an emphasis on cross-platform and performance issues.

Did you like this tip? Send us an email ([EMAIL PROTECTED]) to let us know your thoughts, or rate this tip by scrolling to the bottom of
http://www.search390.com/tip/1,289483,sid10_gci765019,00.html

=======================================================
search390 Featured Topic: Database management
=======================================================
Maintaining corporate data is a daunting task. Getting that data to the Web is another challenge many DBAs face. When it comes to database-Web connectivity, which system do you think meets the challenge best: IMS or DB2?
http://search390.techtarget.com/featuredTopic/0,290042,sid10_gci764289,00.html
=======================================================

=======================================================
Share the knowledge!
=======================================================
Have COBOL-related questions? Then be sure to join us tomorrow in the Developer discussion forum. Tom Ross of IBM will be live in the Developer forum tomorrow from 1-2 pm EDT and again from 4-5 pm EDT to answer your programming questions live! Don't miss the chance to get instant answers to your code questions from a leading COBOL expert! If you'd like to start posting questions for Tom now, go to:
http://search390.discussions.techtarget.com/WebX?[EMAIL PROTECTED]^[email protected]

See you there! Also check out our other search390 forums on operating systems, e-business, and sound off (aka miscellaneous!). There's sure to be something there to spark your interest.
http://search390.discussions.techtarget.com/WebX?[EMAIL PROTECTED]^0@/search390

And don't be left out of the loop when it comes to the latest S/390 news. Whether you're a Developer, Systems Manager, or busy Web Enabling the mainframe, you need to stay abreast of the latest S/390 news and technologies. As a member of search390, you can get the News delivered right to your inbox. To try out this free e-mail service for yourself, just update your user profile at
http://search390.techtarget.com/register/1,,sid10,00.html and be sure you select the "News" checkbox.

========================================================
Disclaimer: Our tips exchange is a forum for you to share technical advice and expertise with your peers and to learn from other IT professionals. TechTarget.com provides the infrastructure to facilitate this sharing of information. However, we can't guarantee the accuracy and validity of the material submitted. You agree that your use of the ask the expert services and your reliance on any questions, answers, information or other materials received through the web site will be at your own risk.
========================================================

========================================================
If you would like to sponsor this or any TechTarget newsletter, please contact Tim Keough at [EMAIL PROTECTED]
========================================================

If you no longer wish to receive this newsletter simply reply to this message with "REMOVE" in the subject line. Or, visit
http://search390.techtarget.com/register
and adjust your subscriptions accordingly. If you choose to unsubscribe using our automated processing, you must send the "REMOVE" request from the email account to which this newsletter was delivered. Please allow 24 hours for your "REMOVE" request to be processed.
