Sometimes compilers are better than humans. For example, multiply is expensive and divide is even more expensive, but the compiler can handle constants.

*Multiply*
x * 10 = x * 8 + x * 2 = (x shift left 3 bits) + (x shift left 1 bit)
The two shifts can be done in parallel and then added (two clock ticks).
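To make the multiply case concrete, here is a small C sketch of the same arithmetic. The names times10 and check_times10 are made up for illustration, and this only writes the shifts out by hand; what a particular compiler really emits for x * 10 depends on the target machine and the options used.

```c
#include <assert.h>
#include <stdint.h>

/* x * 10 written as two shifts and one add: x*8 + x*2. */
uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Compare the shift-and-add form with a plain multiply over a range of inputs. */
void check_times10(void)
{
    for (uint32_t x = 0; x < 100000u; x++)
        assert(times10(x) == x * 10u);
}
```

Because the arithmetic is unsigned, the two forms also agree when the result wraps around at 32 bits. Signed values would need more care.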
*Divide*
x / 3 ~= x * 342 / 1024, computed as (x * 342) >> 10 bits
17 / 3 = (17 * 342) >> 10 = 5814 >> 10 = 5 (2 + 1 clock ticks)

The multiplier has to be rounded up. 1024 / 3 = 341.33, and with 341 a positive multiple of 3 comes out too small: 3 * 341 = 1023, and 1023 >> 10 = 0, not 1. With 342 the result is exact for x from 0 to 511; a larger multiplier and shift covers a larger range.

These numbers (342, 1024) are not unique:
1024 / 3 = 341.33, so 1/3 ~= 342/1024
65536 / 3 = 21845.33, so 1/3 ~= 21846/65536

I thought this amazing - multiply and divide without using multiply or divide instructions.

Colin

On Sun, 24 Aug 2025 at 11:39, Jonathan Scott <00001b5498fc732f-dmarc-requ...@listserv.uga.edu> wrote:

> I totally agree that in most cases performance is achieved by using the
> right design and algorithms. Simplicity and reliability of code is also
> very important, and for code which is not performance-critical there is
> little point in attempting local optimization at the expense of
> simplicity. It is usually only for extremely intensively executed code
> (innermost loops) where any sort of local optimization is worth the
> effort. It used to be that reordering sequences of instructions to avoid
> address generation interlocks and other pipeline blocks could achieve
> significant improvements, but recent IBM Z processors now handle much of
> that automatically. Keeping things in registers (including vector
> registers) to avoid storage access is still useful, and some newer
> instructions can help simplify code as well as improving performance, for
> example the "interlocked-access facility 1" makes it simple and fast to use
> ASI to update shared counters. The IBM Z hardware people have always said
> that you should use obvious standard sequences of code as those will be the
> ones that they are trying to optimise, so for example exclusive-or of a
> storage location with itself is typically interpreted as an instruction to
> store zeroes in that storage, and the standard MVC with offset of 1 byte is
> interpreted as an instruction to fill storage with a pad byte.
> There are a few performance oddities that are worth noting at the
> algorithm level, for example if you repeatedly look at the same offset in
> many 4K pages you may get performance degradation because there are only a
> limited number of cache lines for each 256-byte range, so it may be better
> to maintain a separate compact index containing the same information.
>
> And comments are essential not just for future readers of the code, but
> also to ensure that the person writing the code can explain what they are
> doing, ensuring they have a full understanding. I generally wrote the
> block comments before I wrote the code. Back in the late 1970s I wrote a
> very concise piece of bit-twiddling code to set VSAM options which was
> particularly tricky to understand, despite detailed comments, and after
> finding myself rechecking it several times over the years, I added a
> comment saying "This code is correct. Do not waste time checking it. If
> there is a bug, it is somewhere else!". Some years later, long after I had
> left that company, I received a note thanking me for how much time that
> comment had saved!
>
> Jonathan Scott
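Coming back to the divide-by-3 trick at the top of this message, here is a C sketch of it under a few assumptions. The names div3_small, div3 and check_div3 are hypothetical. Both multipliers are rounded up (342 = ceil(1024 / 3), 0xAAAAAAAB = ceil(2^33 / 3)); a truncated multiplier such as 341 gives (3 * 341) >> 10 = 0 instead of 1. The small version is only exact for x below 512, while the wider constant covers every 32-bit unsigned value. A real compiler may choose a different but equivalent sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Small version: multiplier 342 = ceil(1024 / 3), exact only for 0 <= x < 512. */
uint32_t div3_small(uint32_t x)
{
    return (x * 342u) >> 10;
}

/* 32-bit unsigned version: 0xAAAAAAAB = ceil(2^33 / 3).
   The product is formed in 64 bits so that no high bits are lost. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* Compare both versions with the ordinary divide. */
void check_div3(void)
{
    for (uint32_t x = 0; x < 512u; x++)
        assert(div3_small(x) == x / 3u);

    /* 512 is the first input where 342/1024 is too coarse. */
    assert(div3_small(512u) != 512u / 3u);

    for (uint32_t x = 0; x < 1000000u; x++)
        assert(div3(x) == x / 3u);
    assert(div3(UINT32_MAX) == UINT32_MAX / 3u);
}
```

The trade-off is the usual one: a bigger shift needs a wider product but stays exact over a wider range of x.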