On Wed, 20 Aug 2025 18:50:48 +0000, Ngan, Robert <robert.n...@dxc.com> wrote:
>I thought "All those vector instructions are better than a single ED?"
>Today, I decided to recompile the module
>
> MVC   536(6,R13),2403(R3)    # TS2=119 +2403
> LLH   R2,8(,R2)              # U010-CO-ID
> CVD   R2,1050(,R13)          #
> ED    536(6,R13),1055(R13)   # TS2=119
> L     R2,88(,R8)             # BLL_9
> MVC   8(5,R2),537(R13)       # U016-CO-ID-DISPLAY TS2=119

Does anyone believe IBM is committed to vector instructions when IBM Telum has tiny 16 byte vector registers compared to other CPUs (up to 64 bytes, i.e. 512 bits, with AVX-512)? I'm guessing companies won't buy IBM without vector instructions, proving you can't fix stupid. You can certainly find a use for these instructions, but are they the best solution? Apparently, the COBOL compiler developers found out the hard way, as Robert discovered.

How are we so gullible that we believe everything Unix programmers say? Let's learn how vectors solve a simple problem the hard way. Vector instructions are critical to NON-zArch CPU performance. Let's consider MOVE A TO B, because if they can't do the easy stuff right, what makes anyone think they are doing the difficult stuff right?

1. Traditionally, CPUs have 7 usable registers. MOVE A TO B is very simple but extremely slow:

    L     R3,0(R1)     get source data into reg
    ST    R3,0(R2)     save source data to destination
    AHI   R1,##        point to next source data
    AHI   R2,##        point to next destination
    repeat until all data moved.

4 of the 7 registers are needed to move data. Moving data is very expensive using this method. Ask yourself why 64 bit was so important: you could move 8 bytes per load/store instead of 4, thus doubling the speed of moving data.

2. Moving data with vector instructions works the same way, except that it uses vector instructions and vector registers.
2a. There are 32 vector registers (same as zArch).
2b. Some vector registers are 64 bytes (512 bits, as in Intel x86 AVX-512). 64 bytes times 32 regs is 2KB. IBM Telum vector registers are only 16 bytes. 16 bytes times 32 regs is 512 bytes (25% of AVX-512).
    VL    VR1,0(R1)     get source data into reg
    VL    VR2,64(R1)    get source data into reg
    VL    VR3,128(R1)   get source data into reg
    VL    VR4,192(R1)   get source data into reg
    VST   VR1,0(R2)     save source data to destination
    VST   VR2,64(R2)    save source data to destination
    VST   VR3,128(R2)   save source data to destination
    VST   VR4,192(R2)   save source data to destination
    AHI   R1,256        point to next source data
    AHI   R2,256        point to next destination
    repeat until all data moved.

Most architectures use 4 or fewer vector registers. With 64-byte registers they are moving 256 bytes at a time, which is the equivalent of an MVC, which IBM architected in the 1960s. Thank god for fast CPUs!

3. zArch MVC, MVCL, and MVCLE are much faster than vector instructions:
3a. MVC (since the 1960s) and MVCL (since System/370 in 1970) have never changed externally to improve performance.
3b. zArch microcode must be using hidden buffers instead of user-facing vector registers. I'm guessing 4K or larger. That is huge compared to a vector loop moving 256 bytes per pass (16X as much per pass on a 4K move).
3c. zArch appears to be using a masked move (the mask being the size to move). A few vector architectures have masked moves, but most implementations use calculations to determine power of 2 moves that fit in the remaining size (e.g. 1 byte, 2, 4, 8, ..., 256 bytes).
3d. One MVCLE instruction is decoded instead of potentially hundreds to thousands of instructions.
3e. There's a reason the IBM Telum pipeline is only 6 instructions versus 15 to 30 instructions for other architectures.
3f. MVCLE can easily calculate prefetch of storage to optimize the move. Only a few non-IBM implementations use prefetch, and they simply pick a number (e.g. 756 bytes).
3g. If zArch microcode used the vector implementation internally, then that's at least 19 instructions in L2 that don't need to be continually decoded.
3h. zArch has the vector instructions VLM and VSTM, which can each specify 16 vector registers (256 bytes), giving 6 fewer instructions than the loop above.
3i.
I suspect that vector microcode resides in L2 cache because of low use. High-use microcode probably resides permanently in L1 cache. Being microcode, it's probably permanently decoded.

The computer industry (e.g. Google) is delusional to ignore the power of the IBM Telum, with each core being several times faster than any other CPU core today because of its instruction set.
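
For anyone who wants to play with the numbers, here is a minimal Python sketch of the chunking arithmetic discussed above: full 256-byte passes first, then power of 2 moves that fit in the remaining size. It only models the bookkeeping a software copy loop has to do; it says nothing about how any real CPU, microcode, or library actually moves storage. The names `move_chunks` and `copy_in_chunks` are made up for illustration.

```python
def move_chunks(length, block=256):
    """Split a move of `length` bytes into chunk sizes.

    Full `block`-byte chunks come first; the remainder is covered by
    descending powers of two (block/2, ..., 2, 1), one per set bit.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if block <= 0 or block & (block - 1):
        raise ValueError("block must be a power of two")
    full, tail = divmod(length, block)
    chunks = [block] * full
    size = block >> 1
    while size:
        if tail & size:          # this power of two fits in the remainder
            chunks.append(size)
        size >>= 1
    return chunks


def copy_in_chunks(src, block=256):
    """Copy `src` one chunk at a time, as a software loop would."""
    dst = bytearray(len(src))
    offset = 0
    for size in move_chunks(len(src), block):
        dst[offset:offset + size] = src[offset:offset + size]
        offset += size
    return bytes(dst)


if __name__ == "__main__":
    print(len(move_chunks(4096)))   # passes needed for a 4K move
    print(move_chunks(200))         # pieces needed for a 200-byte tail
```

A 4K move comes out as 16 passes of 256 bytes, which is where the 16X figure comes from, and a 200-byte tail needs three separate moves (128, 64 and 8 bytes) where a single length-driven instruction needs one.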