This is fascinating, Jon, and I'm sure not disputing any of what you say, just have a basic question:
Who is using expensive IBM Z MIPS for AI? Anybody? Why? And I mean a real use case, not tinkering or one very narrow application. I mean something that could justify a z17. Yeah, yeah, I know, "closer to the data". Not at those prices: copy the data and use cheap cycles. I *want* to believe in anything that makes IBM Z valuable, don't get me wrong--but all I've ever heard is "Z good, AI good, therefore AI on Z good", and that's not a syllogism that a priori makes sense. And from what I've heard out of POK in the last few years, the mere fact that they did Telum and Spyre doesn't convince me that there's a real business case there.

-----Original Message-----
From: IBM Mainframe Assembler List <ASSEMBLER-LIST@LISTSERV.UGA.EDU> On Behalf Of Jon Perryman
Sent: Thursday, August 21, 2025 4:56 PM
To: ASSEMBLER-LIST@LISTSERV.UGA.EDU
Subject: Vector instruction performance WAS: Execute-Type Instructions

On Wed, 20 Aug 2025 18:50:48 +0000, Ngan, Robert <robert.n...@dxc.com> wrote:

>I thought "All those vector instructions are better than a single ED?"
>Today, I decided to recompile the module
>
>   MVC   536(6,R13),2403(R3)    # TS2=119 +2403
>   LLH   R2,8(,R2)              # U010-CO-ID
>   CVD   R2,1050(,R13)          #
>   ED    536(6,R13),1055(R13)   # TS2=119
>   L     R2,88(,R8)             # BLL_9
>   MVC   8(5,R2),537(R13)       # U016-CO-ID-DISPLAY TS2=119

Does anyone believe IBM is committed to vector instructions when IBM Telum has tiny 16-byte vector registers compared to other CPUs (64, 128, 256 or 512 bytes)? I'm guessing companies won't buy IBM without vector instructions, proving you can't fix stupid. You can certainly find a use for these instructions, but are they the best solution? Apparently, the COBOL compiler developers found out the hard way, as Robert discovered. How are we so gullible that we believe everything Unix programmers say?

Let's learn how vectors solve a simple problem the hard way. Vector instructions are critical to non-zArch CPU performance. Let's consider MOVE A TO B, because if they can't do the easy stuff right, what makes anyone think they are doing the difficult stuff right?

1. Traditionally, CPUs have 7 usable registers. MOVE A TO B is very simple but extremely slow:

   L     R3,0(R1)       get source data into reg
   ST    R3,0(R2)       save source data to destination
   AHI   R1,##          point to next source data
   AHI   R2,##          point to next destination
   repeat until all data moved.

4 of the 7 registers are needed to move data. Moving data is very expensive using this method. Ask yourself why 64-bit was so important: you could move 8 bytes instead of 4 bytes, thus doubling CPU speed for moving data.

2. Moving data with vector instructions follows the same pattern, except it uses vector loads, stores and registers.

2a. There are 32 vector registers (same as zArch).

2b. Some vector registers are 64 bytes (512 bits, as in Intel x86 AVX-512). 64 bytes times 32 regs is 2KB. IBM Telum vector registers are only 16 bytes. 16 bytes times 32 regs is 512 bytes (25% of AVX-512).

   VL    VR1,0(R1)      get source data into reg
   VL    VR2,64(R1)     get source data into reg
   VL    VR3,128(R1)    get source data into reg
   VL    VR4,192(R1)    get source data into reg
   VST   VR1,0(R2)      save source data to destination
   VST   VR2,64(R2)     save source data to destination
   VST   VR3,128(R2)    save source data to destination
   VST   VR4,192(R2)    save source data to destination
   AHI   R1,256         point to next source data
   AHI   R2,256         point to next destination
   repeat until all data moved.

Most architectures use 4 or fewer vector registers, so they are moving 256 bytes at a time, which is the equivalent of a single MVC that IBM architected in the 1960s. Thank god for fast CPUs!
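For comparison, here is a minimal sketch (illustrative only, not from any compiler listing) of the same 256-bytes-per-pass copy written with one MVC per iteration. The COPYLOOP label, the use of R0 as the remaining-length counter, and the assumption that the length is a non-zero multiple of 256 are all mine:

* Illustrative sketch only: assumes R1 = source address, R2 = destination
* address, R0 = bytes remaining (a non-zero multiple of 256).
COPYLOOP MVC   0(256,R2),0(R1)    copy 256 bytes with one instruction
         AHI   R1,256             point to next source block
         AHI   R2,256             point to next destination block
         AHI   R0,-256            count down the bytes remaining
         JH    COPYLOOP           repeat while bytes remain (CC=2)

Five instructions per 256 bytes instead of eleven, and that is before MVCL/MVCLE remove the loop entirely.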
3. zArch MVC, MVCL, and MVCLE are dramatically faster than vector instructions:

3a. MVC and MVCL have existed since 1970 and have never needed external changes to improve performance.

3b. zArch microcode must be using hidden buffers instead of user-facing vector registers. I'm guessing at least 4K or larger. Huge compared to a vector loop moving 256 bytes (16X faster on a 4K move).

3c. zArch appears to be using a masked move (the mask being the size to move). A few vector architectures have masked moves, but most implementations use calculations to determine power-of-2 moves that fit in the remaining size (e.g. 1, 2, 4, 8, ..., 256 bytes).

3d. One MVCLE instruction is decoded instead of potentially hundreds to thousands of instructions (a minimal MVCLE sketch follows at the end of this note).

3e. There's a reason the IBM Telum pipeline is only 6 instructions deep versus 15 to 30 instructions on other architectures.

3f. MVCLE can easily calculate storage prefetch to optimize the move. Only a few non-IBM implementations use prefetch, and they simply pick a number (e.g. 756 bytes).

3g. If zArch microcode used the vector implementation internally, then that's at least 19 instructions in L2 that don't need to be continually decoded.

3h. zArch has vector instructions VLM and VSTM that can specify 16 vector registers (256 bytes), using 6 fewer instructions than the loop above.

3i. I suspect that vector microcode resides in L2 cache because of low use. High-use microcode probably resides permanently in L1 cache. Being microcode, it's probably permanently decoded.

The computer industry (e.g. Google) is delusional to ignore the power of the IBM Telum, with each core being several times faster than any other CPU core today because of its instruction set.
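As promised in 3d, a minimal MVCLE sketch. This is illustrative only: the field names DEST, SRC and LEN, the 4K size, the MOVEIT label, and the choice of the R2/R3 and R4/R5 even-odd register pairs are my assumptions, not anything from the original listing:

* Illustrative sketch only: DEST/SRC/LEN, MOVEIT and the register pairs
* are assumptions. MVCLE uses even-odd pairs: address in the even
* register, length in the odd one. CC3 means the move was interrupted
* and must be resumed, which is all the loop is for.
         LA    R2,DEST            R2/R3 = destination address/length
         L     R3,LEN
         LA    R4,SRC             R4/R5 = source address/length
         L     R5,LEN
MOVEIT   MVCLE R2,R4,0            move storage, pad byte X'00'
         JO    MOVEIT             CC3 = interrupted, resume the move
...
DEST     DS    CL4096             hypothetical 4K destination buffer
SRC      DS    CL4096             hypothetical 4K source buffer
LEN      DC    F'4096'            number of bytes to move

One instruction does the whole move, the lengths live in the register pairs, and no per-256-byte loop ever reaches the decoder.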