On Tue, 9 Aug 2011 09:43:18 +0200, Fred van der Windt wrote:

>This is a small part of code from a routine that decodes Base64 encoded data:
>
>DO FROM=(R2,)
>  LG    R0,0(,R4)          Load 8 source bytes
>  LA    R4,8(,R4)          R4 past 8 source bytes
>  RISBG R0,R0,06,11,02     aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,12,17,04     aaabbbccccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,18,23,06     aaabbbcccdddddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,24,29,08     aaabbbcccdddeee0eee0fff0ggg0hhh0
>  RISBG R0,R0,30,35,10     aaabbbcccdddeeefffe0fff0ggg0hhh0
>  RISBG R0,R0,36,41,12     aaabbbcccdddeeefffgggff0ggg0hhh0
>  RISBG R0,R0,42,47,14     aaabbbcccdddeeefffggghhhggg0hhh0
>  STG   R0,0(R6)           Store 6 databytes
>  LA    R6,6(,R6)          R6 past 6 databytes
>ENDDO
>
>It takes 8 source bytes (already translated to contain 6 databits each) and combines the six databits in each byte to create 6 databytes. The code contains a series of RISBG instructions that act on the same register. I figured it might be a lot faster to use two registers and interleave these operations:
>
>DO FROM=(R2,)
>  LMG   R0,R1,0(,R4)       Load 16 source bytes
>  LA    R4,16(,R4)
>  RISBG R0,R0,06,11,02     aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
>  RISBG R1,R1,06,11,02     aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,12,17,04     aaabbbccccc0ddd0eee0fff0ggg0hhh0
>  RISBG R1,R1,12,17,04     aaabbbccccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,18,23,06     aaabbbcccdddddd0eee0fff0ggg0hhh0
>  RISBG R1,R1,18,23,06     aaabbbcccdddddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,24,29,08     aaabbbcccdddeee0eee0fff0ggg0hhh0
>  RISBG R1,R1,24,29,08     aaabbbcccdddeee0eee0fff0ggg0hhh0
>  RISBG R0,R0,30,35,10     aaabbbcccdddeeefffe0fff0ggg0hhh0
>  RISBG R1,R1,30,35,10     aaabbbcccdddeeefffe0fff0ggg0hhh0
>  RISBG R0,R0,36,41,12     aaabbbcccdddeeefffgggff0ggg0hhh0
>  RISBG R1,R1,36,41,12     aaabbbcccdddeeefffgggff0ggg0hhh0
>  RISBG R0,R0,42,47,14     aaabbbcccdddeeefffggghhhggg0hhh0
>  RISBG R1,R1,42,47,14     aaabbbcccdddeeefffggghhhggg0hhh0
>  STG   R0,0(R6)           Store 6 databytes
>  STG   R1,6(R6)           Store 6 databytes
>  LA    R6,12(,R6)
>ENDDO
>
>This requires only half the number of iterations and has fewer pipeline dependencies.
>
>I was quite surprised to find out that the second version takes about as long as the first version (in some runs it even appears to be a little bit slower). Have I introduced some weird dependency between the two loads, the two stores, or R0 and R1 that negates everything? Any suggestions?
>
>Fred!

I can see two problems.

The first (smaller) problem is that LMG is slower than 2 LG instructions, and the entire LMG must complete before R0 can be referenced.

The second problem is that the two STG instructions overlap: each STG writes 8 bytes, and the second one starts only 6 bytes after the first. The result of the first STG must be pushed into cache before the second STG can be started. My guess is that this is the real problem. And it won't help to store only 6 bytes of each register, because 3/4 of the time the same doubleword in cache will be changed.

It might be better to use additional registers to pack the results into 64 bits, store that, and then save the remaining 32 output bits for the next time through the loop. It will probably be easier and faster to unroll the loop again so that 4 doublewords are loaded and processed, then pack the results into 3 64-bit registers and store 3 doublewords.

David Bond
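
The last suggestion above (process 4 doublewords, pack into 3 registers, store 3 non-overlapping doublewords) can be checked with a small C model before writing the assembler. This is only a sketch under assumptions, not code from either poster: the helper names risbg, squeeze and pack4 are made up, Ra..Rd in the comments are placeholder registers, and the source bytes are assumed to carry their 6 databits in the high-order bits of each byte (which is what the rotate amounts in the quoted code imply). It models bit movement only and says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Model of RISBG R1,R2,I3,I4,I5 with the zero flag off: rotate src left by
   rot, then copy bits start..end (bit 0 = leftmost, as in z/Architecture)
   into dst, leaving all other bits of dst unchanged. */
static uint64_t risbg(uint64_t dst, uint64_t src,
                      unsigned start, unsigned end, unsigned rot)
{
    assert(start <= end && end < 64 && rot < 64);
    uint64_t rotated = rot ? (src << rot) | (src >> (64 - rot)) : src;
    uint64_t mask = (~UINT64_C(0) >> start) & (~UINT64_C(0) << (63 - end));
    return (dst & ~mask) | (rotated & mask);
}

/* The seven RISBG steps from the quoted loop: 06,11,02 through 42,47,14.
   Afterwards bits 0-47 hold the 48 databits; bits 48-63 are leftovers. */
static uint64_t squeeze(uint64_t r)
{
    for (unsigned k = 1; k <= 7; k++)
        r = risbg(r, r, 6 * k, 6 * k + 5, 2 * k);
    return r;
}

/* Pack four squeezed registers (4 x 48 bits) into three full doublewords,
   so that three non-overlapping 8-byte stores can be used. */
static void pack4(const uint64_t in[4], uint64_t out[3])
{
    uint64_t a = squeeze(in[0]), b = squeeze(in[1]);
    uint64_t c = squeeze(in[2]), d = squeeze(in[3]);

    /* RISBG Ra,Rb,48,63,16 : a[0..47] followed by b[0..15] */
    out[0] = risbg(a, b, 48, 63, 16);
    /* SLLG Rb,Rb,16 ; RISBG Rb,Rc,32,63,32 : b[16..47] followed by c[0..31] */
    out[1] = risbg(b << 16, c, 32, 63, 32);
    /* SRLG Rd,Rd,16 ; RISBG Rd,Rc,0,15,32 : c[32..47] followed by d[0..47] */
    out[2] = risbg(d >> 16, c, 0, 15, 32);
}
```

In this model the extra work per 32 source bytes is five shift or rotate-and-insert operations, and the three results could then be stored at offsets 0, 8 and 16 from R6 with R6 advanced by 24. Whether that is actually faster than the overlapping stores would have to be measured on the machine in question.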
