DO FROM=(R2,)
LMG R0,R1,0(,R4) Load 16 source bytes
LA R4,16(,R4)
RISBG R0,R0,06,11,02 aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
RISBG R1,R1,06,11,02 aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
RISBG R0,R0,12,17,04 aaabbbccccc0ddd0eee0fff0ggg0hhh0
RISBG R1,R1,12,17,04 aaabbbccccc0ddd0eee0fff0ggg0hhh0
RISBG R0,R0,18,23,06 aaabbbcccdddddd0eee0fff0ggg0hhh0
RISBG R1,R1,18,23,06 aaabbbcccdddddd0eee0fff0ggg0hhh0
RISBG R0,R0,24,29,08 aaabbbcccdddeee0eee0fff0ggg0hhh0
RISBG R1,R1,24,29,08 aaabbbcccdddeee0eee0fff0ggg0hhh0
RISBG R0,R0,30,35,10 aaabbbcccdddeeefffe0fff0ggg0hhh0
RISBG R1,R1,30,35,10 aaabbbcccdddeeefffe0fff0ggg0hhh0
RISBG R1,R1,36,41,12 aaabbbcccdddeeefffgggff0ggg0hhh0
RISBG R0,R0,36,41,12 aaabbbcccdddeeefffgggff0ggg0hhh0
RISBG R1,R1,42,47,14 aaabbbcccdddeeefffggghhhggg0hhh0
RISBG R0,R0,42,47,14 aaabbbcccdddeeefffggghhhggg0hhh0
STG R0,0(R6) Store 8 bytes: 6 data bytes + 2 filler
STG R1,6(R6) Store next 6 data bytes over the filler
LA R6,12(,R6)
ENDDO
This requires only half the number of iterations and has fewer pipeline
dependencies.
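For anyone who wants to check the bit diagrams in the comments, the RISBG
chain can be modelled in a few lines of Python. This is only a sketch: the
names risbg, STEPS and pack8to6 are made up for the illustration, the model
ignores the zero-remaining-bits flag and wrap-around bit ranges, and it
assumes each source byte holds its six data bits left-aligned, as the
diagrams suggest. It checks the bit movement only and says nothing about
timing.

```python
MASK64 = (1 << 64) - 1


def rotl64(value, count):
    """Rotate a 64-bit value left by count bits."""
    count %= 64
    return ((value << count) | (value >> (64 - count))) & MASK64


def risbg(r1, r2, start, end, rotate):
    """Simplified RISBG model: rotate r2 left, then insert bits start..end
    (bit 0 is the most significant bit) into r1. Assumes start <= end and
    that the zero-remaining-bits flag is off."""
    width = end - start + 1
    mask = ((1 << width) - 1) << (63 - end)
    return (r1 & ~mask & MASK64) | (rotl64(r2, rotate) & mask)


# (start bit, end bit, rotate amount) for each RISBG in the listing
STEPS = [(6, 11, 2), (12, 17, 4), (18, 23, 6), (24, 29, 8),
         (30, 35, 10), (36, 41, 12), (42, 47, 14)]


def pack8to6(source):
    """Pack the high 6 bits of each of 8 source bytes into 6 bytes."""
    reg = int.from_bytes(source, "big")
    for start, end, rotate in STEPS:
        reg = risbg(reg, reg, start, end, rotate)
    # Only the leftmost 48 bits are data; the low 16 bits are leftovers.
    return reg.to_bytes(8, "big")[:6]


print(pack8to6(b"\xfc" * 8).hex())  # ffffffffffff
```

The same seven steps are applied to R0 and to R1 in the loop, so running
pack8to6 on each 8-byte half of the 16 loaded bytes gives the 12 bytes that
one iteration produces.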
I was quite surprised to find that the second version takes about as
long as the first version (in some runs it even appears to be a little
bit slower). Have I introduced some weird dependency between the loads,
the two stores, or R0 and R1 that negates everything? Any suggestions?