Pipeline question

Fred van der Windt Tue, 09 Aug 2011 00:46:04 -0700

This is a small part of code from a routine that decodes Base64 encoded data:
DO    FROM=(R2,)
  LG    R0,0(,R4)         Load 8 source bytes
  LA    R4,8(,R4)         R4 past 8 source bytes
  RISBG R0,R0,06,11,02    aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
  RISBG R0,R0,12,17,04    aaabbbccccc0ddd0eee0fff0ggg0hhh0
  RISBG R0,R0,18,23,06    aaabbbcccdddddd0eee0fff0ggg0hhh0
  RISBG R0,R0,24,29,08    aaabbbcccdddeee0eee0fff0ggg0hhh0
  RISBG R0,R0,30,35,10    aaabbbcccdddeeefffe0fff0ggg0hhh0
  RISBG R0,R0,36,41,12    aaabbbcccdddeeefffgggff0ggg0hhh0
  RISBG R0,R0,42,47,14    aaabbbcccdddeeefffggghhhggg0hhh0
  STG   R0,0(R6)          Store 8 bytes, first 6 are databytes
  LA    R6,6(,R6)         R6 past 6 databytes
ENDDO
It takes 8 source bytes (each already translated to contain 6 data bits) and 
combines the six data bits from each byte to create 6 data bytes. The code 
contains a series of RISBG instructions that act on the same register. I 
figured it might be a lot faster to use two registers and interleave these 
operations:
DO    FROM=(R2,)
  LMG   R0,R1,0(,R4)      Load 16 source bytes
  LA    R4,16(,R4)
  RISBG R0,R0,06,11,02    aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
  RISBG R1,R1,06,11,02    aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
  RISBG R0,R0,12,17,04    aaabbbccccc0ddd0eee0fff0ggg0hhh0
  RISBG R1,R1,12,17,04    aaabbbccccc0ddd0eee0fff0ggg0hhh0
  RISBG R0,R0,18,23,06    aaabbbcccdddddd0eee0fff0ggg0hhh0
  RISBG R1,R1,18,23,06    aaabbbcccdddddd0eee0fff0ggg0hhh0
  RISBG R0,R0,24,29,08    aaabbbcccdddeee0eee0fff0ggg0hhh0
  RISBG R1,R1,24,29,08    aaabbbcccdddeee0eee0fff0ggg0hhh0
  RISBG R0,R0,30,35,10    aaabbbcccdddeeefffe0fff0ggg0hhh0
  RISBG R1,R1,30,35,10    aaabbbcccdddeeefffe0fff0ggg0hhh0
  RISBG R0,R0,36,41,12    aaabbbcccdddeeefffgggff0ggg0hhh0
  RISBG R1,R1,36,41,12    aaabbbcccdddeeefffgggff0ggg0hhh0
  RISBG R0,R0,42,47,14    aaabbbcccdddeeefffggghhhggg0hhh0
  RISBG R1,R1,42,47,14    aaabbbcccdddeeefffggghhhggg0hhh0
  STG   R0,0(R6)          Store 8 bytes, first 6 are databytes
  STG   R1,6(R6)          Store 8 bytes, overlaps R0's last 2
  LA    R6,12(,R6)
ENDDO
This requires only half the number of iterations and has fewer pipeline 
dependencies.
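
For checking the bit diagrams in the comments, here is a rough Python model of what the RISBG sequence computes. It is only a sketch, not the routine itself. The names risbg, pack8 and decode are made up for illustration. It assumes each translated source byte holds its 6 data bits left-aligned (value shifted left by 2), which is what the diagrams suggest, and it models RISBG without the zero-remaining-bits flag.

```python
MASK64 = (1 << 64) - 1

def risbg(r1, r2, i3, i4, i5):
    """Model of RISBG R1,R2,I3,I4,I5 (bit 0 is the most significant bit).

    Assumes i3 <= i4 and that the zero-remaining-bits flag is off.
    """
    rotated = ((r2 << i5) | (r2 >> (64 - i5))) & MASK64   # rotate left by i5
    width = i4 - i3 + 1
    mask = ((1 << width) - 1) << (63 - i4)                 # selects bits i3..i4
    return (r1 & ~mask & MASK64) | (rotated & mask)

def pack8(doubleword):
    """Apply the seven RISBG steps of the loop to one 64-bit value."""
    r0 = doubleword
    for i3, i4, i5 in [(6, 11, 2), (12, 17, 4), (18, 23, 6), (24, 29, 8),
                       (30, 35, 10), (36, 41, 12), (42, 47, 14)]:
        r0 = risbg(r0, r0, i3, i4, i5)
    return r0   # top 48 bits are data, low 16 bits are leftovers

def decode(src):
    """Pack each group of 8 translated bytes into 6 data bytes."""
    groups = len(src) // 8
    out = bytearray(groups * 6 + 2)          # 2 spare bytes for the last store
    for n in range(groups):
        r0 = int.from_bytes(src[8 * n:8 * n + 8], "big")       # like LG
        # Like STG: 8 bytes are stored, the last 2 get overwritten
        # by the next group's store.
        out[6 * n:6 * n + 8] = pack8(r0).to_bytes(8, "big")
    return bytes(out[:-2])

# "TWFuTWFu" in Base64 maps to the 6-bit values below and decodes to "ManMan".
values = [19, 22, 5, 46] * 2
src = bytes(v << 2 for v in values)
print(decode(src))   # b'ManMan'
```

Note that in both loops each STG writes 8 bytes while the output pointer only advances by 6 per doubleword, so consecutive stores overlap by 2 bytes. In the two-register version the two STGs inside one iteration overlap in the same way.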


I was quite surprised to find out that the second version takes about as long 
as the first version (in some runs it even appears to be a little bit slower). 
Have I introduced some weird dependency between the loads, the two stores, or 
R0 and R1 that negates everything? Any suggestions?

Fred!
