Re: Pipeline question

David Bond Mon, 15 Aug 2011 09:42:05 -0700

On Tue, 9 Aug 2011 09:43:18 +0200, Fred van der Windt wrote:

>This is a small part of code from a routine that decodes Base64 encoded data:
>DO    FROM=(R2,)
>  LG    R0,0(,R4)         Load 8 source bytes
>  LA    R4,8(,R4)         R4 past 8 source bytes
>  RISBG R0,R0,06,11,02    aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,12,17,04    aaabbbccccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,18,23,06    aaabbbcccdddddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,24,29,08    aaabbbcccdddeee0eee0fff0ggg0hhh0
>  RISBG R0,R0,30,35,10    aaabbbcccdddeeefffe0fff0ggg0hhh0
>  RISBG R0,R0,36,41,12    aaabbbcccdddeeefffgggff0ggg0hhh0
>  RISBG R0,R0,42,47,14    aaabbbcccdddeeefffggghhhggg0hhh0
>  STG   R0,0(R6)          Store 6 databytes
>  LA    R6,6(,R6)         R6 past 6 databytes
>ENDDO
>It takes 8 source bytes (already translated to contain 6 databits each) and
>combines the six databits in each byte to create 6 databytes. The code
>contains a series of RISBG instructions that act on the same register. I
>figured it might be a lot faster to use two registers and interleave these
>operations:
>DO    FROM=(R2,)
>  LMG   R0,R1,0(,R4)      Load 16 source bytes
>  LA    R4,16(,R4)
>  RISBG R0,R0,06,11,02    aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
>  RISBG R1,R1,06,11,02    aaabbbb0ccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,12,17,04    aaabbbccccc0ddd0eee0fff0ggg0hhh0
>  RISBG R1,R1,12,17,04    aaabbbccccc0ddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,18,23,06    aaabbbcccdddddd0eee0fff0ggg0hhh0
>  RISBG R1,R1,18,23,06    aaabbbcccdddddd0eee0fff0ggg0hhh0
>  RISBG R0,R0,24,29,08    aaabbbcccdddeee0eee0fff0ggg0hhh0
>  RISBG R1,R1,24,29,08    aaabbbcccdddeee0eee0fff0ggg0hhh0
>  RISBG R0,R0,30,35,10    aaabbbcccdddeeefffe0fff0ggg0hhh0
>  RISBG R1,R1,30,35,10    aaabbbcccdddeeefffe0fff0ggg0hhh0
>  RISBG R0,R0,36,41,12    aaabbbcccdddeeefffgggff0ggg0hhh0
>  RISBG R1,R1,36,41,12    aaabbbcccdddeeefffgggff0ggg0hhh0
>  RISBG R0,R0,42,47,14    aaabbbcccdddeeefffggghhhggg0hhh0
>  RISBG R1,R1,42,47,14    aaabbbcccdddeeefffggghhhggg0hhh0
>  STG   R0,0(R6)          Store 6 databytes
>  STG   R1,6(R6)          Store 6 databytes
>  LA    R6,12(,R6)
>ENDDO
>This requires only half the number of iterations and has fewer pipeline
>dependencies.
>
>I was quite surprised to find out that the second version takes about as
>long as the first version (in some runs it even appears to be a little bit
>slower). Have I introduced some weird dependency between the loads or two
>stores or R0 and R1 that negates everything? Any suggestions?
>
>Fred!


I can see two problems.  The first (smaller) problem is that LMG is slower
than 2 LG instructions, and the entire LMG must complete before R0 can be
referenced.

The second problem is that the two STG instructions overlap: the STG at
6(R6) rewrites the last two bytes stored by the STG at 0(R6).  The result
of the first STG must be pushed into cache before the second STG can be
started.
My guess is that this is the real problem.  And it won't help to store only
6 bytes of each register because 3/4 of the time the same doubleword in
cache will be changed. It might be better to use additional registers to
pack the results into 64 bits, store that, and then save the remaining 32
output bits for the next time through the loop.  It will probably be easier
and faster to unroll the loop again so that 4 doublewords are loaded and
processed, then pack the results into 3 64-bit registers and store 3
doublewords.
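
A rough, untimed sketch of that last suggestion follows.  It rests on
several assumptions that are not in the original code: R14 and R15 are free
to use as the third and fourth work registers, R2 now counts 32-byte
groups, and any leftover input is handled outside the loop.  A, B, C and D
stand for the 48 packed databits left in bits 0-47 of R0, R1, R14 and R15
by the existing RISBG steps.

DO    FROM=(R2,)              R2 = number of 32-byte groups (assumed)
  LG    R0,0(,R4)             Load 4 x 8 source bytes with
  LG    R1,8(,R4)             separate LGs rather than LMG
  LG    R14,16(,R4)
  LG    R15,24(,R4)
  LA    R4,32(,R4)            R4 past 32 source bytes
* ... the seven RISBG steps for each of R0, R1, R14, R15, interleaved ...
  RISBG R0,R1,48,63,16        R0  = A(0-47)  || B(0-15)
  RISBG R1,R1,0,31,16         R1 bits 0-31 = B(16-47)
  RISBG R1,R14,32,63,32       R1  = B(16-47) || C(0-31)
  SRLG  R15,R15,16            R15 bits 16-63 = D(0-47)
  RISBG R15,R14,0,15,32       R15 = C(32-47) || D(0-47)
  STG   R0,0(,R6)             Store 24 databytes as three
  STG   R1,8(,R6)             doublewords that do not
  STG   R15,16(,R6)           overlap each other
  LA    R6,24(,R6)            R6 past 24 databytes
ENDDO

The five packing instructions add some register-to-register dependencies
of their own, so whether this comes out ahead of the overlapping stores
would have to be measured.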

David Bond
