Re: [A51] CPU implementation (more code)

sascha Mon, 02 Nov 2009 14:21:16 -0800

Regarding CPU implementations:

I have attached a 64 wide SIMD implementation (single instruction 64
bits of data)
that does 8 million a5/1 rounds per second on a single core
of a 2ghz core2duo. I am not sure whether -O3 uses SSE automagically,
so i assume not.
I use 64 64bit wide memory locations (so effectively L1 cache) to
store a51 registers vertically, and so i can generate a keystream bit
for 64 independent a5/1 chains with a single set of instructions:


(r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs));

so r1<18>(regs) points to a 64bit memory location that stores bit 18
of R1 for 64 A5/1 engines.

the performance bottleneck is the shifting of R1,R2,R3, since you have
to touch each register.

the calculation of the keystream shown above uses 2 xor instructions,
but since you calculate 64bits, effectively only 2/64 xor per
chain is needed.
The shift, which is essentially this:
  for (i = base - length + 1; i < base; ++i) {
    regs[i] = regs[i + nshift] & clock2 | regs[i + nshift + 1] & clock3;
  }
uses 6 instructions 64 times (effectively 6 instructions per chain).

with 8M A51 rounds per second (each round calculating 64bit), i can generate
512 million keystream bits per second, at 2ghz that is only 4 clocks per
bit.

it does not yet run very well on NVIDIA though, but i expect the playstation
with 128bit registers to perform very well using this technique.

Maybe someone can port that to SSE or Frank you can try to write
an ATI implementation. if it is possible, those ATI cards would skyrocket,
since with proper code generation this implementation can be made to
store values in registers only and not use shared memory which is scarce
on ATI. But if IIRC, ati gpus have 256 registers per thread, and only
64 * 3 are needed. (64 for R1,R2,R3; 64 for the result; 64 for the round
function values; and some for the calculations)

The handling of the round function is not implemented properly.
If you want to calculate values to check against the reference implementation,
you should calculate like 1000 A5/1 rounds (so the round function does not
change).
With advance 0, where the round function value is 0 and value ^ 0 == value,
the reference implementation also calculates round function-less chains.

> Nice work Frank.
> 
> Just wondering, did anyone ever look at using SSE2 for faster CPU code 
> (or maybe even tried)?
> Haven't looked at A5/1 very well, but skipping through A5Cpu.cpp (this 
> is the relevant code, right?) the main loop seems to consist of mostly 
> bitshifts, xors, ors and ands. That should perform alright in SSE.
> 
> Daniël
> 
> Frank A. Stevenson schreef:
> > I cleaned up the ATI brook code, and made a new version of the shared 
> > library, that uses only the CPU for generating chains. The code can be 
> > found here (linux only ATM):
> >
> > http://traxme.net/a5/a5_cpu.tar.gz
> >
> > There is a small python frontend that will generate real chains, (follow 
> > the instructions in my previous post about using the python script) - 
> > also the script can be edited to set the number of threads (cores) you 
> > wish to use.
> >
> > An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose 
> > there is room for assembly optimization here, but that isn't really the 
> > point. I am writing this code, to look into efficient table lookup once 
> > that tables have been generated. The idea is to spread the lookup part 
> > to machines that may not have a GPU.
> >
> > On my machine the code will cause 32 lookups to disk / second, hardly a 
> > cause for alarm, so a bog standard hard disk will do.
> >
> > But if the GPU is used for lookup, the rate will be much higher 
> > (320/sec) - and I am currently copying sorted tables to a slow USB flash 
> > drive, to determine of it can keep up with the pace with respect to 
> > lookup / reads. (It takes some hours to copy 1 million files down to 
> > this device, so I may change my approach to sorting, to something that 
> > is more in line with the 64kb block size commonly found on 8GB flash disks
> >
> > cheers,
> >   Frank
> >
> >
> >
> >
> >
> >
> > _______________________________________________
> > A51 mailing list
> > [email protected]
> > http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
> >
> >   
> _______________________________________________
> A51 mailing list
> [email protected]
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51

#include <iostream>
#include <iomanip>
#include <stdint.h>
#include <stdlib.h>
//#include <xmmintrin.h>

template <typename T>
void vconvert(T * in, T * out, int n, int shift, T mask) {
  T notmask = ~mask;
  for (int i = 0, j = n / 2, outi = 0; i < n / 2; ++i, ++j, outi += 2) {
    out[outi] = in[i] & mask | in[j] << shift & notmask;
    out[outi + 1] = in[j] & notmask | in[i] >> shift & mask;
  }
}

uint64_t masks[6] = {
  0x00000000ffffffff,
  0x0000ffff,
  0x00ff00ff,
  0x0f0f0f0f,
  0x33333333,
  0x55555555
};

template <typename T>
T * do_vconvert(T * in, T * out, int n, uint64_t * masks) {
  int shift = 32;
  for (int i = 0; i < 6; ++i, shift >>= 1) {
    vconvert(in, out, n, shift, masks[i]);
    T * tmp = in;
    in = out;
    out = tmp;
  }
  return in;
}

#define R1_BASE 23 + 22 + 18
#define R2_BASE 23 + 21
#define R3_BASE 22

#define R1_SIZE 19
#define R2_SIZE 22
#define R3_SIZE 23

#define R1_CLOCK 8
#define R2_CLOCK 10
#define R3_CLOCK 10

template <int BIT, typename Regs>
uint64_t & r1(Regs & regs) {
  return regs[R1_BASE - BIT];
}
template <int BIT, typename Regs>
uint64_t & r2(Regs & regs) {
  return regs[R2_BASE - BIT];
}
template <int BIT, typename Regs>
uint64_t & r3(Regs & regs) {
  return regs[R3_BASE - BIT];
}

template <int base, int n>
struct zero_reg_helper {
  template <typename Regs>
  static void doit(Regs & regs) {
    regs[base] = 0;
    zero_reg_helper<base + 1, n - 1>::doit(regs);
  }
};
template <int base>
struct zero_reg_helper<base, 0> {
  template <typename Regs>
  static void doit(Regs & regs) {
  }
};

template <int base, int n, typename Regs>
void zero_reg(Regs & regs) {
  zero_reg_helper<base, n>::doit(regs);
}

template <int base, int length, int nshift, typename Regs>
void lsh_reg(Regs & regs, uint64_t clock3) {
  uint64_t clock2 = ~clock3;
  int i;
  for (i = base - length + 1; i < base; ++i) {
    regs[i] = regs[i + nshift] & clock2 | regs[i + nshift + 1] & clock3;
  }
  regs[i] = regs[i + nshift] & clock2;
  //  zero_reg<base + 2, 2>(regs);
}

struct hexout1 {
};

hexout1 hexout;

std::ostream & operator<<(std::ostream & os, const hexout1 &) {
  os << std::hex << std::setw(16) << std::setfill('0');
  return os;
}

void crunch(uint64_t * in, uint64_t * masks, int runmax) {
  uint64_t regs_arr[64];
  uint64_t * regs;
  uint64_t result[64];
  uint64_t rfunc[64];
  uint64_t DP = 0xffffffffffffffff;

  uint64_t * result_ptr;
  
  for (int i = 0; i < 64; ++i) {
    result[i] = in[i];
  }
  regs = do_vconvert(result, regs_arr, 64, masks);

  result_ptr = result;
  if (result_ptr == regs) {
    result_ptr = regs_arr;
  }

  for (int run = 0; run < runmax; ++run) {

    //    for (int i = 0; i < 64; ++i) {
    //      regs[i] ^= rfunc[i];
    //    }

    uint64_t new_dp = 0;

    for (int round = 0; round < 64; ++round) {
      *result_ptr =
	*result_ptr & ~DP
	|
	(r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs)) & DP;

      if (round < 15) {
	new_dp |= *result_ptr;
      }

      ++result_ptr;
      uint64_t r1c = r1<R1_CLOCK>(regs);
      uint64_t r2c = r2<R2_CLOCK>(regs);
      uint64_t r3c = r3<R3_CLOCK>(regs);
    
      uint64_t r1r2c = ~(r1c ^ r2c);
      uint64_t r1r3c = ~(r1c ^ r3c);
      uint64_t r2r3c = ~(r2c ^ r3c);

      uint64_t do_clock_r1 = r1r2c | r1r3c;
      uint64_t do_clock_r2 = r2r3c | r1r2c;
      uint64_t do_clock_r3 = r1r3c | r2r3c;

      uint64_t r1_fb = (r1<18>(regs) ^ r1<17>(regs) ^ r1<16>(regs) ^ r1<13>(regs)) & do_clock_r1;
      uint64_t r2_fb = (r2<21>(regs) ^ r2<20>(regs)) & do_clock_r2;
      uint64_t r3_fb = (r3<22>(regs) ^ r3<21>(regs) ^ r3<20>(regs) ^ r3<7>(regs)) & do_clock_r3;

      lsh_reg<R1_BASE, R1_SIZE, 0>(regs, do_clock_r1);
      lsh_reg<R2_BASE, R2_SIZE, 0>(regs, do_clock_r2);
      lsh_reg<R3_BASE, R3_SIZE, 0>(regs, do_clock_r3);
      
      r1<0>(regs) |= r1_fb;
      r2<0>(regs) |= r2_fb;
      r3<0>(regs) |= r3_fb;
    }
    DP = new_dp;

    uint64_t * tmp = regs;
    regs = result_ptr - 64;
    result_ptr = tmp;
  }
  regs = do_vconvert(regs,
		     result_ptr,
		     64, masks);
  for (int i = 0; i < 64; ++i) {
    in[i] = regs[i];
  }
}

int main(int argc, char ** argv) {
  int maxround = atoi(argv[1]);
  uint64_t in[64];

  for (int i = 1; i < 6; ++i) {
    masks[i] |= masks[i] << 32;
  }

  srand(time(NULL));

  for (uint64_t i = 0; i < 64; i++) {
    //in[i] = 0x1212121212121212;
    //in[i] = 0;
    in[i] = rand() | uint64_t(rand()) << 32;
  }

  uint64_t savein[64];
  std::copy(in, in + 64, savein);

  crunch(in, masks, maxround);

  for (int i = 0; i < 64; ++i) {
    std::cerr << hexout << savein[i] << " " << hexout << in[i] << "\n";
  }
}

_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51

Re: [A51] CPU implementation (more code)

Reply via email to