huh. i had forgotten about this. i'll try four tables. it shouldn't be that hard to add (although there's going to be extra book-keeping - it's not an obvious gain to me).
cheers,
andrew

On Thursday, 10 April 2014 19:08:21 UTC-3, Chris Foster wrote:
> On Fri, Apr 11, 2014 at 6:44 AM, Laszlo Hars wrote:
> > note that the running time does not change with a partial loop unroll,
> > like this:
> > ~~~
> > function signed_loop{D<:Unsigned, A<:Unsigned}(::Type{D}, r::A, data,
> >                                                table::Vector{A})
> >     local j = 0
> >     for i = 1 : div(length(data),20)
> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))]
> [...]
> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))]
> >     end
> >     return r
> > end
> > ~~~
>
> In that case, it's probably because zlib is processing the bytes four
> at a time, using four different CRC tables. This is quite distinct
> from the loop unrolling, and can have a larger effect because it
> removes some of the data dependency between iterations. It looks
> something like this (very untested! I didn't have time to figure out
> how to make the four different CRC tables.)
>
> data4 = reinterpret(Uint32, data) # note, need special cases for trailing bytes
> for i = 1:length(data4)
>     word::Uint32 = data4[i]
>     r = r $ word
>     # each table lookup needs an 8-bit index, so mask with & 0xff
>     r = table3[1 + (r & 0xff)] $ table2[1 + ((r >> 8) & 0xff)] $
>         table1[1 + ((r >> 16) & 0xff)] $ table0[1 + (r >> 24)]
> end
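
For reference, here is a minimal sketch of how the four tables in the quoted snippet could be built, and how the trailing bytes might be handled. It assumes the standard reflected CRC-32 polynomial 0xEDB88320 (the one zlib uses) with the usual 0xffffffff initial value and final xor. The names `make_tables`, `crc32_bytewise` and `crc32_slice4` are made up for illustration and are not from the thread or from zlib. It is written in current Julia syntax (`⊻` for xor, `UInt32`) rather than the 2014 spelling quoted above, and it assembles each 32-bit word from bytes explicitly instead of using `reinterpret`, so it does not depend on host byte order.

```julia
# Sketch only: reflected CRC-32, polynomial 0xEDB88320 (assumed, as in zlib).
const POLY = 0xedb88320

# tables[k][b + 1] is the CRC contribution of byte b followed by k - 1 zero bytes.
# tables[1] is the ordinary byte-at-a-time table (table0 in the quoted snippet).
function make_tables(poly::UInt32 = POLY)
    tables = [zeros(UInt32, 256) for _ in 1:4]
    for b in 0:255
        r = UInt32(b)
        for _ in 1:8
            r = (r & 1) != 0 ? (r >>> 1) ⊻ poly : r >>> 1
        end
        tables[1][b + 1] = r
    end
    # Each further table advances the previous one by one more zero byte.
    for k in 2:4, b in 1:256
        prev = tables[k - 1][b]
        tables[k][b] = (prev >>> 8) ⊻ tables[1][1 + (prev & 0xff)]
    end
    return tables
end

# Plain one-byte-per-iteration loop, used as the reference result.
function crc32_bytewise(data::Vector{UInt8}, tables)
    t0 = tables[1]
    r = 0xffffffff
    for b in data
        r = (r >>> 8) ⊻ t0[1 + ((r ⊻ b) & 0xff)]
    end
    return r ⊻ 0xffffffff
end

# Four bytes per iteration, then a bytewise loop for the 0 to 3 trailing bytes.
function crc32_slice4(data::Vector{UInt8}, tables)
    t0, t1, t2, t3 = tables
    r = 0xffffffff
    nwords = length(data) ÷ 4
    for i in 1:4:(4 * nwords)
        # Little-endian word built by hand: first byte in the low 8 bits.
        w = UInt32(data[i]) | (UInt32(data[i + 1]) << 8) |
            (UInt32(data[i + 2]) << 16) | (UInt32(data[i + 3]) << 24)
        r ⊻= w
        r = t3[1 + (r & 0xff)] ⊻ t2[1 + ((r >>> 8) & 0xff)] ⊻
            t1[1 + ((r >>> 16) & 0xff)] ⊻ t0[1 + (r >>> 24)]
    end
    for i in (4 * nwords + 1):length(data)
        r = (r >>> 8) ⊻ t0[1 + ((r ⊻ data[i]) & 0xff)]
    end
    return r ⊻ 0xffffffff
end
```

A quick sanity check is to compare both functions on the ASCII string "123456789", whose standard CRC-32 check value is 0xcbf43926, and on inputs whose length is not a multiple of four. Whether the four-table version is actually faster than the unrolled loop in Julia would still need to be measured.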
