It's really impressive that you managed to get it that fast.

On Sun, Apr 20, 2014 at 1:35 PM, andrew cooke <[email protected]> wrote:
>
> Just for the record - multiple tables and unrolling in Julia now beats C
> (very slightly).
>
> Tim's @nexprs macro generally helps with the unrolling (although I seem to
> have hit a bug/misunderstanding in one particular case, so am having to
> copy + paste in one place).
>
> Thanks,
> Andrew
>
>
> On Thursday, 10 April 2014 19:52:03 UTC-3, andrew cooke wrote:
>>
>>
>> huh. i had forgotten about this.
>>
>> i'll try four tables. it shouldn't be that hard to add (although there's
>> going to be extra book-keeping - it's not an obvious gain to me).
>>
>> cheers,
>> andrew
>>
>> On Thursday, 10 April 2014 19:08:21 UTC-3, Chris Foster wrote:
>>>
>>> On Fri, Apr 11, 2014 at 6:44 AM, Laszlo Hars <[email protected]> wrote:
>>> > note that the running time does not change with a partial loop unroll,
>>> > like this:
>>> > ~~~
>>> > function signed_loop{D<:Unsigned, A<:Unsigned}(::Type{D}, r::A, data,
>>> >                                                table::Vector{A})
>>> >     local j = 0
>>> >     for i = 1 : div(length(data),20)
>>> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))]
>>> [...]
>>> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))]
>>> >     end
>>> >     return r
>>> > end
>>> > ~~~
>>>
>>> In that case, it's probably because zlib is processing the bytes four
>>> at a time, using four different CRC tables. This is quite distinct
>>> from the loop unrolling, and can have a larger effect because it
>>> removes some of the data dependency between iterations. It looks
>>> something like this (very untested! I didn't have time to figure out
>>> how to make the four different CRC tables.)
>>>
>>> # note, need special cases for trailing bytes
>>> data4 = reinterpret(Uint32, data)
>>> for i = 1:length(data4)
>>>     word::Uint32 = data4[i]
>>>     r = r $ word
>>>     # each byte of r is masked with & 0xff before indexing its table
>>>     r = table3[1 + (r & 0xff)] $ table2[1 + ((r >> 8) & 0xff)] $
>>>         table1[1 + ((r >> 16) & 0xff)] $ table0[1 + (r >> 24)]
>>> end
>>>
>>
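
The four-table idea quoted above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the code anyone in the thread actually ran: the names make_tables, crc32_bytewise and crc32_slice4 are made up for illustration, the polynomial is assumed to be the reflected CRC-32 one that zlib uses (0xedb88320), and it is written in current Julia syntax (UInt32 and ⊻) rather than the Uint32 and $ of the quoted code. Each 32-bit word is assembled from bytes explicitly instead of via reinterpret, so the sketch does not depend on machine byte order, and trailing bytes fall back to the ordinary one-table loop.

```julia
# Build four lookup tables for a reflected CRC-32.
# tables[1] is the ordinary byte-at-a-time table; tables[k] is the CRC
# contribution of a byte that still has k-1 further bytes following it.
function make_tables(poly::UInt32 = 0xedb88320)  # assumed: zlib's polynomial
    tables = [zeros(UInt32, 256) for _ in 1:4]
    for n in 0:255
        c = UInt32(n)
        for _ in 1:8
            c = (c & 1) != 0 ? ((c >> 1) ⊻ poly) : (c >> 1)
        end
        tables[1][n + 1] = c
    end
    for k in 2:4, n in 1:256
        prev = tables[k - 1][n]
        tables[k][n] = (prev >> 8) ⊻ tables[1][1 + (prev & 0xff)]
    end
    return tables
end

# Reference implementation: one table, one byte per iteration.
function crc32_bytewise(data::AbstractVector{UInt8}, tables)
    t1 = tables[1]
    r = 0xffffffff
    for b in data
        r = (r >> 8) ⊻ t1[1 + ((r ⊻ b) & 0xff)]
    end
    return r ⊻ 0xffffffff
end

# Four bytes per iteration, one table lookup per byte of the word.
function crc32_slice4(data::AbstractVector{UInt8}, tables)
    t1, t2, t3, t4 = tables
    r = 0xffffffff
    nwords = length(data) ÷ 4
    for i in 0:(nwords - 1)
        j = 4 * i
        # Assemble a little-endian word from the next four bytes.
        word = UInt32(data[j + 1]) | (UInt32(data[j + 2]) << 8) |
               (UInt32(data[j + 3]) << 16) | (UInt32(data[j + 4]) << 24)
        r = r ⊻ word
        # The four lookups are independent of each other.
        r = t4[1 + (r & 0xff)] ⊻ t3[1 + ((r >> 8) & 0xff)] ⊻
            t2[1 + ((r >> 16) & 0xff)] ⊻ t1[1 + (r >> 24)]
    end
    # Trailing bytes (fewer than four) use the ordinary loop.
    for k in (4 * nwords + 1):length(data)
        r = (r >> 8) ⊻ t1[1 + ((r ⊻ data[k]) & 0xff)]
    end
    return r ⊻ 0xffffffff
end

tables = make_tables()
msg = Vector{UInt8}("123456789")
@assert crc32_slice4(msg, tables) == crc32_bytewise(msg, tables)
```

The message "123456789" is a convenient test input because its standard CRC-32 check value, 0xcbf43926, is widely published, and at nine bytes it exercises both the word loop and the trailing-byte loop. Whether the four-table version is actually faster will depend on the machine and the Julia version, so it is worth timing both functions before relying on either.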
