all i'm doing is copying libz. they seem to have hit a sweet spot with that particular approach - i can't beat it (i can't even work out when unrolling loops helps and when it doesn't without trying it). no idea why julia is ~10% faster (although i admit i haven't tried looking).

andrew

On Sunday, 20 April 2014 17:13:52 UTC-3, Stefan Karpinski wrote:
>
> That's really impressive that you managed to get it that fast.
>
>
> On Sun, Apr 20, 2014 at 1:35 PM, andrew cooke <[email protected]> wrote:
>
>>
>> Just for the record - multiple tables and unrolling in Julia now beats C
>> (very slightly).
>>
>> Tim's @nexprs macro generally helps with the unrolling (although I seem
>> to have hit a bug/misunderstanding in one particular case, so am having to
>> copy + paste in one place).
>>
>> Thanks,
>> Andrew
>>
>>
>> On Thursday, 10 April 2014 19:52:03 UTC-3, andrew cooke wrote:
>>>
>>>
>>> huh. i had forgotten about this.
>>>
>>> i'll try four tables. it shouldn't be that hard to add (although
>>> there's going to be extra book-keeping - it's not an obvious gain to me).
>>>
>>> cheers,
>>> andrew
>>>
>>> On Thursday, 10 April 2014 19:08:21 UTC-3, Chris Foster wrote:
>>>>
>>>> On Fri, Apr 11, 2014 at 6:44 AM, Laszlo Hars <[email protected]> wrote:
>>>> > note that the running time does not change with a partial loop unroll,
>>>> > like this:
>>>> > ~~~
>>>> > function signed_loop{D<:Unsigned, A<:Unsigned}(::Type{D}, r::A, data,
>>>> >                                                table::Vector{A})
>>>> >     local j = 0
>>>> >     for i = 1 : div(length(data),20)
>>>> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))]
>>>> [...]
>>>> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))]
>>>> >     end
>>>> >     return r
>>>> > end
>>>> > ~~~
>>>>
>>>> In that case, it's probably because zlib is processing the bytes four
>>>> at a time, using four different CRC tables. This is quite distinct
>>>> from the loop unrolling, and can have a larger effect because it
>>>> removes some of the data dependency between iterations. It looks
>>>> something like this (very untested! I didn't have time to figure out
>>>> how to make the four different CRC tables.)
>>>>
>>>>     data4 = reinterpret(Uint32, data)  # note, need special cases for trailing bytes
>>>>     for i = 1:length(data4)
>>>>         word::Uint32 = data4[i]
>>>>         r = r $ word
>>>>         r = table3[1 + (r & 0xff)] $ table2[1 + ((r >> 8) & 0xff)] $
>>>>             table1[1 + ((r >> 16) & 0xff)] $ table0[1 + (r >> 24)]
>>>>     end
>>>>
>>>
>
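
For reference, here is a rough sketch of the four-table idea from Chris's message, including one way the extra tables could be built. It is not the code from this thread or from zlib itself. It assumes the standard reflected CRC-32 polynomial, and it is written in current Julia syntax (`⊻` and `UInt32` rather than the `$` and `Uint32` used above). The names `make_tables`, `crc32_bytewise` and `crc32_by4` are made up for illustration. Each word is assembled from bytes by hand instead of using `reinterpret`, so the sketch does not depend on the machine's byte order.

```julia
# Reflected CRC-32 polynomial (assumed here; it is the common CRC-32 one).
const POLY = 0xedb88320

# Build n tables of 256 entries each. tables[1] is the usual byte-at-a-time
# table; tables[k][i] is tables[k-1][i] pushed through one more zero byte.
function make_tables(n::Int=4)
    tables = [zeros(UInt32, 256) for _ in 1:n]
    for i in 0:255
        c = UInt32(i)
        for _ in 1:8
            c = (c & 1) == 1 ? ((c >> 1) ⊻ POLY) : (c >> 1)
        end
        tables[1][i + 1] = c
    end
    for k in 2:n, i in 1:256
        prev = tables[k - 1][i]
        tables[k][i] = (prev >> 8) ⊻ tables[1][(prev & 0xff) + 1]
    end
    return tables
end

# Reference implementation: one table lookup per byte.
function crc32_bytewise(data::Vector{UInt8}, t)
    r = 0xffffffff
    for b in data
        r = (r >> 8) ⊻ t[1][((r ⊻ b) & 0xff) + 1]
    end
    return r ⊻ 0xffffffff
end

# Four bytes per iteration: the four lookups are independent of each other.
function crc32_by4(data::Vector{UInt8}, t)
    r = 0xffffffff
    nwords = length(data) ÷ 4
    for w in 0:(nwords - 1)
        j = 4w
        # Little-endian word built explicitly from four bytes.
        word = UInt32(data[j + 1]) | (UInt32(data[j + 2]) << 8) |
               (UInt32(data[j + 3]) << 16) | (UInt32(data[j + 4]) << 24)
        r ⊻= word
        # t[4] plays the role of table3 above, t[1] the role of table0.
        r = t[4][(r & 0xff) + 1] ⊻ t[3][((r >> 8) & 0xff) + 1] ⊻
            t[2][((r >> 16) & 0xff) + 1] ⊻ t[1][(r >> 24) + 1]
    end
    # Trailing bytes (fewer than four) fall back to the bytewise step.
    for j in (4nwords + 1):length(data)
        r = (r >> 8) ⊻ t[1][((r ⊻ data[j]) & 0xff) + 1]
    end
    return r ⊻ 0xffffffff
end

t = make_tables()
msg = Vector{UInt8}("123456789")
# 0xcbf43926 is the standard CRC-32 check value for "123456789".
@assert crc32_by4(msg, t) == crc32_bytewise(msg, t) == 0xcbf43926
```

The nine-byte test message exercises both the word loop (two full words) and the trailing-byte fallback. Whether this actually runs faster than the unrolled single-table loop is something to measure rather than assume.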
