all i'm doing is copying libz.  they seem to have hit a sweet spot with that 
particular approach; i can't beat it (i can't even work out when unrolling 
loops helps and when it doesn't without trying it).  no idea why julia is 
~10% faster (although i admit i haven't tried looking).

andrew

On Sunday, 20 April 2014 17:13:52 UTC-3, Stefan Karpinski wrote:
>
> That's really impressive that you managed to get it that fast.
>
>
> On Sun, Apr 20, 2014 at 1:35 PM, andrew cooke <[email protected]> wrote:
>
>>
>> Just for the record - multiple tables and unrolling in Julia now beats C 
>> (very slightly).
>>
>> Tim's @nexprs macro generally helps with the unrolling (although I seem 
>> to have hit a bug, or a misunderstanding on my part, in one particular 
>> case, so I'm having to copy and paste in one place).
>>
>> Thanks,
>> Andrew
>>
>>
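The `@nexprs` macro mentioned above generates N copies of an expression: `@nexprs N k -> expr` substitutes the literals 1 through N for `k`. In current Julia it is available from `Base.Cartesian`. Below is a minimal sketch of using it to unroll a bytewise CRC-32 loop four ways. It is not Andrew's code. It assumes the reflected CRC-32 polynomial `0xEDB88320` with zlib-style pre- and post-inversion, and the function names are hypothetical. The four generated statements still depend on one another through `r`, so this kind of unrolling mainly trims loop overhead. It does not shorten the dependency chain, which is what the multiple-table approach discussed further down targets.

```julia
using Base.Cartesian: @nexprs

# Hypothetical helper: standard 256-entry table for reflected CRC-32
# (polynomial 0xEDB88320 assumed).
function make_crc_table()
    table = Vector{UInt32}(undef, 256)
    for n in 0:255
        c = UInt32(n)
        for _ in 1:8
            # shift right one bit; fold in the polynomial if the low bit was set
            c = isodd(c) ? (0xedb88320 ⊻ (c >> 1)) : (c >> 1)
        end
        table[n + 1] = c          # Julia arrays are 1-based
    end
    return table
end

# Hypothetical bytewise CRC-32 with the main loop unrolled four times.
function crc32_unrolled4(data::AbstractVector{UInt8}, table::Vector{UInt32})
    r = 0xffffffff                # pre-inversion, as in zlib
    n = length(data)
    i = 0
    while i + 4 <= n
        # macro expansion turns this into four statements with k = 1, 2, 3, 4;
        # each one still reads the r written by the previous one
        @nexprs 4 k -> (r = (r >> 8) ⊻ table[1 + (data[i + k] ⊻ (r % UInt8))])
        i += 4
    end
    while i < n                   # remaining 0-3 bytes, one at a time
        i += 1
        r = (r >> 8) ⊻ table[1 + (data[i] ⊻ (r % UInt8))]
    end
    return ~r                     # post-inversion
end
```

For example, `crc32_unrolled4(codeunits("123456789"), make_crc_table())` should return the standard CRC-32 check value `0xcbf43926`.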
>> On Thursday, 10 April 2014 19:52:03 UTC-3, andrew cooke wrote:
>>>
>>>
>>> huh.  i had forgotten about this.
>>>
>>> i'll try four tables.  it shouldn't be that hard to add (although 
>>> there's going to be extra book-keeping - it's not an obvious gain to me).
>>>
>>> cheers,
>>> andrew
>>>
>>> On Thursday, 10 April 2014 19:08:21 UTC-3, Chris Foster wrote:
>>>>
>>>> On Fri, Apr 11, 2014 at 6:44 AM, Laszlo Hars <[email protected]> wrote: 
>>>> > note that the running time does not change with a partial loop unroll, 
>>>> > like this: 
>>>> > ~~~ 
>>>> > function signed_loop{D<:Unsigned, A<:Unsigned}(::Type{D}, r::A, data, table::Vector{A}) 
>>>> >     local j = 0 
>>>> >      for i = 1 : div(length(data),20) 
>>>> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))] 
>>>> >         [...] 
>>>> >         r = (r >>> 8) $ table[1 + (data[j+=1]$convert(D,r))] 
>>>> >     end 
>>>> >     return r 
>>>> > end 
>>>> > ~~~ 
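For readers on a current Julia release, here is a minimal, runnable sketch of a one-byte-per-step table loop like the one quoted above, together with the 256-entry table it indexes. It assumes the reflected CRC-32 polynomial `0xEDB88320` and zlib-style pre- and post-inversion. The function names are hypothetical, and the syntax is Julia 1.x (`⊻` in place of the old `$` xor, `UInt32` in place of `Uint32`, `r % UInt8` to take the low byte). Each update needs the `r` from the previous byte before it can index the table. That serial chain is consistent with a partial unroll leaving the running time unchanged.

```julia
# Hypothetical helper: standard 256-entry table for reflected CRC-32
# (polynomial 0xEDB88320 assumed).
function make_crc_table()
    table = Vector{UInt32}(undef, 256)
    for n in 0:255
        c = UInt32(n)
        for _ in 1:8
            # shift right one bit; fold in the polynomial if the low bit was set
            c = isodd(c) ? (0xedb88320 ⊻ (c >> 1)) : (c >> 1)
        end
        table[n + 1] = c          # Julia arrays are 1-based
    end
    return table
end

# Hypothetical bytewise CRC-32: one table lookup per byte. Every step reads
# the r produced by the previous step, so the lookups form a serial chain.
function crc32_bytewise(data::AbstractVector{UInt8}, table::Vector{UInt32})
    r = 0xffffffff                # pre-inversion, as in zlib
    for b in data
        # r % UInt8 keeps the low byte of r (truncating conversion)
        r = (r >> 8) ⊻ table[1 + (b ⊻ (r % UInt8))]
    end
    return ~r                     # post-inversion
end

table = make_crc_table()
println(string(crc32_bytewise(codeunits("123456789"), table), base = 16))  # prints cbf43926
```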
>>>>
>>>> In that case, it's probably because zlib is processing the bytes four 
>>>> at a time, using four different CRC tables.  This is quite distinct 
>>>> from the loop unrolling, and can have a larger effect because it 
>>>> removes some of the data dependency between iterations.  It looks 
>>>> something like this (very untested!  I didn't have time to figure out 
>>>> how to make the four different CRC tables.) 
>>>>
>>>> data4 = reinterpret(Uint32, data)  # host byte order (table order below assumes little-endian); trailing bytes need special cases 
>>>> for i = 1:length(data4) 
>>>>     word::Uint32 = data4[i] 
>>>>     r = r $ word 
>>>>     r = table3[1 + (r & 0xff)] $ table2[1 + ((r >> 8) & 0xff)] $ 
>>>>         table1[1 + ((r >> 16) & 0xff)] $ table0[1 + (r >> 24)] 
>>>> end 
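The fragment above leaves the four tables unbuilt. Below is a minimal sketch of one way to fill that in, modeled on the table construction in zlib's `crc32.c`: `table0` is the ordinary bytewise table, and each further table pushes the previous one through one more zero byte. It uses the same reflected polynomial `0xEDB88320` and little-endian word layout. Instead of `reinterpret`, it assembles each 32-bit word from bytes explicitly, so it does not depend on host byte order. It falls back to the bytewise update for the 0 to 3 trailing bytes. Names are hypothetical and the syntax is Julia 1.x (`⊻` for xor).

```julia
# Hypothetical: build table0..table3 for reflected CRC-32 (poly 0xEDB88320).
function make_crc_tables4()
    t0 = Vector{UInt32}(undef, 256)
    for n in 0:255
        c = UInt32(n)
        for _ in 1:8
            c = isodd(c) ? (0xedb88320 ⊻ (c >> 1)) : (c >> 1)
        end
        t0[n + 1] = c
    end
    # each table advances the previous one by one zero byte
    t1 = [(c >> 8) ⊻ t0[1 + (c & 0xff)] for c in t0]
    t2 = [(c >> 8) ⊻ t0[1 + (c & 0xff)] for c in t1]
    t3 = [(c >> 8) ⊻ t0[1 + (c & 0xff)] for c in t2]
    return (t0, t1, t2, t3)
end

# Hypothetical slicing-by-4 CRC-32: four bytes per iteration.
function crc32_slice4(data::AbstractVector{UInt8}, tables)
    t0, t1, t2, t3 = tables
    r = 0xffffffff                # pre-inversion, as in zlib
    i, n = firstindex(data), lastindex(data)
    while i + 3 <= n
        # little-endian 32-bit word built from bytes (host-order independent)
        w = UInt32(data[i]) | (UInt32(data[i + 1]) << 8) |
            (UInt32(data[i + 2]) << 16) | (UInt32(data[i + 3]) << 24)
        r = r ⊻ w
        # the four lookups depend only on this r, not on each other
        r = t3[1 + (r & 0xff)] ⊻ t2[1 + ((r >> 8) & 0xff)] ⊻
            t1[1 + ((r >> 16) & 0xff)] ⊻ t0[1 + (r >> 24)]
        i += 4
    end
    while i <= n                  # trailing 0-3 bytes: ordinary bytewise update
        r = (r >> 8) ⊻ t0[1 + (data[i] ⊻ (r % UInt8))]
        i += 1
    end
    return ~r                     # post-inversion
end
```

For example, `crc32_slice4(codeunits("123456789"), make_crc_tables4())` should give the standard check value `0xcbf43926`. Within one iteration the four table reads are independent of each other. Only the combined result feeds the next word, which is the reduction in data dependency described above.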
>>>>
>>>
>
