Yes, the code generation is pretty bad: at every step you're writing out the
full expressions that compute that step's inputs all over again, so the
generated expression grows exponentially with the number of coefficients.
Instead, generate code that computes each subexpression once, assigns it to a
variable, and then uses that variable at every later place where the
subexpression occurs. The compiler is essentially doing that for you now, by
detecting common subexpressions.
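
For concreteness, here is an untested sketch of that approach (the name
@clenshaw2 and the generated b variables are mine, purely illustrative): it
emits one assignment per Clenshaw step, so the expansion grows linearly with
the number of coefficients instead of blowing up.

macro clenshaw2(x, c...)
    N = length(c)
    # one generated name per recurrence value, so each b_k is computed exactly once
    b = [gensym("b") for i = 1:N+2]
    blk = Expr(:block,
               :(t = 2 * $(esc(x))),
               :($(b[N+2]) = zero(t)),   # b_{N+2} = 0
               :($(b[N+1]) = zero(t)))   # b_{N+1} = 0
    # b_k = c_k - b_{k+2} + t*b_{k+1}, one assignment per coefficient
    for k = N:-1:2
        push!(blk.args, :($(b[k]) = muladd(t, $(b[k+1]), $(esc(c[k])) - $(b[k+2]))))
    end
    # value of the block: c_1 - b_3 + (t/2)*b_2
    push!(blk.args, :(muladd(t/2, $(b[2]), $(esc(c[1])) - $(b[3]))))
    blk
end

The expansion for N coefficients is then N+3 statements rather than one nested
expression, so compile time should stay roughly flat as coefficients are added.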

On Thu, Jan 14, 2016 at 10:11 AM, <[email protected]> wrote:

> I agree it works, but Julia crashes/can't compile with 25 coefficients:
>
>
> @clenshaw(x,1.0,1/2,1/3,1/4,1/5,1/6,1/7,1/8,1/9,1/10,1/11,1/12,1/13,1/14,1/15,1/16,1/17,1/18,1/19,1/20,1/21,1/22,1/23,1/24,1/25)
>
> The compilation time is also exorbitant for 24 coefficients:
>
> julia> g(x) =
> @clenshaw(x,1.0,1/2,1/3,1/4,1/5,1/6,1/7,1/8,1/9,1/10,1/11,1/12,1/13,1/14,1/15,1/16,1/17,1/18,1/19,1/20,1/21,1/22,1/23,1/24)
>
> g (generic function with 1 method)
>
>
> julia> @time g(1.0)
>
>  39.776082 seconds (131.13 M allocations: 3.984 GB, 7.63% gc time)
>
> 3.775958177753509
>
>
> julia> @time g(1.0)
>
>   0.000001 seconds (5 allocations: 176 bytes)
>
> 3.775958177753509
>
>
>
> On Thursday, January 14, 2016 at 3:06:15 PM UTC, Stefan Karpinski wrote:
>
>> I get this LLVM code:
>>
>> julia> @code_llvm g(1.0)
>>
>> define double @julia_g_23965(double) {
>> top:
>>   %1 = fmul double %0, 2.000000e+00
>>   %2 = fmul double %1, 5.000000e-01
>>   %3 = fmul fast double %0, 1.000000e-01
>>   %4 = fadd fast double %3, 0x3FAAF286BCA1AF28
>>   %5 = fmul fast double %1, %4
>>   %6 = fadd fast double %5, 0x3F76C16C16C16C10
>>   %7 = fsub double 0x3FAE1E1E1E1E1E1E, %4
>>   %8 = fmul fast double %1, %6
>>   %9 = fadd fast double %7, %8
>>   %10 = fsub double 6.250000e-02, %6
>>   %11 = fmul fast double %1, %9
>>   %12 = fadd fast double %10, %11
>>   %13 = fsub double 0x3FB1111111111111, %9
>>   %14 = fmul fast double %1, %12
>>   %15 = fadd fast double %13, %14
>>   %16 = fsub double 0x3FB2492492492492, %12
>>   %17 = fmul fast double %1, %15
>>   %18 = fadd fast double %16, %17
>>   %19 = fsub double 0x3FB3B13B13B13B14, %15
>>   %20 = fmul fast double %1, %18
>>   %21 = fadd fast double %19, %20
>>   %22 = fsub double 0x3FB5555555555555, %18
>>   %23 = fmul fast double %1, %21
>>   %24 = fadd fast double %22, %23
>>   %25 = fsub double 0x3FB745D1745D1746, %21
>>   %26 = fmul fast double %1, %24
>>   %27 = fadd fast double %25, %26
>>   %28 = fsub double 1.000000e-01, %24
>>   %29 = fmul fast double %1, %27
>>   %30 = fadd fast double %28, %29
>>   %31 = fsub double 0x3FBC71C71C71C71C, %27
>>   %32 = fmul fast double %1, %30
>>   %33 = fadd fast double %31, %32
>>   %34 = fsub double 1.250000e-01, %30
>>   %35 = fmul fast double %1, %33
>>   %36 = fadd fast double %34, %35
>>   %37 = fsub double 0x3FC2492492492492, %33
>>   %38 = fmul fast double %1, %36
>>   %39 = fadd fast double %37, %38
>>   %40 = fsub double 0x3FC5555555555555, %36
>>   %41 = fmul fast double %1, %39
>>   %42 = fadd fast double %40, %41
>>   %43 = fsub double 2.000000e-01, %39
>>   %44 = fmul fast double %1, %42
>>   %45 = fadd fast double %43, %44
>>   %46 = fsub double 2.500000e-01, %42
>>   %47 = fmul fast double %1, %45
>>   %48 = fadd fast double %46, %47
>>   %49 = fsub double 0x3FD5555555555555, %45
>>   %50 = fmul fast double %1, %48
>>   %51 = fadd fast double %49, %50
>>   %52 = fsub double 5.000000e-01, %48
>>   %53 = fmul fast double %1, %51
>>   %54 = fadd fast double %52, %53
>>   %55 = fsub double 1.000000e+00, %51
>>   %56 = fmul fast double %2, %54
>>   %57 = fadd fast double %55, %56
>>   ret double %57
>> }
>>
>>
>> Is that not what you were hoping for?
>>
>> On Thu, Jan 14, 2016 at 8:09 AM, <[email protected]> wrote:
>>
>>> This macro:
>>>
>>> macro clenshaw(x, c...)
>>>     bk1,bk2 = :(zero(t)),:(zero(t))
>>>     N = length(c)
>>>     for k = N:-1:2
>>>         bk2, bk1 = bk1, :(muladd(t,$bk1,$(esc(c[k]))-$bk2))
>>>     end
>>>     ex = :(muladd(t/2,$bk1,$(esc(c[1]))-$bk2))
>>>     Expr(:block, :(t = $(esc(2))*$(esc(x))), ex)
>>> end
>>>
>>> implements Clenshaw's algorithm to sum Chebyshev series. It successfully
>>> "unrolls" the loop, but is impractical for more than 24 coefficients. The
>>> resulting LLVM code is theoretically only 50% longer than unrolling
>>> Horner's rule:
>>>
>>> f(x) =
>>> @evalpoly(x,1.0,1/2,1/3,1/4,1/5,1/6,1/7,1/8,1/9,1/10,1/11,1/12,1/13,1/14,1/15,1/16,1/17,1/18,1/19,1/20)
>>>
>>> @code_llvm f(1.0)
>>>
>>> g(x) =
>>> @clenshaw(x,1.0,1/2,1/3,1/4,1/5,1/6,1/7,1/8,1/9,1/10,1/11,1/12,1/13,1/14,1/15,1/16,1/17,1/18,1/19,1/20)
>>>
>>> @code_llvm g(1.0)
>>>
>>> How could I write the macro differently? How else could I end up with
>>> the same efficient LLVM code? Using a staged function? [a sketch of that
>>> approach appears after the quoted thread below]
>>>
>>
>>
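
As for the staged-function question quoted above: the same linear code
generation can be done with an @generated function (what Julia 0.4 calls
staged functions), taking the coefficients as a tuple so their number is known
from the type. This is an untested sketch with illustrative names, written in
the {N,T} method syntax of the 0.4 era:

@generated function clenshaw_gen{N,T}(x, c::NTuple{N,T})
    b = [gensym("b") for i = 1:N+2]
    blk = Expr(:block,
               :(t = 2 * x),
               :($(b[N+2]) = zero(t)),
               :($(b[N+1]) = zero(t)))
    # same recurrence as the macro, but the coefficients are read from the
    # tuple at run time: b_k = c[k] - b_{k+2} + t*b_{k+1}
    for k = N:-1:2
        push!(blk.args, :($(b[k]) = muladd(t, $(b[k+1]), c[$k] - $(b[k+2]))))
    end
    push!(blk.args, :(muladd(t/2, $(b[2]), c[1] - $(b[3]))))
    blk
end

clenshaw_gen(1.0, (1.0, 1/2, 1/3, 1/4))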
