These are definitely not stupid questions. Tuning performance at this
level is very hard. I'll just add a few practical notes on
micro-benchmarking and introspection. Inspecting and benchmarking
splatted/slurped functions can be very tricky. One thing to remember is
that `@code_llvm` displays the code for the method returned by `@which`.
This means that if you try to look at the generated code for a slurped
function definition, you won't see the fully specialized case. Let's
modify Stefan's example (quoted below) slightly:
julia> g2(a, bs...) = +(a, bs...)
g2 (generic function with 1 method)
julia> @code_llvm g2(1,2,3)
define %jl_value_t* @julia_g2_22749(%jl_value_t*, %jl_value_t**, i32) {
# … long, with a GC frame and tuple unpacking
This is simply because we're *not* looking at the code for three-argument
`g2`! We're looking at the general variadic case where the number of
arguments is unknown:
julia> @which g2(1,2,3)
g2(a, bs...) at none:1
If, however, we put this into an outer function with a set number of
arguments, we can force Julia to show us the code as it would compile it
when the number of arguments is known:
julia> h(a,b,c) = g2(a,b,c)
h (generic function with 1 method)
julia> @code_llvm h(1,2,3)
define i64 @julia_h_22752(i64, i64, i64) {
top:
%3 = add i64 %1, %0
%4 = add i64 %3, %2
ret i64 %4
}
Finally, benchmarking such tiny things with `@time` can be tricky, which is
why looking at the generated LLVM is very helpful. A few of us are working
on a benchmarking package, https://github.com/johnmyleswhite/Benchmarks.jl,
that should give much more stable and robust results (with confidence
intervals). I'm delinquent on some
documentation on how the `@benchmark` macro works, but here's a quick
rundown: It only works on function calls, and, unlike `@time`, it evaluates
all the arguments first as a setup step. It then benchmarks the function by
calling it repeatedly with the same arguments until it has a good estimate
of the performance, taking care to ensure that LLVM doesn't optimize the
benchmarking loop away or do unwanted constant folding.
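As a quick sketch of the interface (the exact printed output depends on
the package version):

julia> using Benchmarks

julia> v = rand(1000);

julia> @benchmark sum(v)
# runs sum(v) repeatedly with the same argument and reports an
# estimated time per evaluation with a confidence interval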
Matt
On Wednesday, November 11, 2015 at 9:40:28 AM UTC-5, Stefan Karpinski wrote:
>
> On Wed, Nov 11, 2015 at 1:29 AM, 'Greg Plowman' via julia-users <
> [email protected]> wrote:
>
>> I have some naïve and probably stupid questions about writing efficient
>> code in Julia.
>> I almost didn't post this because I'm not sure if this is the right place
>> for asking for this type of conceptual help. I do hope it is appropriate. I
>> apologise in advance if it's not.
>>
>
> This is exactly the right place.
>
> *Splatting/Slurping*
>> I have read that splatting/slurping incurs a penalty.
>> Is it the splatting or the slurping or both?
>> I have also read that this is not the case if the function is inlined. Is
>> this true?
>>
>
> Splatting/slurping is generally free for small, predictably-sized tuples.
> For example, f(a, (b,c)...) will be exactly equivalent to writing out
> f(a, b, c):
>
> julia> g(a, b, c) = +(a, (b, c)...)
> g (generic function with 1 method)
>
> julia> g(1, 2, 3)
> 6
>
> julia> @code_llvm g(1, 2, 3)
>
> define i64 @julia_g_22684(i64, i64, i64) #0 {
> top:
> %3 = add i64 %1, %0
> %4 = add i64 %3, %2
> ret i64 %4
> }
>
>
> The inlining is irrelevant here: if you try this where g calls a function
> that doesn't get inlined, it will produce the same code whether you do the
> splatting or change the args to (a, b, c). This is an excessively simple
> example since it's very easy to see that +(a, (b, c)...) is equivalent to
> +(a, b, c), but since we specialize function bodies on argument types,
> which often includes the size and component types of tuple arguments, what
> type inference is actually reasoning about is often just this simple.
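>
> As a small sketch of that specialization (addall is a hypothetical
> helper; the exact LLVM names will differ on your machine):
>
> julia> addall(t) = +(t...)
> addall (generic function with 1 method)
>
> julia> @code_llvm addall((1, 2, 3))
> # the body is compiled specifically for Tuple{Int64,Int64,Int64}, so
> # this should boil down to the same pair of i64 adds as +(a, b, c)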
>
> What you really want to avoid is splatting dynamically sized
> collections as the arguments. For example, you can sum the values in a
> vector like this:
>
> julia> v = rand(1000);
>
> julia> @time +(v...)
> 0.077058 seconds (39.84 k allocations: 1.930 MB)
> 509.51187339334575
>
> julia> @time +(v...)
> 0.000131 seconds (2.01 k allocations: 71.016 KB)
> 509.51187339334575
>
>
> But this is really bad. Why? Because Julia dispatches function calls on
> all of a function's arguments. In this case there are 1000 arguments, all
> of which, in principle, are being used for dispatch. Not good. So if you
> ever find yourself splatting a variably sized collection into a function's
> arguments, stop and look for a reducer that does the same job. In
> this case, it's the sum function:
>
> julia> @time sum(v)
> 0.000005 seconds (5 allocations: 176 bytes)
> 509.5118733933458
>
>
> As a bonus, our sum function is also more accurate since it uses a better
> summation algorithm than the naive left-to-right accumulation approach.
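>
> A rough way to see the accuracy difference (the exact rounding errors
> depend on the values, so treat the comments as indicative):
>
> julia> w = fill(0.1, 10^7);
>
> julia> abs(sum(w) - 1e6)        # pairwise summation: tiny error
>
> julia> abs(foldl(+, w) - 1e6)   # naive left-to-right: typically much larger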
>
> *Inlining*
>> How do I know if a function is inlined?
>> When is it necessary to explicitly inline? (say with @inline, or
>> Expr(:meta, :inline))
>> Does this guarantee inlining, or is it just a hint to the compiler?
>> Is the compiler usually smart enough to inline optimally?
>> Why wouldn't I explicitly inline all my small functions?
>>
>
> You can tell if something was inlined by looking at the body of the
> calling method and seeing whether there are explicit method calls or not.
> For example, in the output of @code_llvm g(1, 2, 3) above, there are no
> calls to any + methods – the addition operations were fully inlined. If, on the
> other hand, you look at @code_llvm g(big(1), big(2), big(3)), you'll see
> that there are explicit calls to Julia methods with names
> like @"julia_+_22861".
>
> You generally don't want or need to mess with this. The compiler should be
> and generally is good at figuring out what should or shouldn't be inlined.
> However, there are times when a very small method really ought to
> always be inlined, in which case you can annotate the method definition
> with @inline. You can't always force inlining – because it's not always
> possible – but when it is, this skips the heuristic checks that normally
> happen and just does it. Beware, this can make things slower and can bloat
> generated code.
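>
> For instance (a sketch with a made-up accessor; a method this small
> would likely be inlined even without the annotation):
>
> julia> immutable Point
>            x::Int
>            y::Int
>        end
>
> julia> @inline getx(p::Point) = p.x
> getx (generic function with 1 method)
>
> julia> taxi(p::Point) = getx(p) + p.y
> taxi (generic function with 1 method)
>
> julia> @code_llvm taxi(Point(1, 2))
> # no explicit call to a julia_getx method should appear in the output
> # – the accessor has been inlined into taxi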
>
>
>> *Overhead of accessing tuples compared to separate variables/arguments*
>> Is there overhead in accessing tuples vs separate arguments?
>> Is the expression assigned to x slower than the expression assigned to y?
>> I = (1,2,3)
>> i1, i2, i3 = 1, 2, 3
>> x = I[1]+I[2]+I[3]
>> y = i1+i2+i3
>>
>
> In general no, there is no overhead. Caveats: the tuples must be
> reasonably sized (if they are thousands of elements long, this will not be
> fast), and you should avoid confusing type inference. Type
> inference specializes function bodies on arguments, but if it gets called
> with very long tuples or lots of different tuple types, it will bail out
> and just use generic code, which will be slower.
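>
> You can check this yourself (tsum is a hypothetical helper; expect the
> same handful of adds either way):
>
> julia> tsum(I) = I[1] + I[2] + I[3]
> tsum (generic function with 1 method)
>
> julia> @code_llvm tsum((1, 2, 3))
> # indexing a small tuple of known type compiles away entirely – this
> # should produce the same pair of i64 adds as the h(a,b,c) example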
>
>
>> *"Chained" function calls *(not sure if chained is the right word)It
>> seems sometimes I have lots of small "convenience" functions, effectively
>> chained until a final "working" function is called.
>> A calls B calls C calls D calls ... calls Z. If A, B, C, ... are small
>> (maybe getters, convenience functions etc), is this still efficient?
>> Is there any penalty for this?
>> How does this relate to inlining?
>> Presumably there is no penalty if and only if functions are inlined? Is
>> this true?
>> If so, is there a limit to the number of levels (functions in call chain)
>> that can be inlined?
>> Should I be worried about any of this?
>>
>
> Function chaining is cheap. If inlining occurs, it's free. Even if
> inlining doesn't occur, function calls are very cheap on modern CPUs.
> Calling functions can be cheaper than inlining, so one is not uniformly
> better than the other. Function calls are only expensive if dispatch is
> unpredictable and generic dispatch ends up happening. You can tell by
> looking at the LLVM code for calls to jl_apply_generic – if that's
> being called, that's generic dispatch. If that's happening, there may also
> be type instabilities and other badness.
>
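> A contrived sketch of what that looks like (hypothetical functions; the
> real output is long, so just scan it for jl_apply_generic):
>
> julia> unstable(x) = x > 0 ? 1 : 1.0    # return type depends on the value
> unstable (generic function with 1 method)
>
> julia> k(x) = unstable(x) + 1
> k (generic function with 1 method)
>
> julia> @code_llvm k(1)
> # the output should include a call to jl_apply_generic for the +, since
> # the compiler can't know whether unstable returned an Int or a Float64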