I’m not sure there’s any single correct way to do benchmarks without 
information about what you’re trying to optimize.

If you’re trying to optimize the experience of people using your code, I think 
it’s important to use means rather than medians, because you want a metric 
that’s affected by the entire shape of the distribution of times, not one 
that’s entirely determined by the "center" of that distribution.
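
A toy illustration with made-up numbers of how the two metrics diverge when an 
occasional gc pause lands in the sample (mean and median live in the 
Statistics stdlib as of Julia 0.7; earlier versions had them in Base):

    using Statistics

    # Hypothetical timing samples in seconds: most runs are fast, but one
    # run hits a gc pause.
    times = [0.010, 0.011, 0.010, 0.012, 0.010, 0.095, 0.011, 0.010]

    median(times)  # ≈ 0.0105: ignores the slow run entirely
    mean(times)    # ≈ 0.0211: reflects the pause that users actually feel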

If you want a theoretically pure measurement for an algorithm, I think 
measuring time is kind of problematic. For algorithms, I’d prefer seeing a 
count of CPU instructions.
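
On Linux, one rough way to get such a count is to run the whole process under 
perf, assuming the perf tool is installed (the script name here is a 
placeholder):

    # Count retired CPU instructions for an entire Julia process.
    run(`perf stat -e instructions julia my_benchmark.jl`)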

 — John

On Jun 2, 2014, at 7:32 PM, Kevin Squire <[email protected]> wrote:

> I think that, for many algorithms, triggering the gc() is simply a matter of 
> running a simulation for enough iterations. By calling gc() ahead of time, 
> you should be able to get the same number (n > 0) of gc calls, which isn't 
> ignoring gc().  
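> 
> A minimal sketch of that pattern (in Julia 1.0 and later the function is 
> spelled GC.gc(); in the 0.x versions it was plain gc(); the workload here is 
> a made-up stand-in):
> 
>     # Dummy workload that allocates enough to trigger collection.
>     work(n) = sum(rand(1000) for _ in 1:n)
> 
>     GC.gc()                    # start the measurement from a clean heap
>     t = @elapsed work(10_000)  # gc still runs, but a comparable number of
>                                # times from one trial to the next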
> 
> That said, it can take some effort to figure out the number of iterations and 
> time to run the experiment. 
> 
> Cheers, Kevin 
> 
> On Monday, June 2, 2014, Stefan Karpinski <[email protected]> wrote:
> I feel that ignoring gc can be a bit of a cheat since it does happen and it's 
> quite expensive – and other systems may be better or worse at it. Of course, 
> it can still be good to explicitly separate the causes of slowness into 
> execution time and overhead from things like gc.
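> 
> In recent Julia (1.5 and later), the @timed macro reports that split 
> directly, for example:
> 
>     stats = @timed sum(rand(10^7))
>     stats.time    # total elapsed seconds
>     stats.gctime  # the portion of that spent in gc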
> 
> 
> On Mon, Jun 2, 2014 at 5:21 PM, Kevin Squire <[email protected]> wrote:
> Thanks John.  His argument definitely makes sense (that algorithms that cause 
> more garbage collection won't get penalized by median, unless, of course, 
> they cause gc() to occur more than 50% of the time).  
> 
> Most benchmarks of Julia code that I've done (or seen) have made some attempt 
> to take gc() differences out of the equation, usually by explicitly calling 
> gc() before the timing begins.  For most algorithms, that would mean that the 
> same number of gc() calls should occur for each repetition, in which case, I 
> would think that any measure of central tendency (including mean and median) 
> would be useful.  
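> 
> A sketch of that repetition pattern (again with GC.gc() as the current 
> spelling and a made-up workload):
> 
>     using Statistics
> 
>     work(n) = sum(rand(1000) for _ in 1:n)
> 
>     times = Float64[]
>     for rep in 1:100
>         GC.gc()                      # equalize gc state across repetitions
>         push!(times, @elapsed work(1_000))
>     end
>     mean(times), median(times)       # should now roughly agree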
> 
> Is there a problem with this reasoning?
> 
> Cheers,
>    Kevin
> 
> 
> On Mon, Jun 2, 2014 at 1:04 PM, John Myles White <[email protected]> 
> wrote:
> For some reasons why one might not want to use the median, see 
> http://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
> 
>  -- John
> 
> On Jun 2, 2014, at 11:06 AM, Kevin Squire <[email protected]> wrote:
> 
>> median is probably also useful. I like it a little better in cases where the 
>> code being tested triggers gc() more than half the time. 
>> 
>> On Monday, June 2, 2014, Steven G. Johnson <[email protected]> wrote:
>> 
>> 
>> On Monday, June 2, 2014 1:01:25 AM UTC-4, Jameson wrote: 
>> Therefore, for benchmarks, you should execute your code in a loop enough 
>> times that the measurement error (of the hardware and OS) is not too 
>> significant.
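>> 
>> A sketch of that amortization, with a placeholder workload inside the loop:
>> 
>>     # Time many iterations at once so timer granularity and OS jitter
>>     # are spread over the whole loop.
>>     function bench(n)
>>         s = 0.0
>>         for i in 1:n
>>             s += hypot(Float64(i), 4.0)  # stand-in for the code under test
>>         end
>>         return s  # returning s keeps the work from being optimized away
>>     end
>> 
>>     t = @elapsed bench(10^7)
>>     t / 10^7      # per-iteration estimate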
>> 
>> 
>> You can also often benchmark multiple times and take the minimum (not the 
>> mean!) time for reasonable results with fairly small time intervals.
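>> 
>> A minimal version of that:
>> 
>>     # Repeat the whole measurement and keep the fastest run, which is
>>     # the one least disturbed by OS and timer noise.
>>     samples = [@elapsed(sum(rand(10^6))) for _ in 1:25]
>>     minimum(samples)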
> 
> 
> 
