Ignoring pipelining and modern CPU trickery, the CPU instruction count is a
potentially low-variance lower bound on the speed of an operation. That lower
bound property is all I meant by purity. In addition, because instruction
counts often have lower variance than other metrics, they're potentially
useful for comparing two proposed implementations/algorithms. Other metrics,
like mean wall clock time, may have such high variance that you can't collect
enough measurements to conclude that a change had any impact.
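
One way to actually get instruction counts (on Linux; I haven't tried this
elsewhere) is perf. As a sketch, with myscript.jl standing in for whatever
you want to measure:

    perf stat -e instructions julia myscript.jl

That counts the whole process, Julia startup included, so either run the
function many times inside the script or subtract a baseline run.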

Personally, I think you can never do better than an estimate of the full
distribution of wall clock timings. But you almost never collect enough data
to even come close to having a proper estimate of that distribution. In that
regard, I totally agree that there is no perfect measurement. But you do need
to make decisions. Lower bounds from CPU instructions are nice for doing
that, although you want to make sure your wall clock timings aren't doing
something nuts that the change in CPU instructions doesn't capture.
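
With enough data, I'd compare two implementations by looking at several
quantiles of their timing distributions rather than a single summary.
Roughly, with implA/implB standing in for the two candidates:

    ta = [@elapsed implA(x) for i in 1:1000]
    tb = [@elapsed implB(x) for i in 1:1000]
    quantile(ta, [0.5, 0.9, 0.99])
    quantile(tb, [0.5, 0.9, 0.99])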

 -- John

On Jun 4, 2014, at 9:36 AM, Tamas Papp <[email protected]> wrote:

> Sorry if I missed something in this discussion, but why are you assuming
> that the CPU instructions per se are the bottleneck, instead of, say,
> memory, and in particular how your algorithm interacts with the various
> levels of CPU cache? Heavily optimized numerical algorithms go to a lot of
> effort to optimize the latter, as fetching something from main memory costs
> on the order of ~100 CPU cycles.
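> 
> For example (just a sketch; Julia stores arrays column-major), these two
> functions execute essentially the same instructions, but the cache-friendly
> one can be several times faster:
> 
>     function sumcols(A)   # inner loop walks down a column: contiguous access
>         s = 0.0
>         for j in 1:size(A,2), i in 1:size(A,1)
>             s += A[i,j]
>         end
>         s
>     end
> 
>     function sumrows(A)   # inner loop walks across a row: strided access
>         s = 0.0
>         for i in 1:size(A,1), j in 1:size(A,2)
>             s += A[i,j]
>         end
>         s
>     end
> 
>     A = rand(1000, 1000)  # compare @elapsed sumcols(A) vs @elapsed sumrows(A)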
> 
> I am not a computer scientist, but I am not sure that such a
> theoretically pure measurement exists. If there is a lot of variation in
> execution times _with the same input_, I guess it would be worthwhile
> to understand the cause. GC would be a prime suspect in dynamic
> languages, but I don't yet know Julia well enough to see if this
> generalizes to your case.
> 
> Best,
> 
> Tamas
> 
> On Wed, Jun 04 2014, Joshua Job <[email protected]> wrote:
> 
>> I want a theoretically pure measurement for an algorithm. I'm not sure how 
>> I would go about counting CPU instructions, however, since the algorithm is 
>> only a heuristic one and my only guide for how long it takes on a given set 
>> of instances is to run it and see how long it took. Is there some way of 
>> getting a count of CPU instructions due to the execution of a particular 
>> function in Julia? 
>> 
>> I want to estimate how long a user would have to run the algorithm in order 
>> to achieve 99% confidence that it will have succeeded at least once in 
>> finding the true solution to the problem(s). The obvious way to do that is 
>> to run the algorithm many times (say, 10^4) and take the 99th percentile of 
>> that distribution and call it T. If one ran the algorithm for time T, then 
>> we know empirically that 99 times out of 100 it will find a solution in a 
>> time less than or equal to T.
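>> 
>> In code, that's roughly (run_until_solved standing in for the heuristic):
>> 
>>     times = [@elapsed run_until_solved(instance) for i in 1:10^4]
>>     T = quantile(times, 0.99)   # 99% of runs succeed within time T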
>> 
>> I could use means or medians of course, but those are much less sensitive 
>> to the tail of the distribution than a high percentile, and I'm really 
>> interested in how long it takes to achieve confidence in the results, not 
>> just the average time to solution. Hence the desire to time the algorithm 
>> runs.
>> 
>> In any case I may have underestimated the runtime of the algorithm in 
>> question --- it appears for interesting problem sizes the C code for it 
>> takes on the order of a millisecond or longer as opposed to tens of 
>> microseconds. I suspect then that the @time macro or tic/toc should be 
>> accurate enough for that purpose.
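>> 
>> Per run that would be something like (again with run_until_solved as a 
>> placeholder):
>> 
>>     tic()
>>     run_until_solved(instance)
>>     t = toq()   # like toc(), but returns the elapsed seconds without printing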
>> 
>> On Tuesday, June 3, 2014 7:36:12 PM UTC-7, John Myles White wrote:
>>> 
>>> I’m not sure there’s any single correct way to do benchmarks without 
>>> information about what you’re trying to optimize.
>>> 
>>> If you’re trying to optimize the experience of people using your code, I 
>>> think it’s important to use means rather than medians because you want to 
>>> use a metric that's affected by the entire shape of the distribution of 
>>> times and not entirely determined by the "center" of that distribution.
>>> 
>>> If you want a theoretically pure measurement for an algorithm, I think 
>>> measuring time is kind of problematic. For algorithms, I’d prefer seeing a 
>>> count of CPU instructions.
>>> 
>>> — John
>>> 
>>> On Jun 2, 2014, at 7:32 PM, Kevin Squire <[email protected]> wrote:
>>> 
>>> I think that, for many algorithms, triggering gc() is simply a matter of 
>>> running the simulation for enough iterations. By calling gc() ahead of 
>>> time, you should be able to get the same number (n > 0) of gc() calls in 
>>> each timed run, which isn't the same as ignoring gc().  
>>> 
>>> That said, it can take some effort to figure out the number of iterations 
>>> and time to run the experiment. 
>>> 
>>> Cheers, Kevin 
>>> 
>>> On Monday, June 2, 2014, Stefan Karpinski <[email protected]> wrote:
>>> 
>>>> I feel that ignoring gc can be a bit of a cheat since it does happen and 
>>>> it's quite expensive – and other systems may be better or worse at it. Of 
>>>> course, it can still be good to separate the cause of slowness explicitly 
>>>> into execution time and overhead for things like gc.
>>>> 
>>>> 
>>>> On Mon, Jun 2, 2014 at 5:21 PM, Kevin Squire <[email protected]> 
>>>> wrote:
>>>> 
>>>>> Thanks John.  His argument definitely makes sense (that algorithms that 
>>>>> cause more garbage collection won't get penalized by median, unless, of 
>>>>> course, they cause gc() to occur more than 50% of the time).  
>>>>> 
>>>>> Most benchmarks of Julia code that I've done (or seen) have made some 
>>>>> attempt to take gc() differences out of the equation, usually by 
>>>>> explicitly calling gc() before the timing begins. For most algorithms, 
>>>>> that would mean that the same number of gc() calls should occur for each 
>>>>> repetition, in which case I would think that any measure of central 
>>>>> tendency (including mean and median) would be useful.  
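>>>>> 
>>>>> I.e., the pattern I have in mind is roughly (f/x standing in for the 
>>>>> code under test):
>>>>> 
>>>>>     times = Float64[]
>>>>>     for i in 1:1000
>>>>>         gc()                          # same heap state before every repetition
>>>>>         push!(times, @elapsed f(x))
>>>>>     end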
>>>>> 
>>>>> Is there a problem with this reasoning?
>>>>> 
>>>>> Cheers,
>>>>>   Kevin
>>>>> 
>>>>> 
>>>>> On Mon, Jun 2, 2014 at 1:04 PM, John Myles White <[email protected]> wrote:
>>>>> 
>>>>>> For some reasons why one might not want to use the median, see 
>>>>>> http://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>> On Jun 2, 2014, at 11:06 AM, Kevin Squire <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> median is probably also useful. I like it a little better in cases 
>>>>>> where the code being tested triggers gc() more than half the time. 
>>>>>> 
>>>>>> On Monday, June 2, 2014, Steven G. Johnson <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Monday, June 2, 2014 1:01:25 AM UTC-4, Jameson wrote: 
>>>>>>>> 
>>>>>>>> Therefore, for benchmarks, you should execute your code in a loop 
>>>>>>>> enough times that the measurement error (of the hardware and OS) is 
>>>>>>>> not too 
>>>>>>>> significant.
>>>>>>>> 
>>>>>>> 
>>>>>>> You can also often benchmark multiple times and take the minimum (not 
>>>>>>> the mean!) time for reasonable results with fairly small time intervals.
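>>>>>>> 
>>>>>>> E.g., a sketch (f/x standing in for the code under test):
>>>>>>> 
>>>>>>>     t = minimum([@elapsed f(x) for k in 1:100])   # best case over 100 repeats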
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> 