I know numpy has multi-threading in a bunch of their vectorized functions.

On Tue, Aug 26, 2014 at 11:46 AM, Iain Dunning <[email protected]>
wrote:

> Is it possible that PyPy is multithreading something? I struggle to
> believe its so much faster - although I'd believe it if it was about the
> same.
>  On Aug 26, 2014 11:31 AM, "Simon Kornblith" <[email protected]> wrote:
>
>> Here's an interesting comparison:
>>
>> https://gist.github.com/simonster/6195af68c6df33ca965d
>>
>> idiv for 64-bit integers is one of the most expensive extant x86-64
>> instructions. For 32-bit integers, it is much cheaper, and this function
>> runs nearly twice as fast. When LLVM knows the divisor in advance, it can
>> avoid idiv and perform the division as a multiplication and bit shift,
>> which is faster still. But the biggest gain comes from avoiding rem in the
>> inner loop entirely.
>>
>> Of course, this is only tackling the odd case, and additional performance
>> can be gained for all of these benchmarks by avoiding allocation.
>>
>> Simon
>>
>> On Tuesday, August 26, 2014 9:11:47 AM UTC-4, Phillip Berndt wrote:
>>>
>>>
>>> (1) allocate the output M outside of the core algorithm, and pass it as
>>>> an
>>>> input, i.e.,
>>>>
>>>
>>> I did that, though it can be argued that this is cheating given that the
>>> competitors also have to allocate an array for each loop. With that version
>>> (and some more slight optimization: Storing intermediate values in the for
>>> loops, using column-major indexing and @simd) [https://gist.github.com/
>>> phillipberndt/7dc0aed7eb855f900f0d/21cce76664bdc59f6203ff6f3496e8
>>> 0e256f54cb], the overall time for the N=3..1000 test case is down to
>>> 3.67s.
>>>
>>> (2) @time (for i = 1:100; magic!(M); end). Did it allocate any memory?
>>>> Then
>>>> you have a problem. Use the profiler, or run julia with --track-
>>>> allocation=user, to find out where it occurs.
>>>
>>>
>>> It does, about 3 Mb on line 2 (if n % 2 == 1). Doesn't make much sense
>>> so I guess the profiler interfered with the optimizer here?! I doubt that
>>> trying to get rid of the 3Mb will gain another second though.
>>>
>>>
>>>> (3) Even if it's not allocating, you may have a bottleneck. Use the
>>>> profiler to
>>>> find it.
>>>>
>>>
>>> The line where the most time is spent is line 11, filling the array in
>>> the odd case. I don't see how it could be optimized any further, so that's
>>> probably as far as one gets?!
>>>
>>> - Phillip
>>>
>>

Reply via email to