2016-02-02 21:44 GMT+00:00 Stefan Karpinski <[email protected]>:

> On Tue, Feb 2, 2016 at 4:22 PM, Páll Haraldsson <[email protected]>
> wrote:
>
>>
>> I'm looking into doing better sorting for Julia.
>>
>> [Has anyone tried cache bypassing for better sort performance (in other
>> languages), or for other things? Is Quicksort the default in Julia for
>> floating point because a stable sort (mergesort, the default otherwise)
>> is not judged important?]
>>
>
> Stability is pointless and inefficient for types for which equal values
> are indistinguishable.
>

Yes, of course. Maybe I wrongly assumed that if the first part of the key is
a floating-point number, then Quicksort is used.

Anyway, is floating-point sorting considered very important, and up to what
kind of n? This is just an intellectual exercise for me, and I have to make
up my mind where to start, i.e. whether it's more important than something
else, such as [Unicode] strings.


>
>
>> I might look into sorting floating point also, should it be a priority
>> (it seems HPC is more about matrix multiplication than sorting..). Any
>> pointers on the best sorting approaches? I'm looking into samplesort and
>> radix sort for Unicode-aware strings and/or integers/floating point.
>>
>
> Radix sort for ints and floats is implemented in the SortingAlgorithms
> <https://github.com/JuliaLang/SortingAlgorithms.jl> package and can be
> faster.
>

I know. I'm trying out my own algorithm, a kind of hybrid.
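
For context, the trick I had in mind for radix-sorting floats is to map each
value to an unsigned integer key whose order matches the numeric order, so an
integer radix sort can be reused. A minimal sketch in C (my assumption about
one common approach, not necessarily what SortingAlgorithms.jl does):

#include <stdint.h>
#include <string.h>

/* Sketch: map a float to a uint32_t key whose unsigned order matches the
   numeric order, so an integer radix sort can be reused for floats. */
static uint32_t float_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);          /* reinterpret the bits safely */
    /* negative floats: flip all bits (reverses their order and clears the
       sign bit); non-negative floats: set the sign bit */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

Negative values get all bits flipped and non-negative ones get the sign bit
set, so unsigned comparison of the keys matches the floating-point order
(modulo NaN handling).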


> It would be good to know under which circumstances it is faster or slower.
> I don't think radix sort has been applied to strings and I'm not sure if it
> would be faster.
>

Yes, possibly not.


>
>> Has bypassing been tried and judged not important?
>>
>
> I don't know what this means.
>
>> For big arrays, the regular loads and stores might be bad. If loads and
>> stores that bypass the cache are not possible on x86, then please let me
>> know. I just remembered, and confirmed, this about the Itanium:
>>
>>
>> http://www.realworldtech.com/mckinley/6/
>> "To keep the data cache as simple and fast as possible the McKinley will
>> likely stick with the Itanium’s choice of directing FP loads and stores
>> directly to the L2 cache"
>>
>> I vaguely remember instructions to bypass more levels.
>>
>>
>> In the Julia manual, I only see seemingly unrelated things: llvmcall (and
>> nothing more; can I control the LLVM instructions generated? I'm not even
>> sure LLVM supports this..), and unsafe_load and unsafe_store!.
>>
>> Ideally I would like this to be portable (both to new CPUs and operating
>> systems..): use cache-bypassing loads and stores where available, and have
>> regular loads and stores generated if the processor doesn't support them.
>>
>>
>>
>> http://lwn.net/Articles/255364/
>> 6.1 Bypassing the Cache
>>
>> When data is produced and not (immediately) consumed again, the fact that
>> memory store operations read a full cache line first and then modify the
>> cached data is detrimental to performance. This operation pushes data out
>> of the caches which might be needed again in favor of data which will not
>> be used soon.
>>
>> [..]
>>
>> For the x86 and x86-64 architectures a number of intrinsics are provided
>> by gcc:
>>
>> #include <emmintrin.h>
>> void _mm_stream_si32(int *p, int a);
>> void _mm_stream_si128(int *p, __m128i a);
>> void _mm_stream_pd(double *p, __m128d a);
>>
>> #include <xmmintrin.h>
>> void _mm_stream_pi(__m64 *p, __m64 a);
>> void _mm_stream_ps(float *p, __m128 a);
>>
>> #include <ammintrin.h>
>> void _mm_stream_sd(double *p, __m128d a);
>> void _mm_stream_ss(float *p, __m128 a);
>>
>>
>> Would something like this be inlined (if not, it might not be fast
>> enough)? Am I on the right path looking into this, or is there a better
>> Julia way?
>>
>
> Intrinsics are not real functions – they are constructs that look
> syntactically like functions and that gcc knows how to implement. I don't
> think LLVM exposes a mechanism for using any of these.
>

Thanks. Any idea if I could do it differently? Inject LLVM bitcode?
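
For what it's worth, using the intrinsics quoted above directly from C would
look roughly like the sketch below; from Julia I assume it would still have
to go through ccall into a small C shim like this, since nothing in the
manual seems to expose it directly (stream_copy is just my name for
illustration):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128;
                            _mm_sfence comes in via xmmintrin.h */

/* Sketch: copy 32-bit elements with non-temporal (streaming) stores so the
   destination bypasses the cache. Assumes the destination is 16-byte
   aligned and that the copied data will not be read again soon. */
static void stream_copy(int32_t *dst, const int32_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);   /* streaming store */
    }
    for (; i < n; i++)                 /* plain stores for the tail */
        dst[i] = src[i];
    _mm_sfence();   /* streaming stores are weakly ordered; fence before
                       the data is used elsewhere */
}

The _mm_sfence() at the end is needed because non-temporal stores are weakly
ordered relative to ordinary stores.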

-- 
Palli.
