2016-02-02 21:44 GMT+00:00 Stefan Karpinski <[email protected]>:

> On Tue, Feb 2, 2016 at 4:22 PM, Páll Haraldsson <[email protected]>
> wrote:
>
>> I'm looking into doing better sorting for Julia.
>>
>> [Anyone tried bypassing for better sort performance (in other languages)
>> or for other things? Is Quicksort in Julia the default for floating point,
>> because a stable sort (merge-sort, otherwise the default) is not judged
>> important?
>
> Stability is pointless and inefficient for types for which equal values
> are indistinguishable.
Yes, of course. Maybe I assumed wrongly that if the first part of the key is
a floating-point number, then Quicksort is used.. Anyway, is floating-point
sorting considered very important, and up to what kind of n? (This is just
an intellectual exercise for me and I have to make up my mind where to
start, i.e. whether it's more important than something else, such as
[Unicode] strings.)

>> I might look into sorting floating point also, should it be a priority
>> (seems HPC is more about matrix multiply than sorting..). Any pointers on
>> best sorting? I'm looking into samplesort and radix sort for Unicode-aware
>> strings and/or integers/floating point.
>
> Radix sort for ints and floats is implemented in the SortingAlgorithms
> <https://github.com/JuliaLang/SortingAlgorithms.jl> package and can be
> faster.

I know. I'm trying out my own algorithm, a kind of hybrid.. (A quick way to
compare is sketched at the end of this mail.)

> It would be good to know under which circumstances it is faster or slower.
> I don't think radix sort has been applied to strings and I'm not sure if
> it would be faster.

Yes, possibly not.

>> Has bypassing been tried and judged not important?
>
> I don't know what this means.

>> For big arrays, the regular loads and stores might be bad. In case, say,
>> loading and bypassing is not possible on x86, then please let me know..
>> I just remembered and confirmed from the Itanium:
>>
>> http://www.realworldtech.com/mckinley/6/
>> "To keep the data cache as simple and fast as possible the McKinley will
>> likely stick with the Itanium’s choice of directing FP loads and stores
>> directly to the L2 cache"
>>
>> I vaguely remember instructions to bypass more levels.
>>
>> In the Julia manual, I just see seemingly unrelated things (llvmcall,
>> nothing more; can I control the LLVM instructions generated? Not even
>> sure they support this.. Also unsafe_load and unsafe_store!).
>>
>> Ideally I would like this to be portable (both to new CPUs and operating
>> systems..): bypassing where supported, and regular loads and stores
>> generated if the processor doesn't support it.
>>
>> http://lwn.net/Articles/255364/
>> 6.1 Bypassing the Cache
>>
>> When data is produced and not (immediately) consumed again, the fact that
>> memory store operations read a full cache line first and then modify the
>> cached data is detrimental to performance. This operation pushes data out
>> of the caches which might be needed again in favor of data which will not
>> be used soon.
>>
>> [..]
>>
>> For the x86 and x86-64 architectures a number of intrinsics are provided
>> by gcc:
>>
>> #include <emmintrin.h>
>> void _mm_stream_si32(int *p, int a);
>> void _mm_stream_si128(int *p, __m128i a);
>> void _mm_stream_pd(double *p, __m128d a);
>>
>> #include <xmmintrin.h>
>> void _mm_stream_pi(__m64 *p, __m64 a);
>> void _mm_stream_ps(float *p, __m128 a);
>>
>> #include <ammintrin.h>
>> void _mm_stream_sd(double *p, __m128d a);
>> void _mm_stream_ss(float *p, __m128 a);
>>
>> Would something like this be inlined (if not, it might not be fast
>> enough)? Am I on the right path looking into this, or is there a better
>> Julia way?
>
> Intrinsics are not real functions – they are constructs that look like
> functions syntactically that gcc knows how to implement. I don't think
> LLVM exposes a mechanism for using any of these.

Thanks, any idea if I could do it differently? Inject LLVM bitcode? (See the
llvmcall sketch at the end of this mail.)

--
Palli.
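
PS: For comparing the sorting algorithms, something like this should do as a
starting point; a small, rough sketch (which algorithm sort() picks by
default depends on the Julia version):

    using SortingAlgorithms   # provides RadixSort (also TimSort, HeapSort)

    v = rand(10^6)

    # Base picks the default algorithm per element type; for numeric arrays
    # it has long been the unstable QuickSort, since stability buys nothing
    # when equal values are indistinguishable:
    Base.Sort.defalg(v)

    # RadixSort from SortingAlgorithms.jl is often faster for large arrays
    # of ints/floats, so time both for the sizes you care about (run twice,
    # or use a benchmarking package, to avoid timing compilation):
    @time sort(v);                   # default algorithm
    @time sort(v, alg=RadixSort);    # radix sort on the bit representation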

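PS: On the "inject LLVM bitcode" idea: LLVM has !nontemporal metadata on
store instructions (its counterpart of those _mm_stream_* streaming stores,
i.e. the MOVNT* instructions on x86-64), and Base.llvmcall can take textual
IR. Below is a minimal, untested sketch; the name nt_store! is made up, the
typed pointer syntax ("double*") may need to be spelled "ptr" on newer
Julia/LLVM, and the backend is free to ignore the hint for scalar stores (it
is honored more reliably for full-width vector stores), so @code_native
would have to confirm a movnt* instruction actually comes out:

    # Hypothetical cache-bypassing store of one Float64 through a pointer.
    function nt_store!(p::Ptr{Float64}, x::Float64)
        Base.llvmcall("""
            %ptr = inttoptr i64 %0 to double*
            store double %1, double* %ptr, align 8, !nontemporal !{i32 1}
            ret void
            """, Cvoid, Tuple{Ptr{Float64}, Float64}, p, x)
    end

    a = zeros(4)
    nt_store!(pointer(a, 2), 3.14)   # writes a[2] without (ideally) pulling
                                     # the cache line in first
    a                                # -> [0.0, 3.14, 0.0, 0.0]

llvmcall splices the IR directly into the generated code, so inlining should
not be the problem; whether bypassing the cache actually helps would still
need benchmarking, since streaming stores only pay off for data that won't
be read again soon.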