I'm looking into doing better sorting for Julia.

[Anyone tried bypassing for better sort performance (in other languages) or 
for other things? Is Quicksort in Julia the default for floating point, 
because a stable (merge-sort, otherwise the default) in not judged 
important? I might look into sorting floating point also, should it be a 
priority (seems HPC is more about matrix multiply than sorting..). Any 
pointers on best sorting, I'm looking into samplesort and radix sort for 
Unicode aware strings and/or integers/floating point. Has bypassing been 
tried and judged not important?]

For big arrays, the regular loads and stores might be bad. In case say 
loading and bypassing, is not possible on an x86, then please let me know.. 
I just remember and confirmed from the Itanium:


http://www.realworldtech.com/mckinley/6/
"To keep the data cache as simple and fast as possible the McKinley will 
likely stick with the Itanium’s choice of directing FP loads and stores 
directly to the L2 cache"

I vaguely remember instructions to bypass more levels.


In the Julia manual, I just see unrelated (llvmcall, nothing more, can I 
control LLVM instructions generated (not even sure they support 
this..); unsafe_load and unsafe_store!)

Ideally I would like this to be portable (both to new CPUs and operating 
systems..), bypassing, and if the processor doesn't support regular loads 
and stores generated.



http://lwn.net/Articles/255364/
6.1 Bypassing the Cache

When data is produced and not (immediately) consumed again, the fact that 
memory store operations read a full cache line first and then modify the 
cached data is detrimental to performance. This operation pushes data out 
of the caches which might be needed again in favor of data which will not 
be used soon.

[..]

For the x86 and x86-64 architectures a number of intrinsics are provided by 
gcc:

#include <emmintrin.h>
void _mm_stream_si32(int *p, int a);
void _mm_stream_si128(int *p, __m128i a);
void _mm_stream_pd(double *p, __m128d a);

#include <xmmintrin.h>
void _mm_stream_pi(__m64 *p, __m64 a);
void _mm_stream_ps(float *p, __m128 a);

#include <ammintrin.h>
void _mm_stream_sd(double *p, __m128d a);
void _mm_stream_ss(float *p, __m128 a);


Would something like this be inlined (if not then might not be fast 
enough?)? Am I on the right path looking into this or is there a better 
Julia way?

-- 
Palli.

Reply via email to