I'm looking into better sorting for Julia. [Has anyone tried bypassing the cache for better sort performance (in other languages), or for other things? Is quicksort the default for floating point in Julia because stability (merge sort, the default otherwise, is stable) is not judged important? I might look into sorting floating point too, if it should be a priority (HPC seems to be more about matrix multiply than sorting..). Any pointers on the best sorting algorithms are welcome; I'm looking into samplesort and radix sort for Unicode-aware strings and/or integers/floating point. Has bypassing been tried and judged not important?]

For big arrays, regular loads and stores might be bad for performance. If bypassing the cache on loads is not possible on x86, please let me know.. I just remembered, and confirmed, this about the Itanium:

http://www.realworldtech.com/mckinley/6/
"To keep the data cache as simple and fast as possible the McKinley will likely stick with the Itanium’s choice of directing FP loads and stores directly to the L2 cache"

I vaguely remember instructions to bypass more levels.

In the Julia manual I only see things that seem unrelated: llvmcall (and nothing more; can I control the LLVM instructions generated? I'm not even sure LLVM supports this..), unsafe_load and unsafe_store!.

Ideally I would like this to be portable (both to new CPUs and operating systems..): bypass the cache where possible, and if the processor doesn't support that, have regular loads and stores generated.

http://lwn.net/Articles/255364/

6.1 Bypassing the Cache

When data is produced and not (immediately) consumed again, the fact that memory store operations read a full cache line first and then modify the cached data is detrimental to performance. This operation pushes data out of the caches which might be needed again in favor of data which will not be used soon. [..] For the x86 and x86-64 architectures a number of intrinsics are provided by gcc:

    #include <emmintrin.h>
    void _mm_stream_si32(int *p, int a);
    void _mm_stream_si128(int *p, __m128i a);
    void _mm_stream_pd(double *p, __m128d a);

    #include <xmmintrin.h>
    void _mm_stream_pi(__m64 *p, __m64 a);
    void _mm_stream_ps(float *p, __m128 a);

    #include <ammintrin.h>
    void _mm_stream_sd(double *p, __m128d a);
    void _mm_stream_ss(float *p, __m128 a);

Would something like this be inlined (if not, it might not be fast enough)? Am I on the right path looking into this, or is there a better Julia way?

-- Palli.
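
To make the question concrete, here is a minimal C sketch of the kind of streaming store the LWN excerpt describes, with a fallback to regular stores. The function name fill_streaming is hypothetical (it is not from any library), and the sketch assumes a GCC- or Clang-style compiler that defines __SSE2__ when SSE2 is available. It only shows the store side; it says nothing about whether this actually helps a sort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#endif

/* Hypothetical helper: set dst[0..n) to value.
 * With SSE2, the 16-byte-aligned middle part is written with
 * non-temporal (cache-bypassing) stores; otherwise, and for the
 * unaligned head and the tail, regular stores are used. */
static void fill_streaming(int32_t *dst, size_t n, int32_t value)
{
    size_t i = 0;
    assert(dst != NULL);

#if defined(__SSE2__)
    /* Regular stores until dst + i is 16-byte aligned, which
     * _mm_stream_si128 requires. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i++] = value;
    }

    /* Stream four 32-bit values per store. */
    __m128i v = _mm_set1_epi32(value);
    for (; i + 4 <= n; i += 4) {
        _mm_stream_si128((__m128i *)(dst + i), v);
    }

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();
#endif

    /* Tail elements, or the whole array when SSE2 is not available. */
    for (; i < n; i++) {
        dst[i] = value;
    }
}
```

Called as fill_streaming(buf, len, 0), this gives the same result on either path; only the instructions used for the stores differ. Whether the streaming path is faster for a given array size would have to be measured.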
