On Tue, Feb 2, 2016 at 4:22 PM, Páll Haraldsson <[email protected]> wrote:
> I'm looking into doing better sorting for Julia.
>
> Anyone tried bypassing for better sort performance (in other languages)
> or for other things? Is Quicksort in Julia the default for floating
> point because a stable sort (merge sort, otherwise the default) is not
> judged important?

Stability is pointless and inefficient for types for which equal values
are indistinguishable.

> I might look into sorting floating point also, should it be a priority
> (seems HPC is more about matrix multiply than sorting..). Any pointers
> on best sorting? I'm looking into samplesort and radix sort for
> Unicode-aware strings and/or integers/floating point.

Radix sort for ints and floats is implemented in the SortingAlgorithms
<https://github.com/JuliaLang/SortingAlgorithms.jl> package and can be
faster. It would be good to know under which circumstances it is faster
or slower. I don't think radix sort has been applied to strings and I'm
not sure if it would be faster.

> Has bypassing been tried and judged not important?

I don't know what this means.

> For big arrays, the regular loads and stores might be bad. In case,
> say, loading and bypassing is not possible on an x86, then please let
> me know.. I just remember and confirmed from the Itanium:
>
> http://www.realworldtech.com/mckinley/6/
>
> "To keep the data cache as simple and fast as possible the McKinley
> will likely stick with the Itanium’s choice of directing FP loads and
> stores directly to the L2 cache"
>
> I vaguely remember instructions to bypass more levels.
>
> In the Julia manual, I just see unrelated things (llvmcall; can I
> control the LLVM instructions generated? not even sure they support
> this..; and unsafe_load and unsafe_store!).
>
> Ideally I would like this to be portable (both to new CPUs and
> operating systems..): bypassing where supported, with regular loads and
> stores generated if the processor doesn't support it.
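As an aside on the radix-sort point: radix sort works on floats because an
IEEE-754 bit pattern can be mapped to an unsigned integer whose unsigned
order matches the numeric order (flip all bits of a negative, flip only
the sign bit of a non-negative); float radix sorts like the one in
SortingAlgorithms are built on this kind of key mapping. A minimal C
sketch (the function names here are mine, not from any package, and NaNs
get no special treatment):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Map a float's bit pattern to a uint32_t whose *unsigned* order
   matches the float's numeric order: negatives get all bits flipped,
   non-negatives get only the sign bit flipped. */
static uint32_t float_to_key(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? ~u : (u ^ 0x80000000u);
}

static float key_to_float(uint32_t k) {
    uint32_t u = (k & 0x80000000u) ? (k ^ 0x80000000u) : ~k;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* LSD radix sort over 8-bit digits: four counting passes, each a
   single linear sweep over the keys. */
void radix_sort_floats(float *a, size_t n) {
    uint32_t *keys = malloc(n * sizeof *keys);
    uint32_t *tmp  = malloc(n * sizeof *tmp);
    if (!keys || !tmp) { free(keys); free(tmp); return; }
    for (size_t i = 0; i < n; i++)
        keys[i] = float_to_key(a[i]);
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)
            count[((keys[i] >> shift) & 0xFFu) + 1]++;
        for (int b = 0; b < 256; b++)   /* prefix sums -> bucket offsets */
            count[b + 1] += count[b];
        for (size_t i = 0; i < n; i++)
            tmp[count[(keys[i] >> shift) & 0xFFu]++] = keys[i];
        uint32_t *t = keys; keys = tmp; tmp = t;  /* ping-pong buffers */
    }
    /* four passes = even number of swaps, so `keys` holds the result */
    for (size_t i = 0; i < n; i++)
        a[i] = key_to_float(keys[i]);
    free(keys);
    free(tmp);
}
```

Each pass streams linearly over the keys, so whether this beats Quicksort
is mostly a question of array size versus cache behavior, which is exactly
the "under which circumstances" question above.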
> http://lwn.net/Articles/255364/
>
> 6.1 Bypassing the Cache
>
> When data is produced and not (immediately) consumed again, the fact
> that memory store operations read a full cache line first and then
> modify the cached data is detrimental to performance. This operation
> pushes data out of the caches which might be needed again in favor of
> data which will not be used soon.
>
> [..]
>
> For the x86 and x86-64 architectures a number of intrinsics are
> provided by gcc:
>
>     #include <emmintrin.h>
>     void _mm_stream_si32(int *p, int a);
>     void _mm_stream_si128(int *p, __m128i a);
>     void _mm_stream_pd(double *p, __m128d a);
>
>     #include <xmmintrin.h>
>     void _mm_stream_pi(__m64 *p, __m64 a);
>     void _mm_stream_ps(float *p, __m128 a);
>
>     #include <ammintrin.h>
>     void _mm_stream_sd(double *p, __m128d a);
>     void _mm_stream_ss(float *p, __m128 a);
>
> Would something like this be inlined (if not, it might not be fast
> enough)? Am I on the right path looking into this, or is there a better
> Julia way?

Intrinsics are not real functions – they are constructs that look
syntactically like functions but that gcc knows how to implement
directly. I don't think LLVM exposes a mechanism for using any of these.
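For reference, this is roughly what using one of those intrinsics looks
like from C (a sketch, x86/SSE2 only; the function name is mine).
_mm_stream_si32 compiles to a non-temporal store (movnti), and _mm_sfence
orders the streamed, write-combined stores ahead of later stores:

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32; pulls in _mm_sfence */
#include <stddef.h>

/* Hypothetical helper: fill a buffer with non-temporal stores, which
   go through write-combining buffers instead of pulling each cache
   line into the cache first. Appropriate when the destination will not
   be read again soon, e.g. the output of an out-of-place sort pass. */
void fill_streaming(int *dst, int value, size_t n) {
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(dst + i, value);  /* movnti: cache-bypassing store */
    _mm_sfence();  /* order streamed stores before subsequent stores */
}
```

On the inlining question: intrinsics impose no function-call overhead,
since the compiler emits the instruction in place. From Julia today
something like this would presumably have to go through ccall into a C
shim, given that these gcc intrinsics are not directly reachable.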
