On Tue, Feb 2, 2016 at 4:22 PM, Páll Haraldsson <[email protected]>
wrote:

>
> I'm looking into doing better sorting for Julia.
>
> [Has anyone tried bypassing the cache for better sort performance (in other
> languages), or for other things? Is Quicksort the default in Julia for
> floating point because a stable sort (merge-sort, the default otherwise) is
> not judged important?]
>

Stability is pointless and inefficient for types whose equal values are
indistinguishable: swapping two equal Float64s, for example, has no observable
effect, so preserving their original order buys nothing. Stability only matters
when elements carry data beyond the sort key.


> I might look into sorting floating point also, should it be a priority
> (HPC seems to be more about matrix multiply than sorting..). Any pointers on
> the best sorting approaches? I'm looking into samplesort and radix sort for
> Unicode-aware strings and/or integers/floating point.
>

Radix sort for ints and floats is implemented in the SortingAlgorithms
<https://github.com/JuliaLang/SortingAlgorithms.jl> package and can be
faster. It would be good to know under which circumstances it is faster or
slower. I don't think radix sort has been applied to strings and I'm not
sure if it would be faster.
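
For anyone who wants to compare, here's a minimal usage sketch (assuming the
package is installed with Pkg.add("SortingAlgorithms")):

    using SortingAlgorithms          # provides RadixSort, TimSort, HeapSort

    v = rand(10^6)                   # a million random Float64 values

    sorted_default = sort(v)                  # default comparison sort
    sorted_radix   = sort(v, alg=RadixSort)   # radix sort from the package

    @assert sorted_default == sorted_radix

Timing both variants with @time on data shaped like yours is probably the
quickest way to see where radix sort actually wins.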


> Has bypassing been tried and judged not important?
>

I don't know what this means.

> For big arrays, the regular (cached) loads and stores might be bad. In case
> bypassing the cache on loads and stores is, say, not possible on an x86, then
> please let me know.. I just remembered, and confirmed, this about the Itanium:
>
>
> http://www.realworldtech.com/mckinley/6/
> "To keep the data cache as simple and fast as possible the McKinley will
> likely stick with the Itanium’s choice of directing FP loads and stores
> directly to the L2 cache"
>
> I vaguely remember instructions to bypass more levels.
>
>
> In the Julia manual, I only see seemingly unrelated things: llvmcall (can I
> control the LLVM instructions generated? I'm not even sure LLVM supports
> this..), and unsafe_load and unsafe_store!.
>
> Ideally I would like this to be portable (both to new CPUs and operating
> systems..): bypass the cache where the processor supports it, and generate
> regular loads and stores where it doesn't.
>
>
>
> http://lwn.net/Articles/255364/
> 6.1 Bypassing the Cache
>
> When data is produced and not (immediately) consumed again, the fact that
> memory store operations read a full cache line first and then modify the
> cached data is detrimental to performance. This operation pushes data out
> of the caches which might be needed again in favor of data which will not
> be used soon.
>
> [..]
>
> For the x86 and x86-64 architectures a number of intrinsics are provided
> by gcc:
>
> #include <emmintrin.h>
> void _mm_stream_si32(int *p, int a);
> void _mm_stream_si128(int *p, __m128i a);
> void _mm_stream_pd(double *p, __m128d a);
>
> #include <xmmintrin.h>
> void _mm_stream_pi(__m64 *p, __m64 a);
> void _mm_stream_ps(float *p, __m128 a);
>
> #include <ammintrin.h>
> void _mm_stream_sd(double *p, __m128d a);
> void _mm_stream_ss(float *p, __m128 a);
>
>
> Would something like this be inlined (if not, it might not be fast enough)?
> Am I on the right path looking into this, or is there a better Julia way?
>

Intrinsics are not real functions: they are constructs that look syntactically
like functions but which gcc knows how to implement directly. I don't think
LLVM exposes a mechanism for using any of these.
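
For what it's worth, the unsafe_load/unsafe_store! functions mentioned above
compile to ordinary loads and stores that go through the cache hierarchy; they
don't give you streaming/non-temporal behavior. A purely illustrative sketch:

    # Ordinary (cached) stores via unsafe_store!; nothing here bypasses the
    # cache the way the _mm_stream_* intrinsics quoted above do.
    v = zeros(Int32, 1024)
    p = pointer(v)          # valid as long as v stays referenced
    for i in 1:length(v)
        unsafe_store!(p, Int32(i), i)   # store i into the i-th element of v
    end

So if cache-bypassing stores turn out to matter for the sorting experiments,
they aren't something Julia's current public API gives you.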
