I like assembler, and I do use SIMD intrinsics in some of my code (not R), but sparingly.

The issue is not only portability between platforms, but also portability between processors: if you write your optimized code using AVX, it might not take advantage of newer AVX512 CPUs.

In many cases your compiler will do the right thing and optimize your code.

I suggest:

* write your code in plain C, test it with some long computation and use "perf top" on Linux to observe the code hotspots and which assembler instructions are being used.

* if you see instructions like "addps" these are vectorized. If you see instructions like "addss" these are *not* vectorized.

* if you see a few instructions as hotspots with arguments in parentheses, such as "vmovaps %xmm1,(%r8)", then you are likely limited by memory access.

* If you are not limited by memory access and the compiler produces a lot of "addss" or similar that are hotspots, then you need to look at your code and make it more parallelizable.

   * How to make your C code more parallelizable:

   You want loops that are easy for the compiler to interpret, like

         for(i=start;i<stop;i++) {

You can help the compiler by using the "restrict" keyword to indicate that arrays do not overlap, or (as a sledgehammer) "#pragma ivdep" (spelled "#pragma GCC ivdep" for gcc). But before adding keywords, check with "perf top" which code is actually a hotspot, as the compiler can generate good code without restrict keywords by using multiple code paths.
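As a minimal sketch of such a loop with "restrict" (the function name scaled_add and its arguments are made up for illustration, not taken from any package):

```c
#include <stddef.h>

/* Hypothetical example: y[i] += a * x[i] for i in [start, stop).
   "restrict" promises the compiler that x and y do not overlap,
   so it does not need to emit a runtime overlap check. */
void scaled_add(float *restrict y, const float *restrict x,
                float a, size_t start, size_t stop)
{
    for (size_t i = start; i < stop; i++) {
        y[i] += a * x[i];
    }
}
```

Whether this actually ends up vectorized depends on the compiler and flags, so it is still worth confirming with "perf top" as described above.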

* You can create small temporary arrays to make your algorithm look more like the loops above. The small arrays should be at least 16 wide, because AVX512 has instructions that operate on 16 floats at a time.

* To allow use of small arrays you can unroll your loops. Note that compilers do unrolling themselves, so doing it manually is only helpful if this makes the inner body of the loop more parallelizable.

* You can debug why the compiler does not parallelize your code by turning on diagnostics. For gcc the flag is "-fopt-info-vec-missed=vec_info.txt"

* In very rare cases you use intrinsics. For me this is typically a situation when I need to find both the value and the index of a maximum or minimum in an array. Compilers do not optimize this well, at least for the many different ways of coding this in C that I tried many years ago.

* If after all your work you got a factor of 2 speedup, you are doing fine. If you want a larger speedup, change your algorithm.
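To illustrate the small-temporary-array idea on the maximum-with-index problem, here is a sketch in plain C. The names max_with_index and LANES are hypothetical, and this is only one possible layout, not the code I actually use. Each of the 16 lanes tracks its own running maximum, and the lanes are combined at the end.

```c
#include <stddef.h>

#define LANES 16  /* 16 floats fit in one AVX512 register */

/* Hypothetical helper: returns the index of the first maximum of
   x[0..n-1] and stores its value in *max_out. Requires n > 0.
   NaN values are not handled. */
size_t max_with_index(const float *x, size_t n, float *max_out)
{
    float best[LANES];
    size_t best_idx[LANES];
    size_t i, k;

    /* Seed every lane with element 0 so lanes that see no data stay valid. */
    for (k = 0; k < LANES; k++) {
        best[k] = x[0];
        best_idx[k] = 0;
    }

    /* Main loop: lane k tracks the maximum of x[k], x[k+16], x[k+32], ... */
    for (i = 0; i + LANES <= n; i += LANES) {
        for (k = 0; k < LANES; k++) {
            if (x[i + k] > best[k]) {
                best[k] = x[i + k];
                best_idx[k] = i + k;
            }
        }
    }

    /* Leftover elements when n is not a multiple of LANES. */
    for (k = 0; i + k < n; k++) {
        if (x[i + k] > best[k]) {
            best[k] = x[i + k];
            best_idx[k] = i + k;
        }
    }

    /* Combine the lanes; ties go to the smaller index. */
    float m = best[0];
    size_t m_idx = best_idx[0];
    for (k = 1; k < LANES; k++) {
        if (best[k] > m || (best[k] == m && best_idx[k] < m_idx)) {
            m = best[k];
            m_idx = best_idx[k];
        }
    }
    *max_out = m;
    return m_idx;
}
```

I make no promise that a given compiler vectorizes the inner loop. Mixing float values with size_t indices may well get in the way, so check the result with "perf top" and the gcc diagnostics flag mentioned above before relying on it.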


Vladimir Dergachev

On Wed, 27 Mar 2024, Dirk Eddelbuettel wrote:

On 27 March 2024 at 08:48, jesse koops wrote:
| Thank you, I was not aware of the easy way to search CRAN. I looked at
| rcppsimdjson of course, but couldn't figure it out since it is done in
| the simdjson library if I interpret it correctly, not within the R
| ecosystem and I didn't know how that would change things. Writing R
| extensions assumes a lot of prior knowledge so I will have to work my
| way up to there first.

I think I have (at least) one other package doing something like this _in the
library layer too_ as suggested by Tomas, namely crc32c as used by digest.
You could study how crc32c [0] does this for x86_64 and arm64 to get hardware
optimization. (This may be more specific cpu hardware optimization but at
least the library and cmake files are small.)

I decided as a teenager that assembler wasn't for me and haven't looked back,
but I happily take advantage of it when bundled well. So strong second for
the recommendation by Tomas to rely on this being done in an external and
tested library.

(Another interesting one there is highway [1]. Just packaging that would
likely be an excellent contribution.)


[0] repo: https://github.com/google/crc32c
[1] repo: https://github.com/google/highway
   docs: https://google.github.io/highway/en/master/

| Op di 26 mrt 2024 om 15:41 schreef Dirk Eddelbuettel <e...@debian.org>:
| >
| >
| > On 26 March 2024 at 10:53, jesse koops wrote:
| > | How can I make this portable and CRAN-acceptable?
| >
| > By writing (or borrowing?) some hardware detection via either configure /
| > autoconf or cmake. This is no different than other tasks decided at 
| >
| > Start with 'Writing R Extensions', as always, and work your way up from
| > there. And if memory serves there are already a few other packages with SIMD
| > at CRAN so you can also try to take advantage of the search for a 'token'
| > (here: 'SIMD') at the (unofficial) CRAN mirror at GitHub:
| >
| >    https://github.com/search?q=org%3Acran%20SIMD&type=code
| >
| > Hth, Dirk
| >
| > --
| > dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

R-package-devel@r-project.org mailing list
