The pull request, as requested, is at https://github.com/ispc/ispc/pull/1227. 
My thanks to my employer for sponsoring the improvement.

Thanks for the help, guys, and for the product. For the same effort as hand 
porting all our SSE and AVX intrinsic code to ARM NEON, we got an ISPC port 
instead, which supports everything we need now and into the future.

In case anyone is interested, we have a Python script which calls ISPC both 
on Windows and via the Windows Subsystem for Linux to generate assembler 
files for ARM NEON float x4, SSE2 float x4, AVX float x8 and AVX2 float x16, 
for the x86, x64-msvc and x64-sysv calling conventions (ISPC has to be run 
under both Windows and Linux to get output in both the x64-msvc and x64-sysv 
calling conventions). The Python script then parses those assembler files, 
translating the armhf output ISPC generates into armel for Android ARM and 
applying a few other hand tweaks, and the results are committed directly to 
source control as they rarely change.

During the build, CMake assembles the AT&T-syntax files as normal for 
Linux/BSD/OS X/Android. On Windows we abuse the MinGW-w64 GNU as assembler 
to generate an MSVC-compatible .obj file from the ISPC output, which is AT&T 
syntax but follows the MSVC calling convention. Visual Studio then links 
that in as normal. Believe it or not, it all works a treat.
Choosing ISPC to generate assembler instead of writing it by hand was a 
risk, but a very successful one, and I'm sad to say we are now done with 
optimisation and moving on to other, far removed topics. Nevertheless, 
thanks once again. You may also like to know that the only reason we heard 
of your work is that I sit on the Programme Committee for CppCon, where 
this (now accepted) talk 
https://cppcon2016.sched.org/event/7nKw/spmd-programming-using-c-and-ispc 
was one of those assigned to me for review. So many thanks to that student 
for bringing ISPC to our attention!
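
A minimal sketch of the ISPC-driving half of such a script is shown here as 
an illustration only, not the script described above. The source file name 
`kernels.ispc`, the output file names, the `wsl` prefix and the split of 
outputs between hosts are all assumptions, as is the mapping of the four 
widths onto the ISPC targets `sse2-i32x4`, `avx1-i32x8`, `avx2-i32x16` and 
`neon-i32x4`. It uses ISPC's `--target`, `--arch`, `--emit-asm` and `-o` 
options and relies on the x64 calling convention following the OS that ISPC 
runs on.

```python
import subprocess

SOURCE = "kernels.ispc"  # hypothetical ISPC source file

# Assumed ISPC target names for SSE2 x4, AVX x8 and AVX2 x16.
X86_TARGETS = {
    "sse2": "sse2-i32x4",
    "avx": "avx1-i32x8",
    "avx2": "avx2-i32x16",
}
NEON_TARGET = "neon-i32x4"  # assumed name for NEON float x4


def ispc_argv(target, arch, out):
    """Build one ISPC command line that emits assembler rather than an object."""
    return ["ispc", SOURCE, "--target=" + target, "--arch=" + arch,
            "--emit-asm", "-o", out]


def jobs_for_host(host):
    """Return (output file, argv) pairs to run on the given host OS.

    Assumption: the x64 calling convention follows the OS ISPC runs on,
    so the Windows host yields x64-msvc and the Linux host yields x64-sysv,
    plus (by choice here) 32-bit x86 and NEON.
    """
    jobs = []
    if host == "windows":
        for name, target in X86_TARGETS.items():
            out = f"{name}_x64-msvc.s"
            jobs.append((out, ispc_argv(target, "x86-64", out)))
    elif host == "linux":
        for name, target in X86_TARGETS.items():
            for arch, conv in (("x86-64", "x64-sysv"), ("x86", "x86")):
                out = f"{name}_{conv}.s"
                jobs.append((out, ispc_argv(target, arch, out)))
        out = "neon_armhf.s"
        jobs.append((out, ispc_argv(NEON_TARGET, "arm", out)))
    else:
        raise ValueError("unknown host: " + host)
    return jobs


def run(host, prefix=(), dry_run=True):
    """Print (or, with dry_run=False, execute) every job for one host.

    prefix is prepended to each command, e.g. ("wsl",) to reach the Linux
    ISPC from Windows; whether that works depends on the local setup.
    """
    for _out, argv in jobs_for_host(host):
        cmd = list(prefix) + argv
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run("windows")
    run("linux", prefix=("wsl",))
```

Run as is, it only prints the ten command lines; passing `dry_run=False` 
actually invokes ISPC. The assembler post-processing and the armhf to armel 
translation are left out, since their details are not given here.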

Niall


On Friday, September 2, 2016 at 3:31:25 PM UTC+1, Niall Douglas wrote:
>
>
>>> It does seem very odd that LLVM wouldn't automatically inline a function 
>>> consisting of a single instruction.
>>
>
> I've discovered through trial and error that it is the lack of the 
> "readnone" attribute which causes LLVM not to inline the function. After 
> looking up that attribute I can see why that would be the case, and indeed 
> why its absence would penalise optimisation of the generated ARM NEON code: 
> LLVM must assume that any function not so marked may have changed global 
> memory state, and so may change outcomes. In particular, this severely 
> restricts the reordering of instructions LLVM can do.
>
> Quite a few of the ARM NEON builtins are missing "readnone". None of the 
> AVX builtins that I can see are missing it. I am surprised this problem 
> hasn't been raised before; it's very obvious from the assembler output.
>
>
>>
>> I've asked my employer for the time to send a pull request. If it's 
>> granted, happy to oblige.
>>
>
> I've been allowed this time by my employer, who wishes to remain 
> anonymous. I'll issue a pull request next week which applies nounwind, 
> readnone and alwaysinline to everything in the NEON builtins, using the AVX 
> builtins as a guide. I think this should improve the optimisation quality 
> of the NEON output quite a bit wherever it uses the builtins.
>
> Niall
>
>
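
A quick way to find the builtins in question is to scan the `define` lines 
of a builtins `.ll` file for a missing `readnone`. The script below is a 
hypothetical sketch; it is not part of ISPC, and the sample lines in its 
test are illustrative rather than copied from the ISPC sources. It only 
reads attributes written inline after the parameter list, does not resolve 
attribute groups such as `#0`, and also accepts `memory(none)`, the 
spelling that replaced `readnone` in LLVM 16.

```python
import re
import sys

# "define", optional linkage/return type, then "@name(".
NAME_RE = re.compile(r"^\s*define\b[^@]*@([\w.$]+)\(")
PURE_ATTRS = ("readnone", "memory(none)")


def split_define(line):
    """Return (name, attribute text) for a one-line `define`, or None."""
    m = NAME_RE.match(line)
    if not m:
        return None
    # Walk to the parenthesis that closes the parameter list.
    depth, i = 1, m.end()
    while i < len(line) and depth:
        if line[i] == "(":
            depth += 1
        elif line[i] == ")":
            depth -= 1
        i += 1
    # Everything between the parameter list and the body is attributes.
    attrs = line[i:].split("{", 1)[0]
    return m.group(1), attrs


def missing_readnone(lines):
    """Names of defined functions whose inline attributes omit readnone."""
    missing = []
    for line in lines:
        parsed = split_define(line)
        if parsed is None:
            continue  # not a define (e.g. a declare of an intrinsic)
        name, attrs = parsed
        tokens = attrs.split()
        if not any(a in tokens for a in PURE_ATTRS):
            missing.append(name)
    return missing


if __name__ == "__main__":
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as f:
            for name in missing_readnone(f):
                print(name)
    else:
        print("usage: audit_readnone.py builtins.ll")
```

It flags candidates rather than fixing them: a builtin that reads or writes 
memory, such as a masked store, must not be marked readnone, so each hit 
needs checking by hand.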

-- 
You received this message because you are subscribed to the Google Groups 
"Intel SPMD Program Compiler Users" group.