Re: How to use struct without gather/scatter performance warnings

David Nadaski Thu, 02 Apr 2020 13:27:04 -0700

Hi Jeff, I've tested the latest one with streaming store + prefetch and it 
is now at about the same speed as handmade AVX. This is great!
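
For anyone skimming the thread for the idea rather than the godbolt links, here is a minimal C++ sketch of what "streaming store + prefetch" means, written with plain SSE intrinsics. It is not the ISPC code from Jeff's links and not the benchmark code: Vec4, add_bias_streaming and kPrefetchDistance are made-up names, and the prefetch distance of 16 elements is an arbitrary assumption that would need tuning per machine.

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_prefetch, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Hypothetical 16-byte aligned 4-float vector, standing in for the project's vec4.
struct alignas(16) Vec4 { float data[4]; };

// Hypothetical tuning knob: how many elements ahead of the current one to prefetch.
constexpr std::size_t kPrefetchDistance = 16;

// Adds `bias` to every component of `in` and writes the result to `out`.
// Both arrays must be 16-byte aligned (alignas(16) on Vec4 takes care of that).
void add_bias_streaming(Vec4* out, const Vec4* in, std::size_t count, float bias)
{
    const __m128 b = _mm_set1_ps(bias);
    for (std::size_t i = 0; i < count; ++i)
    {
        // Hint the cache about data that will be read a few iterations from now.
        if (i + kPrefetchDistance < count)
            _mm_prefetch(reinterpret_cast<const char*>(&in[i + kPrefetchDistance]),
                         _MM_HINT_T0);

        const __m128 v = _mm_load_ps(in[i].data);
        // Non-temporal store: bypass the cache, since `out` is not re-read soon.
        _mm_stream_ps(out[i].data, _mm_add_ps(v, b));
    }
    // Make the streaming stores globally visible before returning.
    _mm_sfence();
}
```

Whether either trick helps depends on the data size and the memory system, so it is worth measuring with each one switched on and off.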


On Wednesday, April 1, 2020 at 4:20:26 PM UTC-5, Jeff Rous wrote:
>
> If you've got a large array of data to process, it can be helpful to 
> prefetch some distance ahead so the data is already in the cache when 
> execution gets there. ISPC can do that too.
>
> https://ispc.godbolt.org/z/JTCngb
>
> On Wednesday, April 1, 2020 at 11:17:19 AM UTC-7, Jeff Rous wrote:
>>
>> ISPC can do streaming stores like your intrinsics example too! Where they 
>> can be used, they can provide a good speedup.
>>
>> https://ispc.godbolt.org/z/m43bw9
>>
>> On Friday, March 27, 2020 at 11:08:47 AM UTC-7, David Nadaski wrote:
>>>
>>> Here is my performance benchmark with 51200000 vector4:
>>>
>>> explicit SSE2    : 135.575300 ms
>>> ISPC avx2-i32x8  : 131.927800 ms
>>> ISPC avx2-i32x16 : 124.073100 ms
>>> explicit AVX     : 90.134600 ms
>>>
>>> Performance relative to explicit SSE2 (100% + reduction in run time):
>>> explicit SSE2    : 100%
>>> ISPC avx2-i32x8  : 103%
>>> ISPC avx2-i32x16 : 108%
>>> explicit AVX     : 133%
>>>
>>> Based on this, my target for ISPC is +33% as seen with explicit AVX.
>>> I can check on other CPUs.
>>>
>>> Regards,
>>> David
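
A note on reading the percentages above: they work out to 100% plus the reduction in run time relative to explicit SSE2, which is not the same thing as a speedup factor. A small sketch of both calculations (the function names are only illustrative):

```cpp
// Percentage by which `ms` is shorter than the baseline run time.
double time_reduction_percent(double baseline_ms, double ms)
{
    return (baseline_ms - ms) / baseline_ms * 100.0;
}

// Speedup factor: how many times faster than the baseline.
double speedup_factor(double baseline_ms, double ms)
{
    return baseline_ms / ms;
}
```

With the timings above, explicit AVX comes out at roughly 33.5% less run time than SSE2, which is a speedup factor of about 1.5x; avx2-i32x16 comes out at about 8.5% less run time, or about 1.09x.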
>>>
>>> On Friday, March 27, 2020 at 1:52:41 AM UTC-5, Dmitry Babokin wrote:
>>>>
>>>> David,
>>>>
>>>> Have you managed to measure the ISPC speedup? Also, you mentioned the 
>>>> i7-4770K; it's a 4th gen Core (Haswell), the first one to support AVX2. 
>>>> It would be interesting to see whether Skylake (6th gen) and later show 
>>>> better speedups, since some aspects of AVX2 support were substantially 
>>>> improved.
>>>>
>>>> Dmitry.
>>>>
>>>> On Thu, Mar 26, 2020 at 8:31 PM David Nadaski <[email protected]> 
>>>> wrote:
>>>>
>>>>> Sorry for the late response, I was trying to put a proof of concept 
>>>>> explicit AVX2 implementation together (and got someone really 
>>>>> knowledgeable with intrinsics to help).
>>>>> The following performs 30% faster than my original SSE2 implementation 
>>>>> (on an i7-4770K), so I would expect around this much speedup w/ ISPC 
>>>>> against the AVX2 ISA.
>>>>>
>>>>> And once again I wanted to thank everyone who's been helping along, 
>>>>> you guys are great.
>>>>>
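>>>>> #include <immintrin.h> // AVX intrinsics
>>>>> #include <vector>
>>>>>
>>>>> // vec4 and mat4 are the project's own types; their data must be 16-byte
>>>>> // aligned for _mm_load_ps and _mm_stream_ps.
>>>>> void mul_const_mat_avx_otherMajor(std::vector<vec4>& result,
>>>>>                                   const std::vector<vec4>& vecs, const mat4& m)
>>>>> {
>>>>>     // Load the four groups of four floats that make up the matrix.
>>>>>     __m128 c0 = _mm_load_ps(&m.data[0]);
>>>>>     __m128 c1 = _mm_load_ps(&m.data[4]);
>>>>>     __m128 c2 = _mm_load_ps(&m.data[8]);
>>>>>     __m128 c3 = _mm_load_ps(&m.data[12]);
>>>>>
>>>>>     // _mm256_set_m128(hi, lo): c0 and c2 land in the upper halves,
>>>>>     // c1 and c3 in the lower halves.
>>>>>     __m256 c01 = _mm256_set_m128(c0, c1);
>>>>>     __m256 c23 = _mm256_set_m128(c2, c3);
>>>>>
>>>>>     // Per-half element selectors (arguments run from highest to lowest):
>>>>>     // index1 picks x in the upper half and y in the lower, index2 picks z and w.
>>>>>     const __m256i index1 = _mm256_set_epi32(0, 0, 0, 0, 1, 1, 1, 1);
>>>>>     const __m256i index2 = _mm256_set_epi32(2, 2, 2, 2, 3, 3, 3, 3);
>>>>>
>>>>>     for (size_t i = 0; i < vecs.size(); i++)
>>>>>     {
>>>>>         // Duplicate the input vector into both 128-bit halves.
>>>>>         __m128 vec128 = _mm_load_ps(&vecs[i].data[0]);
>>>>>         __m256 vec256 = _mm256_castps128_ps256(vec128);
>>>>>         vec256 = _mm256_insertf128_ps(vec256, vec128, 1);
>>>>>
>>>>>         // Broadcast x | y and z | w within each half.
>>>>>         __m256 xy = _mm256_permutevar_ps(vec256, index1);
>>>>>         __m256 zw = _mm256_permutevar_ps(vec256, index2);
>>>>>
>>>>>         __m256 res_a = _mm256_mul_ps(xy, c01); // x*c0 | y*c1
>>>>>         __m256 res_b = _mm256_mul_ps(zw, c23); // z*c2 | w*c3
>>>>>
>>>>>         __m256 res_c = _mm256_add_ps(res_a, res_b);
>>>>>
>>>>>         // Sum the two halves: x*c0 + y*c1 + z*c2 + w*c3.
>>>>>         __m128 res_upper = _mm256_extractf128_ps(res_c, 1);
>>>>>         __m128 res_lower = _mm256_extractf128_ps(res_c, 0);
>>>>>
>>>>>         __m128 res = _mm_add_ps(res_upper, res_lower);
>>>>>
>>>>>         // Non-temporal (streaming) store of the result.
>>>>>         _mm_stream_ps(&result[i].data[0], res);
>>>>>     }
>>>>> }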
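
For reference, a plain scalar version of what the intrinsics above appear to compute: each output is x*c0 + y*c1 + z*c2 + w*c3, where c0..c3 are the four consecutive groups of four floats in m.data. The vec4 and mat4 layouts below are assumptions inferred from the .data[...] accesses, and mul_const_mat_scalar is a made-up name. A sketch like this can serve as a correctness check for the SIMD paths.

```cpp
#include <cstddef>
#include <vector>

// Assumed layouts, inferred from the m.data[...] and vecs[i].data[...] accesses.
// (The SIMD version additionally needs 16-byte alignment; the scalar one does not.)
struct vec4 { float data[4]; };
struct mat4 { float data[16]; };

// Scalar reference: result[i] = x*c0 + y*c1 + z*c2 + w*c3,
// where ck is the k-th group of four consecutive floats in m.data.
void mul_const_mat_scalar(std::vector<vec4>& result,
                          const std::vector<vec4>& vecs, const mat4& m)
{
    for (std::size_t i = 0; i < vecs.size(); i++)
    {
        const vec4& v = vecs[i];
        for (int j = 0; j < 4; j++)
        {
            result[i].data[j] = v.data[0] * m.data[0 + j]
                              + v.data[1] * m.data[4 + j]
                              + v.data[2] * m.data[8 + j]
                              + v.data[3] * m.data[12 + j];
        }
    }
}
```

The intrinsics version adds the products in a different order, (x*c0 + z*c2) + (y*c1 + w*c3), so compare against it with a small tolerance rather than expecting bit-exact equality.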
>>>>>
>>>>>
>>>>>
>>>>> On Wednesday, March 25, 2020 at 2:48:17 PM UTC-5, Dmitry Babokin wrote:
>>>>>>
>>>>>> > Would have expected ispc to be significantly faster.
>>>>>>
>>>>>> When comparing to already vectorized code, it should be on par when 
>>>>>> targeting the same ISA. When switching from SSE2 to AVX2, I think it's 
>>>>>> reasonable to expect up to a 2x speedup, if memory is not a bottleneck.
>>>>>> Is that what you are observing?
>>>>>>
>>>>>> Dmitry.
>>>>>>
>>>>>> On Tue, Mar 24, 2020 at 10:26 AM Jeff Rous <[email protected]> wrote:
>>>>>>
>>>>>>> Another thing you can try is manually unrolling to get better 
>>>>>>> instructions-per-clock throughput. See here: 
>>>>>>> https://ispc.godbolt.org/z/vcgYE8
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> On Tuesday, March 24, 2020 at 12:53:43 AM UTC-7, David Nadaski wrote:
>>>>>>>>
>>>>>>>> Hey Jeff that was a crazy fast follow up, thank you so much!
>>>>>>>>
>>>>>>>> I've benchmarked the ISPC against the handmade SSE2 and, 
>>>>>>>> surprisingly, their performance is very similar, almost identical. 
>>>>>>>> I've compiled ISPC with avx2-i32x16. Would have expected ispc to be 
>>>>>>>> significantly faster.
>>>>>>>> https://horugame.com/wp-content/uploads/2020/03/ISPCTest.zip
>>>>>>>> This is a benchmark project where I've set both of the methods up to 
>>>>>>>> run on the same dataset.
>>>>>>>> Run build_ispc.bat to build the ISPC objs, then open the sln in 
>>>>>>>> Visual Studio (2019) and run Release/x64.
>>>>>>>>
>>>>>>>> Again thanks for your time, this is amazing.
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Tuesday, March 24, 2020 at 1:33:02 AM UTC-5, Jeff Rous wrote:
>>>>>>>>>
>>>>>>>>> Hey David, give this a shot. What this does is load 
>>>>>>>>> programCount / 4 float4s into your SIMD registers (1 for i32x4, 2 for 
>>>>>>>>> i32x8 and 4 for i32x16). To handle a count that isn't divisible by 
>>>>>>>>> that number, I left your original algorithm in to do single 
>>>>>>>>> iterations. Think of this like a foreach using 4-wide vectors instead 
>>>>>>>>> of 1-wide floats. It should scale up to AVX-512 automatically.
>>>>>>>>>
>>>>>>>>> Hopefully no bugs!
>>>>>>>>>
>>>>>>>>> https://ispc.godbolt.org/z/NVtkv4 
>>>>>>>>> <https://ispc.godbolt.org/z/Jrk5_Q>
>>>>>>>>>
>>>>>>>>> Jeff
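
The loop structure Jeff describes (a wide main loop plus single iterations for the leftover count) can be sketched in plain C++ as below. This is not the code behind the godbolt link: Vec4, process_one and process_all are hypothetical names, kLanes stands in for ISPC's programCount, and the per-element operation (add 1 to x) is just an example.

```cpp
#include <cstddef>

// Hypothetical stand-in for the project's 4-float vector type.
struct Vec4 { float data[4]; };

// Example per-element operation (an assumption): add 1 to the x component.
inline Vec4 process_one(const Vec4& v)
{
    Vec4 r = v;
    r.data[0] += 1.0f;
    return r;
}

// kLanes plays the role of ISPC's programCount (4, 8 or 16 float lanes),
// so each main-loop iteration covers kLanes / 4 whole Vec4s.
template <std::size_t kLanes>
void process_all(Vec4* out, const Vec4* in, std::size_t count)
{
    static_assert(kLanes >= 4 && kLanes % 4 == 0, "lanes must hold whole Vec4s");
    constexpr std::size_t kVecsPerIter = kLanes / 4;

    // Main loop: whole groups of kVecsPerIter vectors. In real SIMD code the
    // inner loop would be a single wide load, compute and store.
    std::size_t i = 0;
    for (; i + kVecsPerIter <= count; i += kVecsPerIter)
        for (std::size_t k = 0; k < kVecsPerIter; ++k)
            out[i + k] = process_one(in[i + k]);

    // Remainder loop: the last count % kVecsPerIter vectors, one at a time.
    for (; i < count; ++i)
        out[i] = process_one(in[i]);
}
```

For example, process_all<16>(out, in, 7) handles four vectors in one main-loop iteration and the remaining three in the remainder loop.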
>>>>>>>>>
>>>>>>>>> On Monday, March 23, 2020 at 3:43:41 PM UTC-7, David Nadaski wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks Pete, that's great!
>>>>>>>>>>
>>>>>>>>>> Trying to understand the potential optimizations to be made to 
>>>>>>>>>> make this work w/ AVX2 and in the future AVX-512: when you say I'd 
>>>>>>>>>> have to rework the math or pack 2 vectors into an mm256, would that 
>>>>>>>>>> be something specific to AVX2 or would it automatically scale to 
>>>>>>>>>> AVX-512 as well?
>>>>>>>>>> In other words, will I potentially have to rework my ISPC code 
>>>>>>>>>> once AVX-512 becomes more widely adopted or can I write something 
>>>>>>>>>> now that will scale automatically thanks to ispc?
>>>>>>>>>> I can handcraft both AVX2 and AVX-512 implementations but was 
>>>>>>>>>> hoping ISPC would sort of do that for me. I understand that dealing 
>>>>>>>>>> with vectors and scalars is different.
>>>>>>>>>>
>>>>>>>>>> Thanks, David
>>>>>>>>>>
>>>>>>>>>> On Monday, March 23, 2020 at 4:51:39 PM UTC-5, Pete Brubaker 
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> No problem, let us know if we can help further.  I mentioned 
>>>>>>>>>>> this to Jeff and he's going to link some examples.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 23, 2020 at 2:25 PM David Nadaski <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks so much Pete, this does seem to be the fastest version 
>>>>>>>>>>>> yet, rivaling the manually vectorized SSE2 code.
>>>>>>>>>>>> I will try and get ispc to pack the vectors into YMM and see 
>>>>>>>>>>>> how it fares. Thanks again.
>>>>>>>>>>>>
>>>>>>>>>>>> On Monday, March 23, 2020 at 3:18:25 PM UTC-5, Pete Brubaker 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In looking over your code, unless you can reorder the data to 
>>>>>>>>>>>>> SoA, your best strategy is to use the following method. This is 
>>>>>>>>>>>>> what Jeff does in Unreal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://ispc.godbolt.org/z/UdWjH-
>>>>>>>>>>>>>
>>>>>>>>>>>>> The compiler knows you're working on <4> element vectors and 
>>>>>>>>>>>>> will only use XMM registers. To get to YMM registers you'd have 
>>>>>>>>>>>>> to rework the math or do two 4-component vectors at the same time, 
>>>>>>>>>>>>> packed in a single 8-wide varying float.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pete
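
To illustrate the AoS versus SoA distinction Pete mentions, here is a small C++ sketch under assumed names (Vec4, Vec4SoA, aos_to_soa and add_one_to_x are all illustrative, not taken from the linked code):

```cpp
#include <cstddef>
#include <vector>

// AoS: one struct per vector; x, y, z, w of the same vector are adjacent.
struct Vec4 { float x, y, z, w; };

// SoA: one array per component; the x values of consecutive vectors are adjacent.
struct Vec4SoA { std::vector<float> x, y, z, w; };

// Copies an AoS array into four per-component arrays.
Vec4SoA aos_to_soa(const std::vector<Vec4>& aos)
{
    Vec4SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.w.reserve(aos.size());
    for (const Vec4& v : aos)
    {
        soa.x.push_back(v.x);
        soa.y.push_back(v.y);
        soa.z.push_back(v.z);
        soa.w.push_back(v.w);
    }
    return soa;
}

// With SoA, lane i of a SIMD register maps to element i of a contiguous array,
// so a compiler can use plain vector loads and stores here. The same loop over
// AoS data would touch every fourth float, which is what leads to gathers and
// scatters.
void add_one_to_x(Vec4SoA& soa)
{
    for (std::size_t i = 0; i < soa.x.size(); ++i)
        soa.x[i] += 1.0f;
}
```

The conversion itself costs a pass over the data, so it tends to pay off only when the data can stay in SoA form across several kernels.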
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 23, 2020 at 1:05 AM David Nadaski <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you, much appreciated!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Monday, March 23, 2020 at 2:25:13 AM UTC-5, Dmitry Babokin 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current code uses only "uniform" data and operations, so, 
>>>>>>>>>>>>>>> as expected, it uses only xmm registers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've asked folks, who developed similar code, to have a look 
>>>>>>>>>>>>>>> and comment on the best strategies to express your code in ISPC.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dmitry.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Mar 22, 2020 at 2:42 PM David Nadaski <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've also made an implementation that's using aos_to_soa4 / 
>>>>>>>>>>>>>>>> soa_to_aos4 which does use YMM but it's even slower than the 
>>>>>>>>>>>>>>>> other two:
>>>>>>>>>>>>>>>> https://ispc.godbolt.org/z/zNbbU2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is what I'm ultimately trying to implement in ispc so 
>>>>>>>>>>>>>>>> that it can benefit from ispc's automatic AVXization etc:
>>>>>>>>>>>>>>>> https://godbolt.org/z/DwaTuW
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sunday, March 22, 2020 at 4:56:37 AM UTC-5, David 
>>>>>>>>>>>>>>>> Nadaski wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Dmitry, here it is: https://ispc.godbolt.org/z/8omkHp
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Basically I'm looking for the most optimized way of taking 
>>>>>>>>>>>>>>>>> in an array of Vector4 from C++ in AoS form and doing 
>>>>>>>>>>>>>>>>> calculations on them in ispc.
>>>>>>>>>>>>>>>>> If I use foreach, ispc complains about the stores and loads 
>>>>>>>>>>>>>>>>> (gather/scatter performance warnings) and the generated code 
>>>>>>>>>>>>>>>>> is slower, but it has YMM.
>>>>>>>>>>>>>>>>> If I use the code above (on godbolt, simple for), the code 
>>>>>>>>>>>>>>>>> is faster than the foreach version but lacks YMM, so I'm 
>>>>>>>>>>>>>>>>> guessing it could be made more performant.
>>>>>>>>>>>>>>>>> Thank you for your help.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sunday, March 22, 2020 at 3:51:36 AM UTC-5, Dmitry 
>>>>>>>>>>>>>>>>> Babokin wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> David,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Typically if you target avx2 (using --target=avx2-i32x8), 
>>>>>>>>>>>>>>>>>> the code should use ymm registers. If you use 
>>>>>>>>>>>>>>>>>> --target=avx2-i32x4, then your data is 128 bits wide (for int 
>>>>>>>>>>>>>>>>>> and float vectors), which means that xmm registers will be used.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another possibility is that you are using uniform float<4>, 
>>>>>>>>>>>>>>>>>> which again means that you are operating on 128-bit vectors.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you are already using the avx2-i32x8 target, can you share 
>>>>>>>>>>>>>>>>>> the code via a Compiler Explorer link, so I can see both the 
>>>>>>>>>>>>>>>>>> code and the compilation flags?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Dmitry.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Mar 21, 2020 at 4:17 PM David Nadaski <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you Oleh!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Upon checking the generated assembly, I'm noticing that 
>>>>>>>>>>>>>>>>>>> it's executing against AVX2 but not using YMM registers at 
>>>>>>>>>>>>>>>>>>> all. Would you know why?
>>>>>>>>>>>>>>>>>>> I've compiled it using -O2.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> *AddVec4sProper_avx2:*
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167E0  test        r8d,r8d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167E3  jle         AddVec4sProper_avx2+0B9h (07FF6F9E16899h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167E9  lea         eax,[r8-1]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167ED  mov         r9d,r8d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167F0  and         r9d,3
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167F4  cmp         eax,3
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167F7  jae         AddVec4sProper_avx2+26h (07FF6F9E16806h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167F9  xor         r10d,r10d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167FC  test        r9d,r9d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E167FF  jne         AddVec4sProper_avx2+90h (07FF6F9E16870h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16801  jmp         AddVec4sProper_avx2+0B9h (07FF6F9E16899h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16806  sub         r8d,r9d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16809  mov         eax,30h
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1680E  xor         r10d,r10d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16811  vmovss      xmm0,dword ptr [__real@3f800000 (07FF6FA0EAF20h)]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16819  nop         dword ptr [rax]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16820  vmovaps     xmm1,xmmword ptr [rdx+rax-30h]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16826  vaddss      xmm1,xmm1,xmm0
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1682A  vmovaps     xmmword ptr [rcx+rax-30h],xmm1
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16830  vmovaps     xmm1,xmmword ptr [rdx+rax-20h]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16836  vaddss      xmm1,xmm1,xmm0
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1683A  vmovaps     xmmword ptr [rcx+rax-20h],xmm1
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16840  vmovaps     xmm1,xmmword ptr [rdx+rax-10h]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16846  vaddss      xmm1,xmm1,xmm0
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1684A  vmovaps     xmmword ptr [rcx+rax-10h],xmm1
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16850  vmovaps     xmm1,xmmword ptr [rdx+rax]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16855  vaddss      xmm1,xmm1,xmm0
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16859  vmovaps     xmmword ptr [rcx+rax],xmm1
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1685E  add         r10,4
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16862  add         rax,40h
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16866  cmp         r8d,r10d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16869  jne         AddVec4sProper_avx2+40h (07FF6F9E16820h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1686B  test        r9d,r9d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1686E  je          AddVec4sProper_avx2+0B9h (07FF6F9E16899h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16870  shl         r10,4
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16874  neg         r9d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16877  vmovss      xmm0,dword ptr [__real@3f800000 (07FF6FA0EAF20h)]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1687F  nop
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16880  vmovaps     xmm1,xmmword ptr [rdx+r10]
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16886  vaddss      xmm1,xmm1,xmm0
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1688A  vmovaps     xmmword ptr [rcx+r10],xmm1
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16890  add         r10,10h
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16894  inc         r9d
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16897  jne         AddVec4sProper_avx2+0A0h (07FF6F9E16880h)
>>>>>>>>>>>>>>>>>>> 00007FF6F9E16899  ret
>>>>>>>>>>>>>>>>>>> 00007FF6F9E1689A  nop         word ptr [rax+rax]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Saturday, March 21, 2020 at 12:21:43 PM UTC-5, Oleh 
>>>>>>>>>>>>>>>>>>> Nechaev wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> struct Vector4SOA
>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>     float<4> V;
>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> export void Test(uniform Vector4SOA outs[], uniform Vector4SOA ins[], uniform int count)
>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>     for (uniform int i = 0; i < count; ++i)
>>>>>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>>>>>         uniform Vector4SOA vv = ins[i];
>>>>>>>>>>>>>>>>>>>>         vv.V.x++; // builtin access by x y z w and r g b a
>>>>>>>>>>>>>>>>>>>>         outs[i] = vv;
>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>> You received this message because you are subscribed to 
>>>>>>>>>>>>>>>>>>> the Google Groups "Intel SPMD Program Compiler Users" group.
>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails 
>>>>>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/ispc-users/cf0af12b-d84a-463c-be09-c8b35b3baf81%40googlegroups.com
>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/ispc-users/cf0af12b-d84a-463c-be09-c8b35b3baf81%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>
