On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener <richard.guent...@gmail.com> wrote:
> On Sat, May 3, 2014 at 2:39 AM, Cong Hou <co...@google.com> wrote:
>> On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
>>> On Thu, 24 Apr 2014, Cong Hou wrote:
>>>
>>>> Given the following loop:
>>>>
>>>>   int a[N];
>>>>   short b[N*2];
>>>>
>>>>   for (int i = 0; i < N; ++i)
>>>>     a[i] = b[i*2];
>>>>
>>>> After being vectorized, the access to b[i*2] will be compiled into
>>>> several packing statements, while the type promotion from short to int
>>>> will be compiled into several unpacking statements. With this patch,
>>>> each pair of pack/unpack statements will be replaced by less expensive
>>>> statements (with shift or bit-and operations).
>>>>
>>>> On x86_64, the loop above will be compiled into the following assembly
>>>> (with -O2 -ftree-vectorize):
>>>>
>>>>   movdqu 0x10(%rcx),%xmm3
>>>>   movdqu -0x20(%rcx),%xmm0
>>>>   movdqa %xmm0,%xmm2
>>>>   punpcklwd %xmm3,%xmm0
>>>>   punpckhwd %xmm3,%xmm2
>>>>   movdqa %xmm0,%xmm3
>>>>   punpcklwd %xmm2,%xmm0
>>>>   punpckhwd %xmm2,%xmm3
>>>>   movdqa %xmm1,%xmm2
>>>>   punpcklwd %xmm3,%xmm0
>>>>   pcmpgtw %xmm0,%xmm2
>>>>   movdqa %xmm0,%xmm3
>>>>   punpckhwd %xmm2,%xmm0
>>>>   punpcklwd %xmm2,%xmm3
>>>>   movups %xmm0,-0x10(%rdx)
>>>>   movups %xmm3,-0x20(%rdx)
>>>>
>>>> With this patch, the generated assembly is shown below:
>>>>
>>>>   movdqu 0x10(%rcx),%xmm0
>>>>   movdqu -0x20(%rcx),%xmm1
>>>>   pslld $0x10,%xmm0
>>>>   psrad $0x10,%xmm0
>>>>   pslld $0x10,%xmm1
>>>>   movups %xmm0,-0x10(%rdx)
>>>>   psrad $0x10,%xmm1
>>>>   movups %xmm1,-0x20(%rdx)
>>>>
>>>> Bootstrapped and tested on x86-64. OK for trunk?
>>>
>>> This is an odd place to implement such a transform. Also, whether it
>>> is faster or not depends on the exact ISA you target - for
>>> example, ppc has constraints on the maximum number of shifts
>>> carried out in parallel, and the above has 4 in very short
>>> succession, esp. on the sign-extend path.
>>
>> Thank you for the information about ppc. If this is an issue, I think
>> we can do it in a target-dependent way.
>>
>>> So this looks more like an opportunity for a post-vectorizer
>>> transform on RTL, or for the vectorizer special-casing
>>> widening loads with a vectorizer pattern.
>>
>> I am not sure whether the RTL transform is more difficult to implement. I
>> prefer the widening-load method, which can be detected in a pattern
>> recognizer. The target-related issue can be resolved by only
>> expanding the widening load on those targets where this pattern is
>> beneficial, but this requires new tree operations to be defined. What
>> is your suggestion?
>>
>> I apologize for the delayed reply.
>
> Likewise ;)
>
> I suggest implementing this optimization in vector lowering in
> tree-vect-generic.c. For your example, this pass sees
>
>   vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
>   vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
>   vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
>   vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
>   vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;
>
> where you can apply the pattern matching and transform (after checking
> with the target, of course).
This sounds good to me! I'll try to make a patch following your suggestion. Thank you!

Cong

> Richard.
>
>>
>> thanks,
>> Cong
>>
>>>
>>> Richard.
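
[Editorial note: for readers following the thread, below is a minimal
standalone sketch of the transform under discussion, written with SSE2
intrinsics. It is an illustration of the idea only, not code from the
patch: the function name widen_even_shorts is made up for the example,
and it assumes a little-endian target, where each even 16-bit element
sits in the low half of a 32-bit lane.]

  /* Illustration only: a[i] = b[2*i] with the even 16-bit elements
     sign-extended via shifts, mirroring the pslld/psrad sequence in
     the patched assembly.  Loading 8 shorts b[2i] .. b[2i+7] as one
     128-bit vector puts each wanted even element in the low half of
     a 32-bit lane; shifting each lane left by 16 and then
     arithmetically right by 16 sign-extends that element, with no
     punpck/pcmpgt unpacking needed.  */

  #include <emmintrin.h>

  static void
  widen_even_shorts (int *a, const short *b, int n)
  {
    int i = 0;
    for (; i + 4 <= n; i += 4)
      {
        /* One vector covers b[2i] .. b[2i+7].  */
        __m128i v = _mm_loadu_si128 ((const __m128i *) (b + 2 * i));
        v = _mm_slli_epi32 (v, 16);   /* pslld $0x10 */
        v = _mm_srai_epi32 (v, 16);   /* psrad $0x10 */
        _mm_storeu_si128 ((__m128i *) (a + i), v);
      }
    for (; i < n; ++i)                /* scalar tail */
      a[i] = b[2 * i];
  }

[For the unsigned (zero-extend) case, which the patch description
covers with bit-and operations, the two shifts would presumably be
replaced by a single mask such as
_mm_and_si128 (v, _mm_set1_epi32 (0xffff)).]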