On Wed, Dec 12, 2012 at 9:06 AM, Christophe Lyon
<christophe.l...@linaro.org> wrote:
> On 11 December 2012 13:26, Tim Prince <n...@aol.com> wrote:
>> On 12/11/2012 5:14 AM, Richard Earnshaw wrote:
>>>
>>> On 11/12/12 09:56, Richard Biener wrote:
>>>>
>>>> On Tue, Dec 11, 2012 at 10:48 AM, Richard Earnshaw
>>>> <rearn...@arm.com> wrote:
>>>>>
>>>>> On 11/12/12 09:45, Richard Biener wrote:
>>>>>>
>>>>>> On Mon, Dec 10, 2012 at 10:07 PM, Andi Kleen
>>>>>> <a...@firstfloor.org> wrote:
>>>>>>>
>>>>>>> Jan Hubicka <hubi...@ucw.cz> writes:
>>>>>>>
>>>>>>>> Note that I think Core has similar characteristics - at least
>>>>>>>> for string operations it fares well with unaligned accesses.
>>>>>>>
>>>>>>> Nehalem and later have very fast unaligned vector loads. There's
>>>>>>> still some penalty when they cross cache lines, however.
>>>>>>>
>>>>>>> IIRC the rule of thumb is to use unaligned accesses for 128-bit
>>>>>>> vectors, but to avoid them for 256-bit vectors, because the
>>>>>>> cache-line-crossing penalty is larger on Sandy Bridge and more
>>>>>>> likely with the larger vectors.
>>>>>>
>>>>>> Yes, I think the rule was that using the unaligned instruction
>>>>>> variants carries no penalty when the actual access is aligned,
>>>>>> but that aligned accesses are still faster than unaligned
>>>>>> accesses. Thus peeling for alignment _is_ a win. I also seem to
>>>>>> remember that the story for unaligned stores vs. unaligned loads
>>>>>> is usually different.
>>>>>
>>>>> Yes, it's generally the case that unaligned loads are slightly
>>>>> more expensive than unaligned stores, since stores can often
>>>>> merge in a store buffer with little or no penalty.
>>>>
>>>> It was the other way around on AMD CPUs, AFAIK - unaligned stores
>>>> forced flushes of the store buffers, which is why the vectorizer
>>>> first and foremost tries to align stores.
>>>>
>>>
>>> In which case, which to align should be a question that the ME asks
>>> the BE.
>>>
>>> R.
>>>
>> I see that this thread is no longer about ARM.
>> Yes, when peeling for alignment, aligned stores should take
>> precedence over aligned loads.
>> "Ivy Bridge" corei7-3 is supposed to have corrected the situation on
>> "Sandy Bridge" corei7-2, where an unaligned 256-bit load is more
>> expensive than explicitly split (128-bit) loads. There aren't yet
>> any production multi-socket corei7-3 platforms.
>> It seems difficult to make the best decision between 128-bit
>> unaligned access without peeling and 256-bit access with peeling
>> for alignment (unless the loop count is known to be too small for
>> the latter to come up to speed). Facilities afforded by various
>> compilers to allow the programmer to guide this choice are rather
>> strange and probably not to be counted on.
>> In my experience, "Westmere" unaligned 128-bit loads are more
>> expensive than explicitly split (64-bit) loads, but the architecture
>> manuals disagree with this finding. gcc already does a good job for
>> corei7[-1] in such situations.
>>
>> --
>> Tim Prince
>>
>
> Since this thread is also about x86 now, I have tried to look at how
> things are implemented on this target.
> People have mentioned Nehalem, Sandy Bridge, Ivy Bridge and Westmere;
> I have searched for occurrences of these strings in GCC, and I
> couldn't find anything that would imply a different behavior wrt
> unaligned loads on 128/256-bit vectors. Is it still unimplemented?
>
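For concreteness, this is the kind of loop the peeling discussion above
is about; a minimal sketch, not from the thread, assuming x86-64 and
something like -O3 -mavx:

  /* GCC's vectorizer can peel scalar iterations off the front of this
     loop until the store to a[] is 32-byte aligned, then vectorize the
     remainder with aligned 256-bit stores; the loads from b[] and c[]
     may still be unaligned, which is where the load-vs.-store cost
     trade-off discussed above comes in.  */
  void
  add_arrays (float *restrict a, const float *restrict b,
              const float *restrict c, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      a[i] = b[i] + c[i];
  }
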
i386.c has

  {
    /* When not optimize for size, enable vzeroupper optimization for
       TARGET_AVX with -fexpensive-optimizations and split 32-byte
       AVX unaligned load/store.  */
    if (!optimize_size)
      {
        if (flag_expensive_optimizations
            && !(target_flags_explicit & MASK_VZEROUPPER))
          target_flags |= MASK_VZEROUPPER;
        if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
            && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
          target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
        if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
            && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
          target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
        /* Enable 128-bit AVX instruction generation
           for the auto-vectorizer.  */
        if (TARGET_AVX128_OPTIMAL
            && !(target_flags_explicit & MASK_PREFER_AVX128))
          target_flags |= MASK_PREFER_AVX128;
      }
  }

--
H.J.
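Note that each of those masks is tied to a user-visible option, so the
tuning defaults above can be overridden by hand; a sketch, assuming a
GCC of this era (corei7-avx being the Sandy Bridge -mtune name at the
time):

  # Split 256-bit unaligned loads/stores into 128-bit halves,
  # regardless of what the -mtune default would pick:
  gcc -O3 -mavx -mtune=corei7-avx \
      -mavx256-split-unaligned-load -mavx256-split-unaligned-store foo.c

  # Or sidestep the issue by keeping the auto-vectorizer at 128 bits:
  gcc -O3 -mavx -mprefer-avx128 foo.c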