On Fri, Jan 3, 2014 at 1:27 PM, Uros Bizjak <ubiz...@gmail.com> wrote:
>>> I am testing a patch that removes the "max_align" part from
>>> ix86_data_alignment.
>>
>> That looks like unnecessary pessimization.  Note that the hunk in question
>> is guarded with opt, which means it is an optimization rather than an ABI
>> issue; it can increase alignment, but the compiler can only assume the
>> increased alignment if the symbol is not public, or if it is public but
>> can't be preempted by another TU's definition.  Even in that case it can
>> be worthwhile to increase the alignment, say if doing versioning for
>> alignment, or just doing some AVX/AVX2/AVX512F version of memcpy/memset
>> that can handle it faster if it is sufficiently aligned, by testing it at
>> runtime.
>>
>> So, IMHO we shouldn't drop it, just improve it.
>> Perhaps:
>>
>>   int max_align = optimize_size ? BITS_PER_WORD
>>                                 : MIN (512, MAX_OFILE_ALIGNMENT);
>>
>>   if (opt
>>       && AGGREGATE_TYPE_P (type)
>>       && TYPE_SIZE (type)
>>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>>       && align < max_align)
>>     {
>>       int this_align = 256;
>>       for (this_align = 256; this_align <= max_align; this_align *= 2)
>>         if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
>>             && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>>           break;
>>         else if (align < this_align)
>>           align = this_align;
>>     }
>>
>> which will handle both the 256-bit alignment for >= 256-bit objects and
>> the 512-bit alignment for >= 512-bit objects, and will be prepared for
>> the future.
>>
>> 128 bit I think doesn't need to be handled; DATA_ALIGNMENT has been
>> using the 256-bit test already since its introduction in 1998:
>> http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html
>
> Thanks for the pointer, there is indeed a recommendation in the
> optimization manual [1], section 3.6.4, where it is said:
>
> --quote--
> Misaligned data access can incur significant performance penalties.
> This is particularly true for cache line splits.  The size of a cache
> line is 64 bytes in the Pentium 4 and other recent Intel processors,
> including processors based on Intel Core microarchitecture.
> An access to data unaligned on 64-byte boundary leads to two memory
> accesses and requires several μops to be executed (instead of one).
> Accesses that span 64-byte boundaries are likely to incur a large
> performance penalty, the cost of each stall generally are greater on
> machines with longer pipelines.
>
> ...
>
> A 64-byte or greater data structure or array should be aligned so that
> its base address is a multiple of 64.  Sorting data in decreasing size
> order is one heuristic for assisting with natural alignment.  As long
> as 16-byte boundaries (and cache lines) are never crossed, natural
> alignment is not strictly necessary (though it is an easy way to
> enforce this).
> --/quote--
>
> So, this part has nothing to do with AVX512, but with cache line
> width.  And we do have a --param "l1-cache-line-size=64", detected
> with -march=native, that could come in handy here.
>
> This part should be rewritten (and commented) with the information
> above in mind.

Like in the patch below.  Please note that the block_tune setting for
nocona is wrong; -march=native on my trusted old P4 returns:

--param "l1-cache-size=16" --param "l1-cache-line-size=64"
--param "l2-cache-size=2048" "-mtune=nocona"

which is consistent with the above quote from the manual.

2014-01-02  Uros Bizjak  <ubiz...@gmail.com>

	* config/i386/i386.c (ix86_data_alignment): Calculate max_align
	from prefetch_block tune setting.
	(nocona_cost): Correct size of prefetch block to 64.
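For reference, here is a minimal standalone sketch (not GCC code) of the
heuristic discussed above: max_align is derived from the cache line
(prefetch block) width, and the alignment of a sufficiently large
aggregate is bumped in power-of-two steps starting at 256 bits.  The
tree accessors are replaced by a plain size in bits, and the 64-byte
cache line is an assumption of this example only; in the patch it comes
from the active tuning:

  /* Illustration only, not GCC code.  */
  static int
  sketch_data_alignment (unsigned long size_in_bits, int align, int opt)
  {
    /* Cache line width in bits, but never below the word size
       (64 stands in for BITS_PER_WORD here).  */
    int max_align = 64 * 8;
    if (max_align < 64)
      max_align = 64;

    if (opt && align < max_align)
      {
	int this_align;

	/* Bump the alignment to each power of two, starting at 256 bits,
	   that the object's size reaches, capped at max_align.  */
	for (this_align = 256; this_align <= max_align; this_align *= 2)
	  if (size_in_bits < (unsigned long) this_align)
	    break;
	  else if (align < this_align)
	    align = this_align;
      }

    return align;
  }

With a 64-byte cache line this gives, e.g., 512-bit (64-byte) alignment
for a 128-byte aggregate and leaves a 16-byte one untouched.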
The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
regression testing.  If there are no comments, I will commit it to
mainline and release branches after a couple of days.

Uros.
Index: i386.c
===================================================================
--- i386.c	(revision 206311)
+++ i386.c	(working copy)
@@ -1568,7 +1568,7 @@ struct processor_costs nocona_cost = {
   8,					/* MMX or SSE register to integer */
   8,					/* size of l1 cache.  */
   1024,					/* size of l2 cache.  */
-  128,					/* size of prefetch block */
+  64,					/* size of prefetch block */
   8,					/* number of parallel prefetches */
   1,					/* Branch cost */
   COSTS_N_INSNS (6),			/* cost of FADD and FSUB insns.  */
@@ -26465,9 +26465,21 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD
-				: MIN (512, MAX_OFILE_ALIGNMENT);
+  /* Misaligned data access can incur significant performance penalties.
+     This is particularly true for cache line splits.  The size of a cache
+     line is 64 bytes in the Pentium 4 and other recent Intel processors,
+     including processors based on Intel Core microarchitecture.
+     An access to data unaligned on 64-byte boundary leads to two memory
+     accesses and requires several μops to be executed (instead of one).
+     A 64-byte or greater data structure or array should be aligned so that
+     its base address is a multiple of 64.  */
+  int max_align
+    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);
+
+  if (max_align < BITS_PER_WORD)
+    max_align = BITS_PER_WORD;
 
   if (opt
       && AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
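As a quick way to observe the effect (not part of the patch), something
like the following can be compiled with -O2 and the emitted alignment
directives inspected.  The file and variable names are only illustrative,
and the resulting alignment depends on the selected tuning's prefetch
block; with a 64-byte prefetch block the 128-byte aggregate is expected
to land on a 64-byte boundary, while the 16-byte one is not forced there:

  /* align-check.c -- illustration only.
     Build and inspect with, e.g.:
       gcc -O2 -S align-check.c && grep -B2 buf align-check.s  */

  static char small_buf[16] = { 1 };	/* below the cache-line threshold */
  static char big_buf[128] = { 1 };	/* expected .align 64 at -O2 */

  const char *
  get_small (void)
  {
    return small_buf;
  }

  const char *
  get_big (void)
  {
    return big_buf;
  }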