On Fri, Jan 3, 2014 at 1:27 PM, Uros Bizjak <ubiz...@gmail.com> wrote:
>>> I am testing a patch that removes the "max_align" part from
>>> ix86_data_alignment.
>>
>> That looks like unnecessary pessimization.  Note that the hunk in question
>> is guarded with opt, which means it is an optimization rather than an ABI
>> issue; it can increase alignment, but the compiler can only assume the
>> increased alignment if the symbol is not public, or if it is public but
>> can't be preempted by another TU's definition.  Even in that case it can
>> be worthwhile to increase the alignment, say if doing versioning for
>> alignment, or just doing some AVX/AVX2/AVX512F version of memcpy/memset
>> that can handle it faster if it is sufficiently aligned, by testing it at
>> runtime.
>>
>> So, IMHO we shouldn't drop it, just improve it.
>> Perhaps:
>>
>>   int max_align = optimize_size ? BITS_PER_WORD
>>                                 : MIN (512, MAX_OFILE_ALIGNMENT);
>>
>>   if (opt
>>       && AGGREGATE_TYPE_P (type)
>>       && TYPE_SIZE (type)
>>       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
>>       && align < max_align)
>>     {
>>       int this_align = 256;
>>       for (this_align = 256; this_align <= max_align; this_align *= 2)
>>         if (TREE_INT_CST_LOW (TYPE_SIZE (type)) < (unsigned) this_align
>>             && !TREE_INT_CST_HIGH (TYPE_SIZE (type)))
>>           break;
>>         else if (align < this_align)
>>           align = this_align;
>>     }
>>
>> which will handle both the 256-bit alignment for >= 256-bit objects and
>> the 512-bit alignment for >= 512-bit objects, and will be prepared for
>> the future.
>>
>> 128 bit I think doesn't need to be handled; DATA_ALIGNMENT has been
>> using the 256-bit test already since its introduction in 1998:
>> http://gcc.gnu.org/ml/gcc-bugs/1998-03/msg00011.html
>
> Thanks for the pointer, there is indeed a recommendation in the
> optimization manual [1], section 3.6.4, where it is said:
>
> --quote--
> Misaligned data access can incur significant performance penalties.
> This is particularly true for cache line splits.  The size of a cache
> line is 64 bytes in the Pentium 4 and other recent Intel processors,
> including processors based on Intel Core microarchitecture.
> An access to data unaligned on 64-byte boundary leads to two memory
> accesses and requires several μops to be executed (instead of one).
> Accesses that span 64-byte boundaries are likely to incur a large
> performance penalty, the cost of each stall generally are greater on
> machines with longer pipelines.
>
> ...
>
> A 64-byte or greater data structure or array should be aligned so that
> its base address is a multiple of 64.  Sorting data in decreasing size
> order is one heuristic for assisting with natural alignment.  As long
> as 16-byte boundaries (and cache lines) are never crossed, natural
> alignment is not strictly necessary (though it is an easy way to
> enforce this).
> --/quote--
>
> So, this part has nothing to do with AVX512, but with cache line
> width.  And we do have a --param "l1-cache-line-size=64", detected
> with -march=native, that could come in handy here.
>
> This part should be rewritten (and commented) with the information
> above in mind.

Like in the patch below.  Please note that the block_tune setting for
nocona is wrong; -march=native on my trusted old P4 returns:

--param "l1-cache-size=16" --param "l1-cache-line-size=64"
--param "l2-cache-size=2048" "-mtune=nocona"

which is consistent with the above quote from the manual.

2014-01-02  Uros Bizjak  <ubiz...@gmail.com>

	* config/i386/i386.c (ix86_data_alignment): Calculate max_align
	from prefetch_block tune setting.
	(nocona_cost): Correct size of prefetch block to 64.
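For reference, here is a minimal standalone sketch (not GCC code) of the
heuristic discussed above: max_align is derived from the cache line
(prefetch block) width, and the alignment of a sufficiently large
aggregate is bumped in power-of-two steps starting at 256 bits.  The
tree accessors are replaced by a plain size in bits, and the 64-byte
cache line is an assumption of this example only; in the patch it comes
from the active tuning:

  /* Illustration only, not GCC code.  */
  static int
  sketch_data_alignment (unsigned long size_in_bits, int align, int opt)
  {
    /* Cache line width in bits, but never below the word size
       (64 stands in for BITS_PER_WORD here).  */
    int max_align = 64 * 8;
    if (max_align < 64)
      max_align = 64;

    if (opt && align < max_align)
      {
	int this_align;

	/* Bump the alignment to each power of two, starting at 256 bits,
	   that the object's size reaches, capped at max_align.  */
	for (this_align = 256; this_align <= max_align; this_align *= 2)
	  if (size_in_bits < (unsigned long) this_align)
	    break;
	  else if (align < this_align)
	    align = this_align;
      }

    return align;
  }

With a 64-byte cache line this gives, e.g., 512-bit (64-byte) alignment
for a 128-byte aggregate and leaves a 16-byte one untouched.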
The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
regression testing.  If there are no comments, I will commit it to
mainline and release branches after a couple of days.

Uros.
Index: i386.c
===================================================================
--- i386.c	(revision 206311)
+++ i386.c	(working copy)
@@ -1568,7 +1568,7 @@ struct processor_costs nocona_cost = {
   8,					/* MMX or SSE register to integer */
   8,					/* size of l1 cache.  */
   1024,					/* size of l2 cache.  */
-  128,					/* size of prefetch block */
+  64,					/* size of prefetch block */
   8,					/* number of parallel prefetches */
   1,					/* Branch cost */
   COSTS_N_INSNS (6),			/* cost of FADD and FSUB insns.  */
@@ -26465,9 +26465,21 @@ ix86_constant_alignment (tree exp, int align)
 int
 ix86_data_alignment (tree type, int align, bool opt)
 {
-  int max_align = optimize_size ? BITS_PER_WORD
-				: MIN (512, MAX_OFILE_ALIGNMENT);
+  /* Misaligned data access can incur significant performance penalties.
+     This is particularly true for cache line splits.  The size of a cache
+     line is 64 bytes in the Pentium 4 and other recent Intel processors,
+     including processors based on Intel Core microarchitecture.
+     An access to data unaligned on 64-byte boundary leads to two memory
+     accesses and requires several μops to be executed (instead of one).
+     A 64-byte or greater data structure or array should be aligned so that
+     its base address is a multiple of 64.  */
+  int max_align
+    = MIN ((unsigned) ix86_tune_cost->prefetch_block * 8, MAX_OFILE_ALIGNMENT);
+
+  if (max_align < BITS_PER_WORD)
+    max_align = BITS_PER_WORD;
 
   if (opt
       && AGGREGATE_TYPE_P (type)
       && TYPE_SIZE (type)
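As a quick way to observe the effect (not part of the patch), something
like the following can be compiled with -O2 and the emitted alignment
directives inspected.  The file and variable names are only illustrative,
and the resulting alignment depends on the selected tuning's prefetch
block; with a 64-byte prefetch block the 128-byte aggregate is expected
to land on a 64-byte boundary, while the 16-byte one is not forced there:

  /* align-check.c -- illustration only.
     Build and inspect with, e.g.:
       gcc -O2 -S align-check.c && grep -B2 buf align-check.s  */

  static char small_buf[16] = { 1 };	/* below the cache-line threshold */
  static char big_buf[128] = { 1 };	/* expected .align 64 at -O2 */

  const char *
  get_small (void)
  {
    return small_buf;
  }

  const char *
  get_big (void)
  {
    return big_buf;
  }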