[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Hanoch Haim  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |INVALID

--- Comment #25 from Hanoch Haim  ---
Hi Richard,

You were right all along; I was looking in the wrong place.
I understand it now, and it is not a gcc issue. gcc 7/8 are just better than gcc
6 at code generation.

1. Alignment is contagious: gcc marks every parent object that contains such an
object as aligned.

2. With a statically allocated object there is no issue.

3. The issue in my case was a dynamic allocation of a different object that
includes the aligned object. The parent object is assumed to be aligned, but
it was allocated dynamically (and so was not aligned).


Thank you for the explanation.
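
The failure mode in point 3 and one possible fix can be sketched as below.
This is only an illustration: Histogram, Port, is_aligned and make_port are
hypothetical stand-ins, not names from the TRex sources, and the sketch
assumes a compiler with C++17 aligned new enabled (-std=c++17 or
-faligned-new). Without that support, a plain new-expression only gets the
default alignment of the allocator, which is what went wrong here.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

#if !defined(__cpp_aligned_new)
#error "this sketch assumes C++17 aligned new (-std=c++17 or -faligned-new)"
#endif

// Hypothetical stand-ins for CTimeHistogram and CCPortLatency.
struct Histogram {
    std::uint64_t counts[8];
} __attribute__((aligned(64)));

struct Port {          // no attribute here, yet alignof(Port) is 64
    int id;
    Histogram hist;
};

static_assert(alignof(Port) == 64, "alignment propagates to the parent");

// True when p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// With aligned new available, the new-expression inside make_unique calls
// operator new(std::size_t, std::align_val_t) for the over-aligned Port.
inline std::unique_ptr<Port> make_port() {
    return std::make_unique<Port>();
}
```

Calling is_aligned(make_port().get(), 64) is a cheap way to confirm that
dynamically allocated parents really end up on a 64-byte boundary.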

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #24 from Richard Biener  ---
(In reply to Richard Biener from comment #23)
> (In reply to Hanoch Haim from comment #22)
> >  
> > "Of course it does, because without aligning the container you cannot have
> > aligned members.  Maximum alignment always propagates outwards."
> > 
> > Sorry, your answer is still not clear, so let me give a short example.
> > In this case there is a discrepancy between two gcc modules:
> > 
> > 1. The module that generates the code thinks that it is aligned
> > (CCPortLatency).
> > 2. However, the linker puts it in a non-aligned location.
> > 
> > 
> > "
> > class CTimeHistogram {
> > 
> > } __rte_cache_aligned;
> > 
> > class CCPortLatency {
> > public:
> >  CTimeHistogram  m_hist;  
> > }; 
> > class Root {
> > 
> > CCPortLatency port;
> > 
> > } __rte_cache_aligned;
> > 
> > static Root root; 
> > "
> > 
> > In this case, can I expect root.port to be aligned because its child (m_hist)
> > was defined as aligned and the alignment propagates? Or should I explicitly
> > ask for both to be aligned?
> 
> Yes, for
> 
> class CTimeHistogram {
> } __attribute__((aligned(64)));
> class CCPortLatency {
> public:
> CTimeHistogram  m_hist;
> };
> class Root {
> CCPortLatency port;
> };
> static Root root;
> 
> 'root' will be aligned to 64 bytes.  This is also what you can easily
> observe when inspecting the ELF object:
> 
> Section Headers:
>   [Nr] Name  Type Address   Offset
>Size  EntSize  Flags  Link  Info  Align
> ...
>   [ 3] .bss  NOBITS     0040
>0040    WA   0 0 64
> 
> Symbol table '.symtab' contains 8 entries:
>Num:Value  Size TypeBind   Vis  Ndx Name
> ...
>  5: 64 OBJECT  LOCAL  DEFAULT3 _ZL4root
> 
> compiled without optimization since root is unused and will otherwise
> be eliminated.

You can also check with

static_assert (__alignof__(root.port.m_hist) == 64, "oops");

(need to make port public for this, eh)
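
A self-contained version of that check might look like the following. It is
a sketch, not the reporter's code: the class names follow the example above,
but the m_hcnt payload and the root_misalignment helper are assumptions
added for illustration, and every member is made public so the assertions
can name it.

```cpp
#include <cstdint>

class CTimeHistogram {
public:
    std::uint64_t m_hcnt[8];   // hypothetical payload
} __attribute__((aligned(64)));

class CCPortLatency {
public:
    CTimeHistogram m_hist;
};

class Root {
public:
    CCPortLatency port;
};

static Root root;

// Compile-time checks: the member, its parent and the grandparent all
// report 64-byte alignment although only CTimeHistogram carries the attribute.
static_assert(__alignof__(root.port.m_hist) == 64, "oops");
static_assert(alignof(CCPortLatency) == 64, "parent inherits member alignment");
static_assert(alignof(Root) == 64, "and so does the grandparent");

// Run-time counterpart of inspecting the symbol in a debugger:
// returns 0 when the static object sits on a 64-byte boundary.
static std::uintptr_t root_misalignment() {
    return reinterpret_cast<std::uintptr_t>(&root) % 64;
}
```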

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #23 from Richard Biener  ---
(In reply to Hanoch Haim from comment #22)
>  
> "Of course it does, because without aligning the container you cannot have
> aligned members.  Maximum alignment always propagates outwards."
> 
> Sorry, your answer is still not clear, so let me give a short example.
> In this case there is a discrepancy between two gcc modules:
> 
> 1. The module that generates the code thinks that it is aligned
> (CCPortLatency).
> 2. However, the linker puts it in a non-aligned location.
> 
> 
> "
> class CTimeHistogram {
> 
> } __rte_cache_aligned;
> 
> class CCPortLatency {
> public:
>  CTimeHistogram  m_hist;  
> }; 
> class Root {
> 
>   CCPortLatency port;
> 
> } __rte_cache_aligned;
> 
> static Root root; 
> "
> 
> In this case, can I expect root.port to be aligned because its child (m_hist)
> was defined as aligned and the alignment propagates? Or should I explicitly
> ask for both to be aligned?

Yes, for

class CTimeHistogram {
} __attribute__((aligned(64)));
class CCPortLatency {
public:
CTimeHistogram  m_hist;
};
class Root {
CCPortLatency port;
};
static Root root;

'root' will be aligned to 64 bytes.  This is also what you can easily
observe when inspecting the ELF object:

Section Headers:
  [Nr] Name  Type Address   Offset
   Size  EntSize  Flags  Link  Info  Align
...
  [ 3] .bss  NOBITS     0040
   0040    WA   0 0 64

Symbol table '.symtab' contains 8 entries:
   Num: Value  Size Type    Bind   Vis      Ndx Name
...
     5:        64   OBJECT  LOCAL  DEFAULT  3   _ZL4root

compiled without optimization since root is unused and will otherwise
be eliminated.

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #22 from Hanoch Haim  ---

"Of course it does, because without aligning the container you cannot have
aligned members.  Maximum alignment always propagates outwards."

Sorry, your answer is still not clear, so let me give a short example.
In this case there is a discrepancy between two gcc modules:

1. The module that generates the code thinks that it is aligned (CCPortLatency).
2. However, the linker puts it in a non-aligned location.


"
class CTimeHistogram {

} __rte_cache_aligned;

class CCPortLatency {
public:
 CTimeHistogram  m_hist;  
}; 
class Root {

CCPortLatency port;

} __rte_cache_aligned;

static Root root; 
"

In this case, can I expect root.port to be aligned because its child (m_hist)
was defined as aligned and the alignment propagates? Or should I explicitly ask
for both to be aligned?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-02 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #21 from Richard Biener  ---
(In reply to Hanoch Haim from comment #19)
> After some investigation, I think it is not a gcc issue; please verify.
> One of the internal objects does not include a 64B alignment.
> 
> #define __rte_cache_aligned __attribute__((__aligned__(64)))
> 
> class CTimeHistogram {
> 
> } __rte_cache_aligned;
> 
> 
> class CCPortLatency {
> public:
>  CTimeHistogram  m_hist;  
> } __rte_cache_aligned;  <<= without this, it is not aligned while the code
> generation assumed it is aligned !
> 
> class Root {
> 
>   CCPortLatency port;
> 
> } __rte_cache_aligned;
> 
> 
> Is it valid? Why does the code generation assume that CCPortLatency is
> aligned because one of its internal members is aligned?

Of course it does, because without aligning the container you cannot have
aligned members.  Maximum alignment always propagates outwards.
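
The arithmetic behind this can be sketched as follows. Inner, Outer and
inner_address are hypothetical names chosen for illustration. A member's
address is the container's address plus a fixed offset, and the compiler
makes that offset a multiple of the member's alignment, so the member can
only be aligned if the container is.

```cpp
#include <cstddef>
#include <cstdint>

struct Inner {
    std::uint64_t data[4];
} __attribute__((aligned(64)));

struct Outer {
    char tag;      // forces padding in front of `inner`
    Inner inner;
};

static_assert(offsetof(Outer, inner) % 64 == 0, "member offset is a multiple of 64");
static_assert(alignof(Outer) == 64, "so the container must be 64-byte aligned too");
static_assert(sizeof(Outer) % 64 == 0, "and padded so array elements stay aligned");

// Address `inner` would have if an Outer were placed at address `base`.
// Pure integer arithmetic: no object is created at a misaligned address.
constexpr std::uintptr_t inner_address(std::uintptr_t base) {
    return base + offsetof(Outer, inner);
}
```

For a base of 0x1000 the member lands on a 64-byte boundary; for a base of
0x1010 it is off by 16 bytes, exactly as much as the container itself.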

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #20 from Hanoch Haim  ---
One more thing. I would expect the issue to be in the CTimeHistogram
functions (defined as aligned), but the code generation issue was in the parent
object (CCPortLatency).
Why does the compiler assume that if one of the internal objects is defined as
aligned, the parent is aligned too?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #19 from Hanoch Haim  ---
After some investigation, I think it is not a gcc issue; please verify.
One of the internal objects does not include a 64B alignment.

#define __rte_cache_aligned __attribute__((__aligned__(64)))

class CTimeHistogram {

} __rte_cache_aligned;


class CCPortLatency {
public:
 CTimeHistogram  m_hist;  
} __rte_cache_aligned;  <<= without this, it is not aligned while the code
generation assumed it is aligned !

class Root {

CCPortLatency port;

} __rte_cache_aligned;


Is it valid? Why does the code generation assume that CCPortLatency is aligned
because one of its internal members is aligned?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #18 from Alexander Monakov  ---
It seems the problem is not in CCPortLatency::Create, but rather one of its
callers. Try to investigate in gdb where a misaligned pointer is derived from a
64-byte aligned pointer to the toplevel g_trex object (i.e. work up the stack
from the point of the crash).
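
Besides walking the stack in gdb, a temporary run-time check can help narrow
down where the pointer first goes wrong. The helper below is a hypothetical
debugging aid, not part of TRex or GCC. It could be called with `this` at the
top of suspect member functions, with the caveat that an optimizing compiler
is entitled to assume the alignment and might fold the check away, so an
unoptimized build or -fsanitize=alignment may be the more dependable route.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Number of bytes `p` lies past the previous `alignment` boundary.
inline std::size_t misalignment(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment;
}

// Abort with a message at the first place where an object is not aligned
// as strictly as its type requires. `where` is a free-form label.
template <typename T>
void require_aligned(const T* obj, const char* where) {
    if (misalignment(obj, alignof(T)) != 0) {
        std::fprintf(stderr, "%s: %p is not %zu-byte aligned\n",
                     where, static_cast<const void*>(obj), alignof(T));
        std::abort();
    }
}
```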

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #17 from Uroš Bizjak  ---
The asm dump claims that the access is aligned to 32 bytes (A256 in the MEM
attributes is the alignment in bits):

#(insn 14 31 9 2 (set (mem:V4DI (plus:DI (reg/f:DI 3 bx [orig:90 this ] [90])
#(const_int 64 [0x40])) [6 MEM[(long unsigned int *)this_6(D) +
64B]+0 S32 A256])
#(reg:V4DI 21 xmm0 [92])) "../../src/stateful_rx_core.cpp":254 1228
{movv4di_internal}
# (nil))
vmovdqa %ymm0, 64(%rbx) # 14 movv4di_internal/4 [length = 5]


which gets expanded from:

;; MEM[(long unsigned int *)this_6(D) + 64B] = { 0, 0, 0, 0 };

(insn 13 12 14 (set (reg:V4DI 92)
(const_vector:V4DI [
(const_int 0 [0])
(const_int 0 [0])
(const_int 0 [0])
(const_int 0 [0])
])) "../../src/stateful_rx_core.cpp":254 -1
 (nil))

(insn 14 13 0 (set (mem:V4DI (plus:DI (reg/f:DI 90 [ this ])
(const_int 64 [0x40])) [6 MEM[(long unsigned int *)this_6(D) +
64B]+0 S32 A256])
(reg:V4DI 92)) "../../src/stateful_rx_core.cpp":254 -1
 (nil))

So, not a target issue.

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #16 from Hanoch Haim  ---
The global/parent object CGlobalTRex is aligned (64B) as expected:

(gdb) p _trex
$1 = (CGlobalTRex *) 0xc365c0 

Could you explain why it is a problem to define the internal objects with the
same alignment as the parent (64B)?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #15 from Richard Biener  ---
(In reply to Richard Biener from comment #14)
> (In reply to Hanoch Haim from comment #12)
> > Removing __rte_cache_aligned does not solve the issue
> > 
> >  
> > diff --git a/src/time_histogram.h b/src/time_histogram.h
> > index 07e66b49..26a37248 100755
> > --- a/src/time_histogram.h
> > +++ b/src/time_histogram.h
> > @@ -133,10 +133,10 @@ private:
> >  uint32_t m_win_cnt;
> >  uint32_t m_hot_max;
> >  dsec_t   m_max_ar[HISTOGRAM_QUEUE_SIZE]; // Array of maximum latencies
> > for previous periods
> > -uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE] __rte_cache_aligned
> > ;
> > +uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE]  ;
> >  // Hdr histogram instance
> >  hdr_histogram *m_hdrh;
> > -};
> > +} __rte_cache_aligned;
> 
> There are more aligned attributes.  I see
> 
> class CLatencyManager : public TrexRxCore {
> ...
>  volatile bool m_do_stop __attribute__((__aligned__(64))) ;
> 
> struct rte_ring {
>  char name[32] __attribute__((__aligned__(64)));
> 
> class CFlowGenListPerThread {
> ...
> } __attribute__((__aligned__(64)));
> 
> etc.
> 
> Can you check the .bss section Alignment in the final executable/shared
> object?
> Do you by chance substitute the program loader for something not honoring
> large alignment of .bss sections?

You can also check with a debugger whether your global static object
is properly aligned.

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #14 from Richard Biener  ---
(In reply to Hanoch Haim from comment #12)
> Removing __rte_cache_aligned does not solve the issue
> 
>  
> diff --git a/src/time_histogram.h b/src/time_histogram.h
> index 07e66b49..26a37248 100755
> --- a/src/time_histogram.h
> +++ b/src/time_histogram.h
> @@ -133,10 +133,10 @@ private:
>  uint32_t m_win_cnt;
>  uint32_t m_hot_max;
>  dsec_t   m_max_ar[HISTOGRAM_QUEUE_SIZE]; // Array of maximum latencies
> for previous periods
> -uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE] __rte_cache_aligned
> ;
> +uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE]  ;
>  // Hdr histogram instance
>  hdr_histogram *m_hdrh;
> -};
> +} __rte_cache_aligned;

There are more aligned attributes.  I see

class CLatencyManager : public TrexRxCore {
...
 volatile bool m_do_stop __attribute__((__aligned__(64))) ;

struct rte_ring {
 char name[32] __attribute__((__aligned__(64)));

class CFlowGenListPerThread {
...
} __attribute__((__aligned__(64)));

etc.

Can you check the .bss section Alignment in the final executable/shared object?
Do you by chance substitute the program loader for something not honoring
large alignment of .bss sections?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #13 from Hanoch Haim  ---
One more thing: the parent object is defined with 64-byte alignment.

class CGlobalTRex  {
..

} __rte_cache_aligned;

static CGlobalTRex  trex;

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Hanoch Haim  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |---

--- Comment #12 from Hanoch Haim  ---
Removing __rte_cache_aligned does not solve the issue


diff --git a/src/time_histogram.h b/src/time_histogram.h
index 07e66b49..26a37248 100755
--- a/src/time_histogram.h
+++ b/src/time_histogram.h
@@ -133,10 +133,10 @@ private:
 uint32_t m_win_cnt;
 uint32_t m_hot_max;
 dsec_t   m_max_ar[HISTOGRAM_QUEUE_SIZE]; // Array of maximum latencies for
previous periods
-uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE] __rte_cache_aligned ;
+uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE]  ;
 // Hdr histogram instance
 hdr_histogram *m_hdrh;
-};
+} __rte_cache_aligned;

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #11 from Hanoch Haim  ---
Thanks for the quick answer.
The parent object is static (bss) and wasn't dynamically allocated using
new/malloc.
gcc sets the address of the parent object and its children.

Is there a way to solve it without removing the alignment?

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |INVALID

--- Comment #10 from Richard Biener  ---
I can reproduce the aligned stores with -O3 -march=core-avx2 with GCC 7/8/9.

Note that when I check alignof(CCPortLatency) I do get 64-byte alignment
because it has a private member of type CTimeHistogram, which has a member

uint64_t m_hcnt[HISTOGRAM_SIZE_LOG][HISTOGRAM_SIZE]
__attribute__((__aligned__(64))) ;

This kind of overaligned type doesn't play well with "old" C++ new. You either
need support for overaligned types, which is only in newer C++ standards
(C++17), or have to resort to posix_memalign or friends to allocate memory.

Or simply drop the aligned attribute from above.
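
For code that cannot rely on C++17 aligned new, the posix_memalign route
could look roughly like this. It is a minimal sketch assuming a POSIX
system. Histogram, create_aligned and destroy_aligned are hypothetical names
made up for the illustration, and the object is default-constructed with
placement new into the aligned storage.

```cpp
#include <cstdint>
#include <new>        // placement new
#include <stdlib.h>   // posix_memalign, free (POSIX)

// Hypothetical over-aligned type standing in for CTimeHistogram.
struct Histogram {
    std::uint64_t counts[8];
} __attribute__((aligned(64)));

// Allocate storage aligned to alignof(T) and construct a T in it.
// Returns nullptr if the allocation fails.
template <typename T>
T* create_aligned() {
    // posix_memalign requires a power-of-two multiple of sizeof(void*).
    static_assert(alignof(T) % sizeof(void*) == 0,
                  "alignment too small for posix_memalign");
    void* raw = nullptr;
    if (posix_memalign(&raw, alignof(T), sizeof(T)) != 0) {
        return nullptr;
    }
    return new (raw) T();
}

// Destroy the object and release storage obtained from create_aligned.
template <typename T>
void destroy_aligned(T* p) {
    if (p == nullptr) {
        return;
    }
    p->~T();
    free(p);
}
```

Objects created this way must be released with destroy_aligned rather than
delete, since the storage did not come from operator new.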

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #9 from Hanoch Haim  ---
Attached. I hope this is what you are looking for.

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #8 from Hanoch Haim  ---
Created attachment 46542
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46542&action=edit
stateful_rx_core.ss

[Bug target/91043] GCC produces unaligned vmovdqa vector data access

2019-07-01 Thread hhaim at cisco dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91043

--- Comment #7 from Hanoch Haim  ---
Created attachment 46541
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46541&action=edit
stateful_rx_core.ii

compress ii