[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2016-03-04 Thread bonzini at gnu dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Paolo Bonzini  changed:

   What|Removed |Added

 CC||bonzini at gnu dot org

--- Comment #20 from Paolo Bonzini  ---
> how to efficiently access unaligned memory?

Use memcpy between unsigned char pointers and with a constant size.  The
compiler knows how to translate it into an unaligned memory access, or even a
combination of unaligned and aligned memory accesses:

$ cat f.c
void f(char *restrict a, const char *restrict b)
{
int i;
for (i = 0; i < 512; i++)
a[i] = b[i];
}

$ gcc f.c -O3 -S -o f.s -fdump-tree-optimized
$ cat f.c.191t.optimized

;; Function f (f, funcdef_no=0, decl_uid=1832, cgraph_uid=0, symbol_order=0)

f (char * restrict a, const char * restrict b)
{
  <bb 2>:
  __builtin_memcpy (a_5(D), b_7(D), 512); [tail call]
  return;

}

$ cat f.s
...
f:
	.cfi_startproc
	movq	(%rsi), %rdx
	movq	%rdi, %rax
	leaq	8(%rdi), %rdi
	movq	%rdx, -8(%rdi)
	movq	504(%rsi), %rdx
	movq	%rdx, 496(%rdi)
	andq	$-8, %rdi
	subq	%rdi, %rax
	subq	%rax, %rsi
	addl	$512, %eax
	shrl	$3, %eax
	movl	%eax, %ecx
	rep movsq
	ret
...

It's doing unaligned accesses for the first and last 8 bytes, and 63 aligned
8-byte accesses (the rep movsq) in the middle.
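
The same idiom covers single unaligned loads and stores.  A minimal sketch
(the helper names are illustrative, not from this PR):

#include <string.h>

/* Fixed-size memcpy through unsigned char pointers: defined for any
   alignment, and GCC expands each call to a single unaligned move on
   targets that support one. */
static unsigned int read32(const unsigned char *p)
{
    unsigned int v;
    memcpy(&v, p, sizeof v);
    return v;
}

static void write32(unsigned char *p, unsigned int v)
{
    memcpy(p, &v, sizeof v);
}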

[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-08-12 Thread yann.collet.73 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #18 from Yann Collet yann.collet.73 at gmail dot com ---
This issue makes me wonder: how to efficiently access unaligned memory?


The case in point is ARM CPUs.
They don't support SSE/AVX, so they seem unaffected by this specific issue,
but it forces writing the source code in a certain way, to remain
compatible with the vectorizer's assumptions.
Therefore, for portable code, it becomes an issue:
how to write code which is both portable and efficient on both targets?


Since, apparently, writing u32 = *(U32*)ptr; is forbidden if ptr is not
guaranteed to be aligned on a 4-byte boundary, as the compiler is then
authorized to assume that ptr is properly aligned,
how can we efficiently load 4 bytes from memory at an unaligned position?

I know 3 ways (a sketch of all three follows below):

1) byte by byte: safe, but slow == not efficient

2) using memcpy: memcpy(&u32, ptr, sizeof(u32));
It works. It's safe, and on x86/x64 it's correctly translated into a single mov
instruction, so it's also fast.
Alas, on ARM targets, this gets translated into a much more complex/cautious
sequence, depending on optimization settings.
This is not a small difference:
at -O3 settings, we get a 2x performance difference;
at -O2 settings, it becomes 5x (the unaligned code being slower).

3) using the __packed attribute: basically features the same benefits and
problems as the memcpy() method above.
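
For concreteness, here is a sketch of the three approaches (assuming a
little-endian target for the byte-by-byte variant; the helper names are mine,
not from lz4):

#include <string.h>
#include <stdint.h>

/* 1) byte by byte: always defined, but the compiler may not merge it
   into a single load */
static uint32_t load32_bytes(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* 2) memcpy: defined for any alignment; a single mov on x86/x64 */
static uint32_t load32_memcpy(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* 3) packed struct (a GCC extension, similar in effect to __packed):
   accesses through it are assumed unaligned; note that strict-aliasing
   caveats may still apply if the memory is also accessed as another type */
struct unalign32 { uint32_t v; } __attribute__((packed));
static uint32_t load32_packed(const void *p)
{
    return ((const struct unalign32 *)p)->v;
}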


The problem is therefore with newer ARM CPUs, which support unaligned
memory accesses efficiently.
Reaching that performance is not possible using memcpy() nor __packed.
And it seems the only way to get it is to write: u32 = *(U32*)ptr;

The difference in performance is really huge; in fact, it totally changes the
application, so it can't be ignored.


The question is:
Is there a way to get this performance without violating the principle which
has been stated in this thread,
that is: it's not permitted to write u32 = *(U32*)ptr; if ptr is not
guaranteed to be properly aligned on a 4-byte boundary?


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-08-12 Thread noloader at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #19 from Jeffrey Walton noloader at gmail dot com ---
(In reply to Yann Collet from comment #18)
 This issue makes me wonder: how to efficiently access unaligned memory?
 
 
 The case in point is ARM CPUs.
 They don't support SSE/AVX, so they seem unaffected by this specific issue,
 but it forces writing the source code in a certain way, to remain
 compatible with the vectorizer's assumptions.

Just one comment here (sorry for speaking out of turn).

Modern ARM has __ARM_FEATURE_UNALIGNED, which means the processor tolerates
unaligned access. However, I believe it runs afoul of the C/C++ standard and
GCC aliasing rules.

 Therefore, for portable code, it becomes an issue:
 how to write code which is both portable and efficient on both targets?

I've been relying on intrinsics to side-step C/C++ requirements. In the absence
of intrinsics, I use inline assembler to avoid the C/C++ language rules.

Now, I could be totally wrong, but I don't feel I'm violating the C/C++
language rules until I write a C/C++ statement that performs the violation.
Hence the reason I use intrinsics or drop into assembler.
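
For example, on x86 the SSE2 unaligned-load/store intrinsics carry no
alignment assumption.  A sketch (my own illustration, not from this PR):

#include <emmintrin.h>

/* Copy 16 bytes between possibly unaligned buffers.  _mm_loadu_si128 and
   _mm_storeu_si128 are specified to work at any address. */
static void copy16_unaligned(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}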


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-07-13 Thread noloader at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Jeffrey Walton noloader at gmail dot com changed:

   What|Removed |Added

 CC||noloader at gmail dot com

--- Comment #14 from Jeffrey Walton noloader at gmail dot com ---
(In reply to Jakub Jelinek from comment #10)
 (In reply to Yann Collet from comment #9)
  Looking at the assembler generated, we see that GCC generates a MOVDQA
  instruction for it.
   movdqa (%rdi,%rax,1),%xmm0
   $rdi=0x7fffea4b53e6
   $rax=0x0
  
  This seems wrong on 2 levels:
  
  - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
  register, which is 16 bytes. This spells trouble, if only for buffer
  boundary checks: the algorithm uses 8 bytes because it knows it can
  safely read/write that size without crossing buffer limits. With 16 bytes,
  there is no such guarantee.
 
 The function has been inlined into the callers, like:
   do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
 and this loop is then vectorized.  The vectorization prologue of course has
 to adjust if s is not 16-byte aligned, but it can assume that both s and d
 are 8-byte aligned (otherwise it is undefined behavior)...
Forgive my barging in, Jakub. I was referred to this issue and comment from
another issue.

It's not clear to me where the leap is made that it's OK to use vmovdqa. Are you
stating (unequivocally, for folks like me) that it does *not* matter what the
alignment is in the C sources, and that the prologue ensures both 's' and 'd'
are eventually 16-byte aligned when vmovdqa is invoked? That is, when we see
vmovdqa used, we know the alignment is correct (and at 16 bytes).

Sorry to have to ask.


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-07-13 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #15 from Jakub Jelinek jakub at gcc dot gnu.org ---
I'm saying that if the program does not trigger undefined behavior (e.g. by
accessing misaligned integers without telling the compiler about it (by using
memcpy, or a packed attribute or pragma)), then it would be a compiler bug to
use an instruction requiring higher alignment than guaranteed in the source,
without ensuring such alignment (through realigning arrays, introducing a loop
for aligning pointers before the vectorized loop, peeling a few iterations
needed to align the pointer(s), or using instructions that don't require such
high alignment).
No testcase free of undefined behavior has been provided here that would show
a bug on the compiler side.


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-07-13 Thread noloader at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #16 from Jeffrey Walton noloader at gmail dot com ---
(In reply to Jakub Jelinek from comment #15)
 I'm saying that if the program does not trigger undefined behavior (e.g. by
 accessing misaligned integers without telling the compiler about it (by
 using memcpy, or a packed attribute or pragma)), then it would be a compiler
 bug to use an instruction requiring higher alignment than guaranteed in the
 source, without ensuring such alignment (through realigning arrays,
 introducing a loop for aligning pointers before the vectorized loop, peeling
 a few iterations needed to align the pointer(s), or using instructions that
 don't require such high alignment).
 No testcase free of undefined behavior has been provided here that would
 show a bug on the compiler side.

OK, so you'll have to forgive my ignorance again.

So you are saying that it may be a bug to use vmovdqa if the source and/or
destination are not 16-byte aligned, but all the user code you have seen has
undefined behavior, so you're not going to answer. Is that correct?

(My apologies if it's too sharp a point. I'm just trying to figure out what the
preconditions are for vmovdqa, and whether it should be used when the source or
destination is only 8-byte aligned.)


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-07-13 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #17 from Jakub Jelinek jakub at gcc dot gnu.org ---
(In reply to Jeffrey Walton from comment #16)
 OK, so you'll have to forgive my ignorance again.
 
 So you are saying that it may be a bug to use vmovdqa if the source and/or
 destination are not 16-byte aligned, but all the user code you have seen has
 undefined behavior, so you're not going to answer. Is that correct?
 
 (My apologies if it's too sharp a point. I'm just trying to figure out what
 the preconditions are for vmovdqa, and whether it should be used when the
 source or destination is only 8-byte aligned.)

I'm saying that we as compiler writers know what we are doing, and the various
cases, like using unaligned accesses, peeling for alignment, versioning for
alignment, or realigning arrays, are handled in the compiler.
They do assume that the source is valid and does not trigger undefined
behavior.
If you e.g. compile on x86_64 with -O3 -mavx2
void
foo (int *a, int *b)
{
  int i;
  for (i = 0; i < 64; i++)
    a[i] = 2 * b[i];
}
you'll see that the compiler decided to peel for alignment of the b pointer:
you can see an (unrolled) scalar loop first that handles the first few
iterations to align b if it is not already aligned, and then the main vector
loop uses vmovdqa for the loads and vmovups for the stores (because the a
pointer modulo 32 might not be the same as the b pointer modulo 32).  If you
compile with -O2 -ftree-vectorize -mavx2, you'll see that peeling for alignment
isn't performed, as it enlarges the code, and vmovdqu is used for the loads
instead.
The peeling for alignment assumes that there is no undefined behavior
originally, so if you call this with (uintptr_t) b % sizeof (int) != 0, it will
not work properly, but that is a bug in the code, not in the compiler.
So, if you have some testcase where no undefined behavior is triggered
(try various sanitizers, inspect the code yourself, read the C standard), and
you are convinced the compiler introduces a bug where there isn't one
originally (i.e. miscompiles the code), feel free to file a new bug report.
Nothing like that has been presented in this PR.
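
To make the caller-side undefined behavior concrete, a sketch of a caller that
is buggy in exactly this way (hypothetical, not from this PR):

#include <stdint.h>

extern void foo (int *a, int *b);   /* the vectorized function above */

static char buf[64 * sizeof (int) + 1];

void
bad_call (void)
{
  /* buf + 1 is misaligned for int, so (uintptr_t) p % sizeof (int) != 0.
     Dereferencing p inside the vectorized foo is undefined behavior:
     a crash here is a bug in this caller, not in the compiler.  */
  int *p = (int *) (buf + 1);
  foo (p, p);
}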


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread trippels at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Markus Trippelsdorf trippels at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #6 from Markus Trippelsdorf trippels at gcc dot gnu.org ---
(In reply to Richard Biener from comment #5)
 *(U64*)dstPtr = *(U64*)srcPtr; 
 
 makes GCC assume that dstPtr and srcPtr are suitably aligned for U64; if they
 are not, then you invoke undefined behavior.  As x86 doesn't trap on
 unaligned accesses unless they are from vectorized code, this shows up only
 when the vectorizer exploits that alignment info.
 
 Thus I'd say this bug is invalid.

Agreed, closing.


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Richard Biener rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Target||x86_64-*-*
   Target Milestone|--- |5.0

--- Comment #5 from Richard Biener rguenth at gcc dot gnu.org ---
*(U64*)dstPtr = *(U64*)srcPtr; 

makes GCC assume that dstPtr and srcPtr are suitably aligned for U64; if they
are not, then you invoke undefined behavior.  As x86 doesn't trap on unaligned
accesses unless they are from vectorized code, this shows up only when the
vectorizer exploits that alignment info.

Thus I'd say this bug is invalid.


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread e...@coeus-group.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #1 from Evan Nemerson e...@coeus-group.com ---
Created attachment 35267
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35267&action=edit
preprocessed test case


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread trippels at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Markus Trippelsdorf trippels at gcc dot gnu.org changed:

   What|Removed |Added

 CC||trippels at gcc dot gnu.org

--- Comment #2 from Markus Trippelsdorf trippels at gcc dot gnu.org ---
markus@x4 tmp % gcc -fsanitize=undefined -O3 test.i
markus@x4 tmp % ./a.out sum.lz4
test.c:285:29: runtime error: load of misaligned address 0x7efefb79d001 for
type 'const void', which requires 8 byte alignment
0x7efefb79d001: note: pointer points here
 00 00 00  85 7f 45 4c 46 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01
0b 98  00 00 00 34 00
  ^ 
test.c:200:16: runtime error: load of misaligned address 0x7efefb79d009 for
type 'const void', which requires 2 byte alignment
0x7efefb79d009: note: pointer points here
 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00 00
91 50  1c 00 f1 00 34
  ^ 
test.c:285:27: runtime error: store to misaligned address 0x01205c31 for
type 'void', which requires 8 byte alignment
0x01205c31: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
00 00  00 00 00 00 00
  ^ 
test.c:285:27: runtime error: store to misaligned address 0x01205c44 for
type 'void', which requires 8 byte alignment
0x01205c44: note: pointer points here
  00 00 91 50 1c 00 f1 00  34 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00
00 00 00 00 00 00 00
  ^ 
test.c:285:27: runtime error: store to misaligned address 0x01205c4c for
type 'void', which requires 8 byte alignment
0x01205c4c: note: pointer points here
  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00
00 00 00 00 00 00 00
  ^ 
test.c:285:29: runtime error: load of misaligned address 0x01205c3c for
type 'const void', which requires 8 byte alignment
0x01205c3c: note: pointer points here
  00 01 0b 98 00 00 00 34  00 00 91 50 00 00 00 00  00 34 00 20 00 05 00 28  00
1a 00 17 00 00 00 06
  ^ 
test.c:285:29: runtime error: load of misaligned address 0x01205c44 for
type 'const void', which requires 8 byte alignment
0x01205c44: note: pointer points here
  00 00 91 50 00 00 00 00  00 34 00 20 00 05 00 28  00 1a 00 17 00 00 00 06  00
00 00 34 00 00 91 50
  ^ 
test.c:273:25: runtime error: load of misaligned address 0x01205c62 for
type 'const void', which requires 4 byte alignment
0x01205c62: note: pointer points here
 00 34  00 00 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00
00  00 00 00 00 00 00
  ^ 
test.c:273:23: runtime error: store to misaligned address 0x01205c66 for
type 'void', which requires 4 byte alignment
0x01205c66: note: pointer points here
 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00
00 00 00 00 00  00 00
 ^ 
test.c:285:27: runtime error: store to misaligned address 0x0120f177 for
type 'void', which requires 8 byte alignment
0x0120f177: note: pointer points here
 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  81 6e 01 00
00 00 00 00  00 00 00
 ^


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread trippels at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #4 from Markus Trippelsdorf trippels at gcc dot gnu.org ---
If you change:

4309 static void LZ4_copy8(void* dstPtr, const void* srcPtr)
4310 {
4311
4312     if (1)
4313     {
4314         if (LZ4_64bits())
4315             *(U64*)dstPtr = *(U64*)srcPtr;
4316         else
4317             ((U32*)dstPtr)[0] = ((U32*)srcPtr)[0],
4318             ((U32*)dstPtr)[1] = ((U32*)srcPtr)[1];
4319         return;
4320     }
4321
4322     memcpy(dstPtr, srcPtr, 8);
4323 }

to if (0) it will work.

And looking at the lz4 git repository I see:

 269 static void LZ4_copy8(void* dstPtr, const void* srcPtr)
 270 {
 271 #if GCC_VERSION!=409  /* disabled on GCC 4.9, as it generates invalid
opcode (crash) */
 272     if (LZ4_UNALIGNED_ACCESS)
 273     {
 274         if (LZ4_64bits())
 275             *(U64*)dstPtr = *(U64*)srcPtr;
 276         else
 277             ((U32*)dstPtr)[0] = ((U32*)srcPtr)[0],
 278             ((U32*)dstPtr)[1] = ((U32*)srcPtr)[1];
 279         return;
 280     }
 281 #endif
 282     memcpy(dstPtr, srcPtr, 8);
 283 }
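
For comparison, a memcpy-only variant (a sketch) is defined for any alignment,
and the constant size still lets GCC expand it inline to a single 8-byte move
on x86-64 rather than a library call:

#include <string.h>

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
    /* No alignment promise is made to the compiler here, so the
       vectorizer cannot assume 8-byte alignment downstream. */
    memcpy(dstPtr, srcPtr, 8);
}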


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #8 from Jakub Jelinek jakub at gcc dot gnu.org ---
Author: jakub
Date: Thu Apr  9 19:51:08 2015
New Revision: 221958

URL: https://gcc.gnu.org/viewcvs?rev=221958&root=gcc&view=rev
Log:
PR tree-optimization/65709
* ubsan.c (instrument_mem_ref): Use TREE_TYPE (base) instead of
TREE_TYPE (TREE_TYPE (t)).

* c-c++-common/ubsan/align-9.c: New test.

Added:
trunk/gcc/testsuite/c-c++-common/ubsan/align-9.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/ubsan.c


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread trippels at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #3 from Markus Trippelsdorf trippels at gcc dot gnu.org ---
BTW clang's output is more informative:

test.c:285:29: runtime error: load of misaligned address 0x7fe669eee001 for
type 'U64' (aka 'unsigned long'), which requires 8 byte alignment
0x7fe669eee001: note: pointer points here
<memory cannot be printed>
test.c:200:16: runtime error: load of misaligned address 0x7fe669eee009 for
type 'U16' (aka 'unsigned short'), which requires 2 byte alignment
0x7fe669eee009: note: pointer points here
 01 02 01  00 01 00 f1 04 02 00 02  00 00 00 01 00 01 0b 98  00 00 00 34 00 00
91 50  1c 00 f1 00 34
  ^ 
test.c:285:13: runtime error: store to misaligned address 0x02020021 for
type 'U64' (aka 'unsigned long'), which requires 8 byte alignment
0x02020021: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00
00 00  00 00 00 00 00
  ^ 
test.c:273:25: runtime error: load of misaligned address 0x02020052 for
type 'U32' (aka 'unsigned int'), which requires 4 byte alignment
0x02020052: note: pointer points here
 00 34  00 00 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00
00  00 00 00 00 00 00
  ^ 
test.c:273:9: runtime error: store to misaligned address 0x02020056 for
type 'U32' (aka 'unsigned int'), which requires 4 byte alignment
0x02020056: note: pointer points here
 00 00 00 00 04 00  11 05 00 20 00 05 00 00  00 00 00 00 00 00 00 00  00 00 00
00 00 00 00 00  00 00
 ^ 

e.g. type 'const void' vs. type 'U32' (aka 'unsigned int')


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

--- Comment #10 from Jakub Jelinek jakub at gcc dot gnu.org ---
(In reply to Yann Collet from comment #9)
 Looking at the assembler generated, we see that GCC generates a MOVDQA
 instruction for it.
  movdqa (%rdi,%rax,1),%xmm0
  $rdi=0x7fffea4b53e6
  $rax=0x0
 
 This seems wrong on 2 levels:
 
 - The function only wants to copy 8 bytes. MOVDQA works on a full SSE
 register, which is 16 bytes. This spells trouble, if only for buffer
 boundary checks: the algorithm uses 8 bytes because it knows it can
 safely read/write that size without crossing buffer limits. With 16 bytes,
 there is no such guarantee.

The function has been inlined into the callers, like:
  do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);
and this loop is then vectorized.  The vectorization prologue of course has to
adjust if s is not 16-byte aligned, but it can assume that both s and d are
8-byte aligned (otherwise it is undefined behavior).  So, if they aren't 8-byte
aligned, you could get crashes etc.  The load is then performed as aligned,
because the vectorization prologue ensured it is aligned (unless the program
has undefined behavior), while the stores are done using movups because it is
possible the pointers aren't both aligned the same.

 - MOVDQA requires both positions to be aligned.
 I read it as being SSE-size aligned, which means 16-byte aligned.
 But they are not; these pointers are only supposed to be 8-byte aligned.
 
 (A bit off topic, but from a general perspective, I don't understand the use
 of MOVDQA, which requires such a strong alignment condition, while there is
 also MOVDQU available, which works fine at any memory address while
 suffering no performance penalty on aligned memory addresses. MOVDQU looks
 like a better choice in every circumstance.)

On most CPUs there is a significant performance difference between the two,
even if you use MOVDQU with aligned addresses.


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread yann.collet.73 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Yann Collet yann.collet.73 at gmail dot com changed:

   What|Removed |Added

 CC||yann.collet.73 at gmail dot com

--- Comment #9 from Yann Collet yann.collet.73 at gmail dot com ---
While the issue can easily be fixed from an LZ4 perspective,
the main topic here is to analyze a GCC 4.9+ vectorizer choice.

The piece of code that it tried to optimize can be summarized as follows (once
all the garbage is removed):

static void LZ4_copy8(void* dstPtr, const void* srcPtr)
{
   *(U64*)dstPtr = *(U64*)srcPtr;
}

Pretty simple.
Let's assume for the rest of the post that both pointers are correctly aligned,
so that alignment is no longer a problem.

Looking at the assembler generated, we see that GCC generates a MOVDQA
instruction for it.
 movdqa (%rdi,%rax,1),%xmm0
 $rdi=0x7fffea4b53e6
 $rax=0x0

This seems wrong on 2 levels:

- The function only wants to copy 8 bytes. MOVDQA works on a full SSE register,
which is 16 bytes. This spells trouble, if only for buffer boundary checks:
the algorithm uses 8 bytes because it knows it can safely read/write that size
without crossing buffer limits. With 16 bytes, there is no such guarantee.

- MOVDQA requires both positions to be aligned.
I read it as being SSE-size aligned, which means 16-byte aligned.
But they are not; these pointers are only supposed to be 8-byte aligned.

(A bit off topic, but from a general perspective, I don't understand the use of
MOVDQA, which requires such a strong alignment condition, while there is also
MOVDQU available, which works fine at any memory address while suffering no
performance penalty on aligned memory addresses. MOVDQU looks like a better
choice in every circumstance.)

Anyway, the core of the issue is rather the point above:
this is just an 8-byte copy operation, and replacing it with a 16-byte one
looks suspicious. Maybe it deserves a look.


[Bug tree-optimization/65709] [5 Regression] Bad code for LZ4 decompression with -O3 on x86_64

2015-04-09 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek jakub at gcc dot gnu.org ---
Created attachment 35271
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35271&action=edit
gcc5-pr65709.patch

For the bad wording in the -fsanitize=alignment diagnostics, here is a fix.
The type of the MEM_REF's argument isn't relevant, it can be anything; what
really matters is the type of the MEM_REF itself, which is the access type.
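
For illustration, a sketch of the kind of access involved (a hypothetical
example, not the committed align-9.c test):

int
load (const char *p)
{
  /* With -fsanitize=alignment, the runtime error for this load should
     name the access type (e.g. 'int'), rather than the 'const void'
     wording seen in comment #2.  p + 1 is misaligned for int whenever
     p itself is aligned.  */
  return *(const int *) (p + 1);
}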