Hi Pablo,
pointer aliasing is considered UB indeed
and accessing a pointer outside of allocated bounds also is UB

the recommended way to perform a between one pointer type and another
(reinterpret_cast) is to use memcpy,
so it should be:

    long v;
    int i[sizeof(long)/sizeof(int)];

    memcpy( i , &v, sizeof(v) );
    return i[index];

Normally, the compiler -O2 is smart enough to not allocate memory for i[2]
and can perform all ops thru registers


Le mer. 11 déc. 2019 à 18:01, Aliaksei Syrel <[email protected]> a
écrit :

> Hi Pablo,
>
> Wow! Thank you for the detective story :)
>
> Do I understand correctly that the original code causes undefined behavior
> and therefore can be changed (or even removed) by the compiler?
> (because it returns something that is referencing memory on the stack)
>
> Please keep posting similar things in future! It is very educative :)
>
> Cheers,
> Alex
>
>
> On Wed, 11 Dec 2019 at 17:35, [email protected] <[email protected]> wrote:
>
>> Hi,
>>     this mail is related to Pharo because it is knowledge I found
>> debugging the build of the VM, but the rest is to document it and
>> perhaps someone will found it interesting (also I couldn't find it
>> easily using Google). Sorry for the long mail!
>>
>> The problem
>> ==========
>>
>> The following code does not produce good code in 8.3 when using
>> optimizations:
>>
>> long __attribute__ ((noinline)) myFunc(long i, int index){
>>    long v;
>>    long x = i >> 3;
>>
>>    v = x;
>>    return ((int*)(&v))[index];
>> }
>>
>> #include <stdio.h>
>>
>> int main(){
>>
>>     long i;
>>     int x;
>>
>>     scanf("%ld", &i);
>>     scanf("%d", &x);
>>
>>     printf("%ld",myFunc(i,x));
>> }
>>
>> Basically, with -02, it generates the following code:
>>
>> myFunc:
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> And with -01 it generates the following code:
>>
>> myFunc:
>>      sarq $3, %rdi
>>      movq %rdi, -8(%rsp)
>>      movslq %esi, %rsi
>>      movslq -8(%rsp,%rsi,4), %rax
>>      ret
>>
>> As you can see, the more optimized version is losing the bit shift (or
>> any other operation).
>> To detect the problem it was not easy, basically, I have used the
>> Pharo Tests to detect the failing function and then to understand the
>> generation of code in GCC, we need to debug its generation.
>>
>> The first snippet is generated using:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall
>>
>> and the second using:
>>
>> gcc -O1 prb.c -S -o prb.s -Wall
>>
>> The GCC pipeline has different steps, I have centered my self in the
>> tree optimizations.
>> GCC dumps each of the intermediate states of the compilation, working
>> with C like trees. For example to get the optimized version of the
>> program, before generating the Assembler we have to do:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-optimized=/dev/stdout
>>
>> This will generate:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _8;
>>
>>   <bb 2> [local count: 1073741825]:
>>   _1 = (long unsigned int) index_7(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _8 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _8;
>>
>> }
>>
>> This is the optimized SSA (static single assign) version of the
>> function (https://gcc.gnu.org/onlinedocs/gccint/SSA.html)
>> As you can see in this version the code is already optimized, and our
>> operations not correctly generated. We expect to see something like:
>>
>> __attribute__((noinline))
>> myFunc (long int i, int index)
>> {
>>   long int x;
>>   long int v;
>>   long unsigned int _1;
>>   long unsigned int _2;
>>   int * _3;
>>   int _4;
>>   long int _10;
>>
>>   <bb 2> [local count: 1073741825]:
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>   _1 = (long unsigned int) index_9(D);
>>   _2 = _1 * 4;
>>   _3 = &v + _2;
>>   _4 = *_3;
>>   _10 = (long int) _4;
>>   v ={v} {CLOBBER};
>>   return _10;
>>
>> }
>>
>> In the first SSA version, we are lacking the shift operation:
>>
>>   x_6 = i_5(D) >> 3;
>>   v = x_6;
>>
>> So, we need to see in which of the optimizations and transformations
>> our code is lost:
>> To see all the steps we should execute:
>>
>> gcc -O2 prb.c -S -o prb.s -Wall -fdump-tree-all
>>
>> This will generate a lot, really a lot, of files in the same directory.
>> They have the name: prb.c.xxx.yyyy
>> Where xxx is the number of the step, and yyyy is the name of the step.
>> Each of the files contains the result of applying the changes, so what
>> I have done is looking in binary search where my code was lost.
>>
>> I have found that the problem was in the first dead code elimination
>> step (prb.c.041t.cddce1)
>> GCC does not like the crap code that we are writing, as we are copying
>> a stack variable and then accessing directly to the memory:
>>
>> v = x;
>> return ((int*)(&v))[index];
>>
>> So, basically, it was assuming that we were only using v and not x, so
>> all the code modifying x is described (this optimization is called
>> "dead store code deletion"). Basically tries to remove all the code
>> that is affecting stack (temporary) variables that are never used.
>>
>> I am not sure if this is a bug in GCC, so I will report it. But I have
>> fixed the problem writing the code in a proper way.
>>
>> Sorry again for the long mail, I wanted to store the knowledge and
>> propagate it before I eventually will flush it. I like these things,
>> but I am not debugging the GCC optimization pipeline all the days.
>>
>> --
>> Pablo Tesone.
>> [email protected]
>>
>>

Reply via email to