In case it helps, I've seen GCC produce different code for function A
when I change function B. This can happen even if function B is not in
the (reasonably close) execution path of function A. However, the
differences I've seen are along the lines of e.g.: using %r10 instead of
%r9, or %r13+3 instead of %r12+4 (where the pointers basically point to
the same desired data). Since both outputs are legal, the differences
can be attributed to the internal state of the compiler, which is
perhaps not totally deterministic. As long as the output is valid, one
cannot really complain about this variability.
However, these changes do expose other bugs. So, if you see these
kinds of differences correlated with "the VM crashes" versus "the VM
appears to work", then I'd suspect ABI violations with regard to the
registers involved.
On 2/1/11 22:17, Eliot Miranda wrote:
Hi All,
you may already know that there have been strange stability
problems with the Cog VM on linux. Problems with the heartbeat appear
to derive from specific compilations, one compilation of the same source
producing an executable that will crash, another producing one that
won't. Recent testing at Teleplace showed that an effect attributed to
a presumed compiler bug (specifically, compiling the heartbeat at a high
optimization level caused a crash) was not repeatable.
So today in building new production VMs for Teleplace I decided to do
three parallel linux builds and see if all produced the same results.
While there are macros used in the source that are date dependent (use
of __DATE__) AFAIA there are none apart from version.c/version.o that
depend on time, and no timestamps or current directory paths in linux
objects, and so, provided different compilations of the same source are
done on the same day, the results should be bit-identical. In my
experiment this turns out not to be the case, which is more than a
little alarming.
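One way to check the assumption about date-dependent macros is to search
the sources for them. The following is only a sketch: the function name
find_time_macros is made up for illustration, it assumes a grep that
supports -r and --include (GNU grep does), and it looks only for the
standard __DATE__ and __TIME__ macros plus GCC's __TIMESTAMP__.

```sh
# find_time_macros DIR...
# (hypothetical helper) List every use, in .c and .h files under the given
# directories, of a predefined macro whose expansion changes between builds.
find_time_macros() {
    grep -rnE '__(DATE|TIME|TIMESTAMP)__' \
        --include='*.c' --include='*.h' "$@"
}
```

Run against the source directories, it prints file:line:text for each
use, so anything other than version.c showing up would be a candidate
explanation for objects that differ between builds.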
What I'm seeing is different results after duplicating unixbuild/bld to
unixbuild/bldb and unixbuild/bldc, doing identical configures and makes
in each of the three directories, and then comparing the resulting
objects. I see this on a bare-metal laptop with local sources running
CERN SLC5 and on a Parallels VM running CentOS 5.3 (both derived from
RHEL). I'm using gcc 4.1.2. Here's a script that shows example
differences:
bld$ for f in *.o vm/*.o; do echo $f; cmp $f ../bldb/$f; cmp $f ../bldc/$f; done
disabledPlugins.o
disabledPlugins.o ../bldb/disabledPlugins.o differ: byte 200, line 4
disabledPlugins.o ../bldc/disabledPlugins.o differ: byte 200, line 4
version.o
version.o ../bldb/version.o differ: byte 166, line 3
version.o ../bldc/version.o differ: byte 166, line 3
vm/aio.o
vm/cogit.o
vm/debug.o
vm/gcc3x-cointerp.o
vm/osExports.o
vm/sqExternalSemaphores.o
vm/sqHeapMap.o
vm/sqLinuxHeartbeat.o
vm/sqLinuxWatchdog.o
vm/sqLinuxWatchdog.o ../bldb/vm/sqLinuxWatchdog.o differ: byte 33, line 1
vm/sqLinuxWatchdog.o ../bldc/vm/sqLinuxWatchdog.o differ: byte 33, line 1
vm/sqNamedPrims.o
vm/sqNamedPrims.o ../bldb/vm/sqNamedPrims.o differ: byte 6346, line 30
vm/sqNamedPrims.o ../bldc/vm/sqNamedPrims.o differ: byte 6346, line 30
vm/sqTicker.o
vm/sqUnixCharConv.o
vm/sqUnixExternalPrims.o
vm/sqUnixMain.o
vm/sqUnixMain.o ../bldb/vm/sqUnixMain.o differ: byte 31415, line 170
vm/sqUnixMain.o ../bldc/vm/sqUnixMain.o differ: byte 31414, line 170
vm/sqUnixMemory.o
vm/sqUnixThreads.o
vm/sqUnixVMProfile.o
vm/sqVirtualMachine.o
Using objdump --disassemble I can see for example that sqLinuxWatchdog.o
and sqUnixMain.o differ only in the symbol table, not the executable
code. So perhaps this is not meaningful, and merely noise. But for
simple files like disabledPlugins.c, the fact that different runs
produce different objects at all is rather worrying:
bld$ cat disabledPlugins.c
/* this should be in a header file, but it isn't. ho hum. */
typedef struct {
  char *pluginName;
  char *primitiveName;
  void *primitiveAddress;
} sqExport;
sqExport vm_display_Quartz_exports[] = { 0, 0, 0 };
sqExport vm_display_custom_exports[] = { 0, 0, 0 };
sqExport vm_display_fbdev_exports[] = { 0, 0, 0 };
sqExport vm_sound_MacOSX_exports[] = { 0, 0, 0 };
sqExport vm_sound_NAS_exports[] = { 0, 0, 0 };
sqExport vm_sound_OSS_exports[] = { 0, 0, 0 };
sqExport vm_sound_Sun_exports[] = { 0, 0, 0 };
sqExport vm_sound_custom_exports[] = { 0, 0, 0 };
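The objdump check described above, comparing only the disassembled code
of two objects, could be scripted roughly as follows. This is a sketch:
same_code is a hypothetical helper name, and it assumes a binutils-style
objdump whose output includes a "file format" header line naming the
input file, which is dropped so that only the disassembly is compared.

```sh
# same_code A.o B.o
# (hypothetical helper) Succeed if both objects disassemble identically.
same_code() (
    # Drop the header line that names the input file; it always differs.
    a=$(objdump --disassemble "$1" | sed '/file format/d')
    b=$(objdump --disassemble "$2" | sed '/file format/d')
    [ "$a" = "$b" ]
)
```

For example, same_code vm/sqUnixMain.o ../bldb/vm/sqUnixMain.o && echo
"code identical". A data-only object such as disabledPlugins.o contains
no instructions, so its disassembly will be empty; objdump
--full-contents is the option that dumps raw section contents instead.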
I wonder
- do you see the same effect?
- does this happen with gcc versions other than 4.1.2?
- does it happen on non-RHEL-derived distros?
- is this a meaningful signal or just harmless noise?
- what am I doing wrong?
Clearly I need to look more carefully but I thought I'd ask y'all in
order to understand and hopefully solve the build instabilities as
swiftly as possible.
If you do want to try to reproduce this, simply duplicate the build
directory (unixbuild/bld in the Cog VM source) twice and do three
separate configures and makes, one in each of the build directories,
each from the same source code. Then run some variation of the script
above to compare the object files so produced.
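A variation of the comparison script that covers any number of build
directories might look like this. It is a sketch only; compare_builds
is a hypothetical name, and it assumes the objects live in the top level
of each build directory and in its vm/ subdirectory, as in the listing
above.

```sh
# compare_builds REF OTHER...
# (hypothetical helper) Report each object under REF that is missing from,
# or differs from, the same relative path in any OTHER build directory.
# Exit status is 0 if everything matches and 1 otherwise.
compare_builds() (
    ref=$1; shift
    status=0
    for f in "$ref"/*.o "$ref"/vm/*.o; do
        [ -f "$f" ] || continue            # skip glob patterns that matched nothing
        rel=${f#"$ref"/}                   # path relative to the reference build
        for other in "$@"; do
            if [ ! -f "$other/$rel" ]; then
                echo "$rel: missing in $other"; status=1
            elif ! cmp -s "$f" "$other/$rel"; then
                echo "$rel: differs in $other"; status=1
            fi
        done
    done
    exit $status
)
```

From unixbuild this would be run as: compare_builds bld bldb bldc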
best
Eliot