> The symptom of the heisenbug is having the uniquieId field modified so
> that the second half is exactly the same as the first half.  Thorough
> examination of any code that touches that field finds nothing that
> even modifies it, much less copies over half the bits.  Our best idea
> as to the cause is a JVM bug in the JIT compiling that some JVMs do.
> The fact that the corruption only happens occasionally after the node
> has been running for a while makes even building a workaround (other
> than what we've done already) all the harder.

The reason is very clear to anyone who knows the Java language spec.
In short, it's a race condition: accesses to long and double variables
are not guaranteed to be atomic and need synchronization. Whether the
corruption really occurs is VM-specific, but a VM on which it occurs is
not buggy.

This is from a mail I sent to this list last year, which apparently was
swallowed:

> the request id, leaving only a JVM bug to blame.  The fact that this
> bug occurs exclusively in nodes running IBM's JVM, and doesn't occur
> when JIT compilation is disabled, forces us to conclude it's a problem
> with the JVM.

And you are really sure that it's not a bug in the Java program?
This is a long variable (long long does not exist in Java), and the
JLS (17.4) specifically states that accesses to longs are not atomic
and thus have to be synchronized.

Take the following example:

public class Race extends Thread
{
    static long cnt = 0L;

    static void update() {
        cnt += 0x100000001L;
    }

    public void run() {
        for (int i=0; i<65536; ++i) {
            update();
        }
        System.out.println(Long.toHexString(cnt));
    }

    public static void main(String[] args) {
        for (int i=0; i<1000; ++i) {
            new Race().start();
        }
    }
}

If accesses to "cnt" were atomic, this would print only values whose
high and low 32-bit words are equal. That is what happens e.g. on this VM:
 java version "1.2"
 Classic VM (build Linux_JDK_1.2_pre-release-v2, native threads, sunwjit)
but not on this one:
 java version "1.4.1-beta"
 Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-beta-b14)
 Java HotSpot(TM) Client VM (build 1.4.1-beta-b14, mixed mode)
(both from Sun/Blackdown under Linux).
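
Rather than eyeballing the hex output, the "high word equals low word"
check can be done in code. The following is only a sketch; the class
and method names (WordCheck, high, low, wordsEqual) are made up for
illustration and are not part of fred or of the Race example:

```java
public class WordCheck {
    // Upper 32 bits of a long (hypothetical helper).
    static int high(long v) {
        return (int) (v >>> 32);
    }

    // Lower 32 bits of a long (hypothetical helper).
    static int low(long v) {
        return (int) v;
    }

    // True if both 32-bit halves hold the same bit pattern.
    static boolean wordsEqual(long v) {
        return high(v) == low(v);
    }

    public static void main(String[] args) {
        long intact = 5 * 0x100000001L;    // five complete updates
        long torn   = 0x0000000500000004L; // halves from different updates

        System.out.println(Long.toHexString(intact) + " " + wordsEqual(intact));
        // prints "500000005 true"
        System.out.println(Long.toHexString(torn) + " " + wordsEqual(torn));
        // prints "500000004 false"
    }
}
```

In the Race example, any printed value for which wordsEqual returns
false is one where the two halves were written by different updates.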

The accesses to the high and low 32-bit words of cnt may interleave
between threads. This way the variable can acquire garbage values even
though it is only ever manipulated as a proper "long". Whether this
corruption actually occurs is implementation-specific, but it _may_
happen, and so long variables have to be protected properly (e.g. by
making "update" synchronized in the above example, or by declaring all
longs volatile; note that volatile only makes each single read or write
atomic, so "cnt += ..." can still lose updates without synchronization).
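
For illustration, here is a minimal sketch of the synchronized variant
of the example. The names SafeRace, get, reset and runAll are my own
additions and the thread count is arbitrary; this is one way to protect
the variable, not necessarily the right fix for fred:

```java
public class SafeRace extends Thread {
    private static long cnt = 0L;

    // The whole read-modify-write runs under the class lock, so neither
    // the two 32-bit halves nor concurrent updates can interleave.
    static synchronized void update() {
        cnt += 0x100000001L;
    }

    // Reads take the same lock, so they never see a half-written value.
    static synchronized long get() {
        return cnt;
    }

    static synchronized void reset() {
        cnt = 0L;
    }

    public void run() {
        for (int i = 0; i < 65536; ++i) {
            update();
        }
    }

    // Starts the given number of threads, waits for all of them and
    // returns the final counter value (hypothetical helper).
    static long runAll(int threads) throws InterruptedException {
        reset();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; ++i) {
            workers[i] = new SafeRace();
            workers[i].start();
        }
        for (int i = 0; i < threads; ++i) {
            workers[i].join();
        }
        return get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 8 threads * 65536 updates = 0x80000 in each 32-bit half
        System.out.println(Long.toHexString(runAll(8)));
        // prints "8000000080000"
    }
}
```

With the lock in place the final value is the same on every VM, and
the high and low words always match.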

The fact that almost all Heisenbug occurrences reported here are from
the same two types of VM, one of which (Sun 1.4 under Linux) exhibits
the unsynchronized behaviour, strongly suggests that this is indeed
the reason.

In short, the bug is in fred. Finding it may be hard though.

Olaf

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl