Hi folks,
I was reviewing the trunk MTT results tonight and found a ton of
failures in the Intel test suite on IU's odin cluster. That cluster
-usually- runs pretty clean, so I took a closer look.
The errors all follow the same pattern, typified by the following:
MPITEST_INFO ( 0): Starting test MPI_Allgather()
[odin001:31038] *** Process received signal ***
[odin001:31038] Signal: Floating point exception (8)
[odin001:31038] Signal code: Integer divide-by-zero (1)
[odin001:31038] Failing at address: 0x804c8c9
[odin001:31039] *** Process received signal ***
[odin001:31039] Signal: Floating point exception (8)
[odin001:31039] Signal code: Integer divide-by-zero (1)
[odin001:31039] Failing at address: 0x804c8c9
[odin001:31040] *** Process received signal ***
[odin001:31040] Signal: Floating point exception (8)
[odin001:31040] Signal code: Integer divide-by-zero (1)
[odin001:31040] Failing at address: 0x804c8c9
[odin001:31038] [ 0] [0xffffe600]
[odin001:31038] [ 1] src/MPI_Allgather_f(MAIN__+0x2db) [0x804b30f]
[odin001:31038] [ 2] src/MPI_Allgather_f(main+0x27) [0x805aa57]
[odin001:31038] [ 3] /lib/libc.so.6(__libc_start_main+0xdc) [0xf7c32dec]
[odin001:31038] [ 4] src/MPI_Allgather_f [0x804af81]
[odin001:31038] *** End of error message ***
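In other words, an integer divide-by-zero (delivered as signal 8, hence
the "Floating point exception" label) on a collective test. All three
processes shown fail at the same address, 0x804c8c9.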
Any ideas what might be causing this?
Ralph