Hi All,

   One of the bugs that has been holding up the HDF5 1.8.6 release appears
to be an MPI and/or file system bug.  We believe we have reproduced it on
NCSA's Abe with an MPI program (tmpi.c).

   Two requests for help:

   First, we would appreciate it if those of you who are conversant with
MPI would take a look at tmpi.c (see below), and let us know if you see
any problems with it in terms of correctness.  We think it is correct, 
but MPI can be slippery so extra eyes would be useful.

   Second, we would like to know just how widespread an issue we are dealing
with.  We know it is a problem on NCSA's Abe, and it may be a problem on TACC's 
Ranger as well.  If you are able to run tmpi.c on other machines and report 
positive or negative results, that would give us a better idea of the scope of 
the problem.

   An outline of tmpi.c follows, along with a description of how the failure
can be exposed on Abe.  For similar systems, testing with a similar protocol
should be sufficient.  In other cases, some experimentation may be required.
For example, we didn't see the issue on Abe until we ran with more than one 
process per node.  In your reports, please let us know what flavors of MPI
and file system (GPFS, Lustre, etc.) the target machine uses.  If you succeed
in exposing the failure, please let us know exactly how you did it.

   Finally, the code for tmpi.c appears later in this email, followed by 
sample output from Abe and Ranger.

   Do let us know if you can help on either front.

                                               Many thanks,

                                               John Mainzer

======================== description of tmpi.c =========================

   Briefly, tmpi.c consists of a loop in which all processes 
proceed as follows.  Note that the particulars of synchronization are 
controlled by the SET_ATOMICITY and REOPEN #defines.  The program fails
in the same way regardless of whether SET_ATOMICITY and/or REOPEN are 
TRUE.

         1) Barrier

         2) Open the test file

            If SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

         3) Participate in a collective write of an integer vector 
            to the file.  Each process writes 10 integers starting
            at index (mpi_rank * 10).  Each integer is set equal to 
            its index in the vector.

         4) if REOPEN is TRUE

                close test file/barrier/open test file

                if SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

            else if neither SET_ATOMICITY nor REOPEN is TRUE

                Sync/Barrier/Sync

         5) Participate in a collective read of the integer vector 
            from file.  Each process reads the entire vector.

         6) Verify that the vector contains the expected data.  If it 
            does not, each process issues an error message.  In addition,
            process 0 dumps the contents of the vector to stdout, and 
            also prints the contents of the vector as an ASCII string 
            starting at the point at which the data differs from the 
            expected values.

         7) if REOPEN is TRUE

                close test file/barrier/open test file

                if SET_ATOMICITY is TRUE, call MPI_File_set_atomicity()

            else if neither SET_ATOMICITY nor REOPEN is TRUE

                Sync/Barrier/Sync

         8) Construct an array of 80 characters containing the string:

                "Independent write x/y."

            followed by null characters to the end of the array.  In 
            each string, x is replaced by the number of the pass through
            the loop, and y is replaced by the MPI rank of the process.

         9) Perform an independent write of the above array to location 
            (mpi_rank * 80) in the file.

        10) Close the file.
           
The above loop is repeated 100 times.

   In the test code below, you will note that the construction of the derived 
type used in the collective write is somewhat convoluted.  This is done to 
duplicate the behavior of HDF5 under the circumstances in which this issue 
was first detected.

===================== reproducing the failure on Abe =====================

   On Abe, the program only fails if there are more processes than nodes -- my 
tests were on the head nodes of Abe.  The failure appears regularly on runs
with six processes distributed between the four head nodes.

   To compile and run on the head nodes of Abe, first start mpd's on all four
head nodes.  I did this with:

        mpdboot -n 4 -f ~/mpd.hosts

where mpd.hosts contains:

        honest1
        honest2
        honest3
        honest4

To compile and run:

        mpicc tmpi.c
        mpiexec -n 6 ./a.out

   On Abe, the apparent bug is corruption observed in the vector when it
is read from file and compared with the expected values in steps 5 and 6 
above.  As you can see from the sample output, this corruption is occasional.  
In cases where the corruption contains an identifiable string (for example,
see iteration 15 in the sample output), it appears to be data from the 
independent writes of the previous iteration.

========================== failure on Ranger ==============================

   While I don't have access to Ranger, a co-worker reports that an earlier
version of tmpi.c fails there as well -- albeit with a crash.  As I didn't 
run the test, I can't speak to the particulars.  However, I have appended the 
output reported to me on the off chance that it will be useful.

============================ test program tmpi.c ============================
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <string.h>

#define BLOCK 10
#define NITER 100
#define IND_WRITE_BUF_SIZE      80

/* set to 1 to call MPI_File_set_atomicity() */
#define SET_ATOMICITY 0


/* set to 1 to close and reopen the file after writes and reads */
#define REOPEN 0


void construct_file_mpi_datatype(int mpi_rank,
                                 int mpi_size,
                                 int block,
                                 MPI_Datatype * file_type_ptr)
{
    int          block_length[3];
    MPI_Datatype inner_type;    /* Inner MPI Datatype */
    MPI_Datatype outer_type;    /* Outer MPI Datatype */
    MPI_Datatype filetype;      /* MPI File datatype */
    MPI_Datatype old_types[3];
    MPI_Aint     extent_len;
    MPI_Aint     displacement[3];

    /* Create base contiguous type */
    MPI_Type_contiguous(sizeof(int), MPI_BYTE, &inner_type);

    MPI_Type_vector(1, block, 1, inner_type, &outer_type);
    MPI_Type_free(&inner_type);

    MPI_Type_extent(outer_type, &extent_len);

    inner_type = outer_type;

    block_length[0] = 1;
    block_length[1] = 1;
    block_length[2] = 1;

    old_types[0] = MPI_LB;
    old_types[1] = outer_type;
    old_types[2] = MPI_UB;

    displacement[0] = 0;
    displacement[1] = mpi_rank * block * sizeof(int);
    displacement[2] = (mpi_size) * block * sizeof(int);

    MPI_Type_struct(3, block_length, displacement, old_types, &inner_type);

    MPI_Type_free(&outer_type);

    filetype = inner_type;

    MPI_Type_commit(&filetype);

    *file_type_ptr = filetype;

    return;

} /* construct_file_mpi_datatype() */


void do_independant_write(MPI_File fh,
                          int mpi_rank,
                          int mpi_size,
                          int generation,
                          MPI_Offset base_offset)
{
    char         write_buf[IND_WRITE_BUF_SIZE];
    int          i;
    int          success = 1;

    for ( i = 0; i < IND_WRITE_BUF_SIZE; i++ ) {

        write_buf[i] = '\0';
    }

    sprintf(write_buf, "Independent write %d/%d.", generation, mpi_rank);

    assert(strlen(write_buf) < IND_WRITE_BUF_SIZE);

    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    MPI_File_write_at(fh, 
                      base_offset + (mpi_rank * IND_WRITE_BUF_SIZE), 
                      write_buf, 
                      IND_WRITE_BUF_SIZE, 
                      MPI_BYTE, 
                      MPI_STATUS_IGNORE);

    return;

} /* do_independant_write() */


int main(int argc, char *argv[])
{
    int          *wbuf = NULL;  /* Write buffer */
    int          *rbuf = NULL;  /* Read buffer */
    int          mpi_rank;      /* MPI Rank */
    int          mpi_size;      /* MPI Size */
    int          block = BLOCK;
    MPI_File     fh;            /* File */
    MPI_Datatype filetype;      /* MPI File datatype */
    int          failed = 0;
    int          failure_point;
    int          i, j, k;


    /* Setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    /* Loop NITER times */
    for(i=0; i<NITER; i++) {

        if ( mpi_rank == 0 ) {

            fprintf(stdout, "Itteration %d: block size == %d.\n", i, block);
        }

        /* construct the file mpi derived type */
        construct_file_mpi_datatype(mpi_rank, mpi_size, block, &filetype);

        /* Allocate buffers */
        /* All processes read the entire file */
        rbuf = (int *)malloc((mpi_size) * block * sizeof(int));

        wbuf = (int *)malloc(block * sizeof(int));

        /* Fill buffer: final file will be simply a series of increasing
         * integers: 0, 1, 2, 3... */
        for(j=0; j<block; j++)
            wbuf[j] = j + (mpi_rank * block);

        /* Barrier */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Open file collectively */
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR
                | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif

        /* Set the file view */
        MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

        /* Write the data */
        MPI_File_write_at_all(fh, 0, wbuf, (mpi_rank == 0 ? 2 : 1) * block
                * sizeof(int), MPI_BYTE, MPI_STATUS_IGNORE);

#if REOPEN
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR,
                MPI_INFO_NULL, &fh);
#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif
#else
#if ( !( REOPEN || SET_ATOMICITY ) )
        /* Sync/Barrier/Sync */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);
#endif
#endif

        MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

        /* Read the data */
        MPI_File_read_at_all(fh, 0, rbuf, (mpi_size) * block * sizeof(int),
                MPI_BYTE, MPI_STATUS_IGNORE);

        /* Verify the read data */
        failed = 0;
        for(j = 0; !failed && j < (mpi_size) * block; j++)
            if(rbuf[j] != j) {
                failed = 1;
                failure_point = j;
                printf("Rank %d detected error on iteration %d at location %d!\n",
                        mpi_rank, i, j);
            }

        if ( ( mpi_rank == 0 ) && ( failed ) ) {

            k = 0;
            fprintf(stdout, "\n");
            for ( j = 0; j < (mpi_size) * block; j++ ) {

                fprintf(stdout, " %d", rbuf[j]);
                k++;
                if ( k >= 10 ) {

                    k = 0;
                    fprintf(stdout, "\n");
                }
            }
            fprintf(stdout, "\n");

            fprintf(stdout, 
               "String representation of receive buffer starting at rbuf[%d]: \"%s\"\n\n",
               failure_point, (char *)(&(rbuf[failure_point])));
        }

#if REOPEN
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_open(MPI_COMM_WORLD, "tmpi.dat", MPI_MODE_RDWR,
                MPI_INFO_NULL, &fh);
#if SET_ATOMICITY
        MPI_File_set_atomicity(fh, 1);
#endif
#else
#if ( ! ( REOPEN || SET_ATOMICITY ) )
        /* Sync/Barrier/Sync */
        MPI_File_sync(fh);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_File_sync(fh);
#endif
#endif

        do_independant_write(fh, mpi_rank, mpi_size, i, 0);

        MPI_Type_free(&filetype);

        MPI_File_close(&fh);

        free(wbuf);
        free(rbuf);
    }

    MPI_Finalize();

    return 0;
}
============================= sample output from Abe ============================
[main...@honest1 testpar]$ mpiexec -n 6 ./a.out
Itteration 0: block size == 10.
Itteration 1: block size == 10.
Itteration 2: block size == 10.
Itteration 3: block size == 10.
Itteration 4: block size == 10.
Itteration 5: block size == 10.
Itteration 6: block size == 10.
Itteration 7: block size == 10.
Itteration 8: block size == 10.
Itteration 9: block size == 10.
Itteration 10: block size == 10.
Itteration 11: block size == 10.
Itteration 12: block size == 10.
Itteration 13: block size == 10.
Itteration 14: block size == 10.
Itteration 15: block size == 10.
Rank 0 detected error on iteration 15 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 875634789 3027503 0 0 0 0
 50 51 52 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 14/2."

Rank 2 detected error on iteration 15 at location 40!
Rank 5 detected error on iteration 15 at location 40!
Rank 3 detected error on iteration 15 at location 40!
Rank 4 detected error on iteration 15 at location 40!
Rank 1 detected error on iteration 15 at location 40!
Itteration 16: block size == 10.
Rank 0 detected error on iteration 16 at location 53!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 16 at location 53!
Rank 5 detected error on iteration 16 at location 53!
Rank 3 detected error on iteration 16 at location 53!
Rank 4 detected error on iteration 16 at location 53!
Rank 1 detected error on iteration 16 at location 53!
Itteration 17: block size == 10.
Itteration 18: block size == 10.
Itteration 19: block size == 10.
Itteration 20: block size == 10.
Itteration 21: block size == 10.
Itteration 22: block size == 10.
Itteration 23: block size == 10.
Itteration 24: block size == 10.
Itteration 25: block size == 10.
Itteration 26: block size == 10.
Itteration 27: block size == 10.
Itteration 28: block size == 10.
Itteration 29: block size == 10.
Itteration 30: block size == 10.
Rank 0 detected error on iteration 30 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 959586405 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 29/2."

Rank 2 detected error on iteration 30 at location 40!
Rank 5 detected error on iteration 30 at location 40!
Rank 3 detected error on iteration 30 at location 40!
Rank 4 detected error on iteration 30 at location 40!
Rank 1 detected error on iteration 30 at location 40!
Itteration 31: block size == 10.
Itteration 32: block size == 10.
Rank 0 detected error on iteration 32 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 825434213 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 31/2."

Rank 2 detected error on iteration 32 at location 40!
Rank 5 detected error on iteration 32 at location 40!
Rank 3 detected error on iteration 32 at location 40!
Rank 4 detected error on iteration 32 at location 40!
Rank 1 detected error on iteration 32 at location 40!
Itteration 33: block size == 10.
Itteration 34: block size == 10.
Itteration 35: block size == 10.
Rank 0 detected error on iteration 35 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 35 at location 50!
Rank 5 detected error on iteration 35 at location 50!
Rank 3 detected error on iteration 35 at location 50!
Rank 4 detected error on iteration 35 at location 50!
Rank 1 detected error on iteration 35 at location 50!
Itteration 36: block size == 10.
Itteration 37: block size == 10.
Rank 0 detected error on iteration 37 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 37 at location 50!
Rank 5 detected error on iteration 37 at location 50!
Rank 3 detected error on iteration 37 at location 50!
Rank 4 detected error on iteration 37 at location 50!
Rank 1 detected error on iteration 37 at location 50!
Itteration 38: block size == 10.
Itteration 39: block size == 10.
Itteration 40: block size == 10.
Rank 0 detected error on iteration 40 at location 53!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 40 at location 53!
Rank 5 detected error on iteration 40 at location 53!
Rank 3 detected error on iteration 40 at location 53!
Rank 4 detected error on iteration 40 at location 53!
Rank 1 detected error on iteration 40 at location 53!
Itteration 41: block size == 10.
Itteration 42: block size == 10.
Itteration 43: block size == 10.
Itteration 44: block size == 10.
Rank 0 detected error on iteration 44 at location 53!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 44 at location 53!
Rank 3 detected error on iteration 44 at location 53!
Rank 5 detected error on iteration 44 at location 53!
Rank 4 detected error on iteration 44 at location 53!
Rank 1 detected error on iteration 44 at location 53!
Itteration 45: block size == 10.
Itteration 46: block size == 10.
Itteration 47: block size == 10.
Itteration 48: block size == 10.
Itteration 49: block size == 10.
Itteration 50: block size == 10.
Itteration 51: block size == 10.
Rank 0 detected error on iteration 51 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 51 at location 50!
Rank 5 detected error on iteration 51 at location 50!
Rank 3 detected error on iteration 51 at location 50!
Rank 4 detected error on iteration 51 at location 50!
Rank 1 detected error on iteration 51 at location 50!
Itteration 52: block size == 10.
Itteration 53: block size == 10.
Itteration 54: block size == 10.
Rank 0 detected error on iteration 54 at location 53!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 54 at location 53!
Rank 5 detected error on iteration 54 at location 53!
Rank 3 detected error on iteration 54 at location 53!
Rank 4 detected error on iteration 54 at location 53!
Rank 1 detected error on iteration 54 at location 53!
Itteration 55: block size == 10.
Rank 0 detected error on iteration 55 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 875896933 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 54/2."

Rank 2 detected error on iteration 55 at location 40!
Rank 3 detected error on iteration 55 at location 40!
Rank 5 detected error on iteration 55 at location 40!
Rank 4 detected error on iteration 55 at location 40!
Rank 1 detected error on iteration 55 at location 40!
Itteration 56: block size == 10.
Rank 0 detected error on iteration 56 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 892674149 3027503 0 0 0 0
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 55/2."

Rank 2 detected error on iteration 56 at location 40!
Rank 1 detected error on iteration 56 at location 40!
Rank 5 detected error on iteration 56 at location 40!
Rank 3 detected error on iteration 56 at location 40!
Rank 4 detected error on iteration 56 at location 40!
Itteration 57: block size == 10.
Itteration 58: block size == 10.
Itteration 59: block size == 10.
Itteration 60: block size == 10.
Itteration 61: block size == 10.
Rank 0 detected error on iteration 61 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 61 at location 50!
Rank 5 detected error on iteration 61 at location 50!
Rank 3 detected error on iteration 61 at location 50!
Rank 4 detected error on iteration 61 at location 50!
Rank 1 detected error on iteration 61 at location 50!
Itteration 62: block size == 10.
Rank 0 detected error on iteration 62 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 825630821 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 61/2."

Rank 2 detected error on iteration 62 at location 40!
Rank 5 detected error on iteration 62 at location 40!
Rank 3 detected error on iteration 62 at location 40!
Rank 4 detected error on iteration 62 at location 40!
Rank 1 detected error on iteration 62 at location 40!
Itteration 63: block size == 10.
Itteration 64: block size == 10.
Rank 0 detected error on iteration 64 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 859185253 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 63/2."

Rank 2 detected error on iteration 64 at location 40!
Rank 5 detected error on iteration 64 at location 40!
Rank 3 detected error on iteration 64 at location 40!
Rank 4 detected error on iteration 64 at location 40!
Rank 1 detected error on iteration 64 at location 40!
Itteration 65: block size == 10.
Itteration 66: block size == 10.
Itteration 67: block size == 10.
Itteration 68: block size == 10.
Itteration 69: block size == 10.
Itteration 70: block size == 10.
Itteration 71: block size == 10.
Itteration 72: block size == 10.
Rank 0 detected error on iteration 72 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 72 at location 50!
Rank 5 detected error on iteration 72 at location 50!
Rank 3 detected error on iteration 72 at location 50!
Rank 4 detected error on iteration 72 at location 50!
Rank 1 detected error on iteration 72 at location 50!
Itteration 73: block size == 10.
Itteration 74: block size == 10.
Rank 0 detected error on iteration 74 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 74 at location 50!
Rank 5 detected error on iteration 74 at location 50!
Rank 3 detected error on iteration 74 at location 50!
Rank 4 detected error on iteration 74 at location 50!
Rank 1 detected error on iteration 74 at location 50!
Itteration 75: block size == 10.
Rank 0 detected error on iteration 75 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 75 at location 50!
Rank 5 detected error on iteration 75 at location 50!
Rank 3 detected error on iteration 75 at location 50!
Rank 4 detected error on iteration 75 at location 50!
Rank 1 detected error on iteration 75 at location 50!
Itteration 76: block size == 10.
Itteration 77: block size == 10.
Rank 0 detected error on iteration 77 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 909582437 3027503 0 0 0 0
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 76/2."

Rank 2 detected error on iteration 77 at location 40!
Rank 5 detected error on iteration 77 at location 40!
Rank 3 detected error on iteration 77 at location 40!
Rank 4 detected error on iteration 77 at location 40!
Rank 1 detected error on iteration 77 at location 40!
Itteration 78: block size == 10.
Itteration 79: block size == 10.
Rank 0 detected error on iteration 79 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 943136869 3027503 0 0 0 0
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 78/2."

Rank 2 detected error on iteration 79 at location 40!
Rank 5 detected error on iteration 79 at location 40!
Rank 3 detected error on iteration 79 at location 40!
Rank 4 detected error on iteration 79 at location 40!
Rank 1 detected error on iteration 79 at location 40!
Itteration 80: block size == 10.
Itteration 81: block size == 10.
Itteration 82: block size == 10.
Itteration 83: block size == 10.
Itteration 84: block size == 10.
Rank 0 detected error on iteration 84 at location 40!

Rank 2 detected error on iteration 84 at location 40!
Rank 3 detected error on iteration 84 at location 40!
Rank 5 detected error on iteration 84 at location 40!
 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 859316325 3027503 0 0 0 0
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 83/2."

Rank 4 detected error on iteration 84 at location 40!
Rank 1 detected error on iteration 84 at location 40!
Itteration 85: block size == 10.
Itteration 86: block size == 10.
Rank 0 detected error on iteration 86 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 892870757 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 85/2."

Rank 2 detected error on iteration 86 at location 40!
Rank 3 detected error on iteration 86 at location 40!
Rank 5 detected error on iteration 86 at location 40!
Rank 4 detected error on iteration 86 at location 40!
Rank 1 detected error on iteration 86 at location 40!
Itteration 87: block size == 10.
Rank 0 detected error on iteration 87 at location 53!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 50 51 52 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[53]: ""

Rank 2 detected error on iteration 87 at location 53!
Rank 5 detected error on iteration 87 at location 53!
Rank 3 detected error on iteration 87 at location 53!
Rank 4 detected error on iteration 87 at location 53!
Rank 1 detected error on iteration 87 at location 53!
Itteration 88: block size == 10.
Rank 0 detected error on iteration 88 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 926425189 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 87/2."

Rank 2 detected error on iteration 88 at location 40!
Rank 5 detected error on iteration 88 at location 40!
Rank 3 detected error on iteration 88 at location 40!
Rank 4 detected error on iteration 88 at location 40!
Rank 1 detected error on iteration 88 at location 40!
Itteration 89: block size == 10.
Rank 0 detected error on iteration 89 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 943202405 3027503 0 0 0 0
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[40]: "Independent write 88/2."

Rank 2 detected error on iteration 89 at location 40!
Rank 5 detected error on iteration 89 at location 40!
Rank 3 detected error on iteration 89 at location 40!
Rank 4 detected error on iteration 89 at location 40!
Rank 1 detected error on iteration 89 at location 40!
Itteration 90: block size == 10.
Itteration 91: block size == 10.
Rank 0 detected error on iteration 91 at location 50!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 40 41 42 43 44 45 46 47 48 49
 0 0 0 53 54 55 56 57 58 59

String representation of receive buffer starting at rbuf[50]: ""

Rank 2 detected error on iteration 91 at location 50!
Rank 5 detected error on iteration 91 at location 50!
Rank 3 detected error on iteration 91 at location 50!
Rank 4 detected error on iteration 91 at location 50!
Rank 1 detected error on iteration 91 at location 50!
Itteration 92: block size == 10.
Itteration 93: block size == 10.
Itteration 94: block size == 10.
Itteration 95: block size == 10.
Itteration 96: block size == 10.
Rank 0 detected error on iteration 96 at location 40!

 0 1 2 3 4 5 6 7 8 9
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29
 30 31 32 33 34 35 36 37 38 39
 1701080649 1684956528 544501349 1953067639 892936293 3027503 0 0 0 0
 0 0 0 0 0 0 0 0 0 0

String representation of receive buffer starting at rbuf[40]: "Independent write 95/2."

Rank 2 detected error on iteration 96 at location 40!
Rank 5 detected error on iteration 96 at location 40!
Rank 3 detected error on iteration 96 at location 40!
Rank 4 detected error on iteration 96 at location 40!
Rank 1 detected error on iteration 96 at location 40!
Itteration 97: block size == 10.
Itteration 98: block size == 10.
Itteration 99: block size == 10.
[main...@honest1 testpar]$ 
=========================== sample output from Ranger ==============================
=== Note that this is from a run of an earlier, somewhat different version of    ===
=== the test code provided above.                                                ===
====================================================================================
+ date
Wed Dec 29 14:11:04 CST 2010
+ ibrun ./john
TACC: Starting up job 1746647
TACC: Setting up parallel environment for MVAPICH ssh-based mpirun.
TACC: Setup complete. Running job script.
TACC: starting parallel tasks...
Itteration 0: block size == 10.
Itteration 1: block size == 10.
Itteration 2: block size == 10.
Itteration 3: block size == 10.
Itteration 4: block size == 10.
Itteration 5: block size == 10.
Itteration 6: block size == 10.
Itteration 7: block size == 10.
MPI process terminated unexpectedly
Exit code -5 signaled from i115-104.ranger.tacc.utexas.edu
cleanupKilling remote processes...MPI process terminated unexpectedly
DONE
TACC: MPI job exited with code: 1
TACC: Shutting down parallel environment.
TACC: Shutdown complete. Exiting.
+ date
Wed Dec 29 14:26:42 CST 2010
TACC: Cleaning up after job: 1746647
TACC: Done.
====================================================================================
