On Fri, 2007-09-02 at 08:59 -0500, Weikuan Yu wrote:
> Hi,

Hi,

> 2) Inaccurate reports due to any drift in time

> -- Time drift is an annoying problem of IOR. IOR checks on the skew
> of timestamps from each process, but it does not calibrate the timer at
> the beginning, so it spews numerous warnings on systems with big drift
> across nodes. In addition, it reports wrong numbers for the IO rate. The
> added recalibration still makes your numbers more accurate, even if you
> did not notice you had a problem before.

We are encountering this issue, and it reminded me of this posting of
yours.  After some investigation, it turns out that unsynchronized
clocks between nodes are not the only problem.

For example, I have a cluster of 127 nodes.  I use ntp to keep the
cluster in sync, as can be seen here:

$ pdsh -S -w o[1-2,4-128] date | dshbak -c
----------------
o[98-128]
----------------
 Mon Mar 19 22:25:54 GMT 2007
----------------
o[66-97]
----------------
 Mon Mar 19 22:25:53 GMT 2007
----------------
o[1-2,5,14,19,34-65]
----------------
 Mon Mar 19 22:25:52 GMT 2007
----------------
o[4,6-13,15-18,20-33]
----------------
 Mon Mar 19 22:25:51 GMT 2007

As you can see, between any two nodes we have at most about 4 seconds of
drift, yet MPI registers a much, much bigger skew between nodes:

$ mpirun -np 127 -machinefile machfile -nolocal mpi_time | sort -k 1 -n -k 6
pass1: on o128 time is 1.320312
pass1: on o127 time is 2.187500
pass1: on o126 time is 3.273438
pass1: on o125 time is 4.140625
pass1: on o124 time is 5.222656
...
pass1: on o6 time is 120.476562
pass1: on o5 time is 121.542969
pass1: on o4 time is 122.625000
pass1: on o2 time is 123.691406
pass1: on o1 time is 124.488281

And this deviation in pass1 correlates quite closely with the amount of
time it takes mpich1 to get all of the nodes up and running the MPI
program (the rshd log timestamps below advance by roughly one node per
second):

o1: Mar 19 21:40:43 orion1 in.rshd[10105]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 34415 \-p4amslave \-p4yourname orion1 
\-p4rmrank 1'
o2: Mar 19 22:56:15 orion2 in.rshd[9942]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion2 
\-p4rmrank 1'
o4: Mar 19 22:56:16 orion4 in.rshd[9765]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion4 
\-p4rmrank 2'
o5: Mar 19 22:56:17 orion5 in.rshd[9752]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion5 
\-p4rmrank 3'
o6: Mar 19 22:56:18 orion6 in.rshd[9731]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion6 
\-p4rmrank 4'
...
o124: Mar 19 22:58:13 orion124 in.rshd[9193]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion124 
\-p4rmrank 122'
o125: Mar 19 22:58:14 orion125 in.rshd[9193]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion125 
\-p4rmrank 123'
o126: Mar 19 22:58:15 orion126 in.rshd[9187]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion126 
\-p4rmrank 124'
o127: Mar 19 22:58:16 orion127 in.rshd[9202]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion127 
\-p4rmrank 125'
o128: Mar 19 22:58:17 orion128 in.rshd[9186]: [EMAIL PROTECTED] as root: 
cmd='/usr/src/brian/mpi_time o1 35073 \-p4amslave \-p4yourname orion128 
\-p4rmrank 126'

So the pass1 time deviation is simply counting the two or so minutes it
takes to get all 127 nodes up and running the MPI program.

If I take the initial timestamp (much as you did with your patch) and
correct subsequent values returned by GetTimeStamp() with it, I get much
better numbers:

pass2: on o2 time is 0.000000
pass2: on o67 time is 0.000000
pass2: on o99 time is 0.000000
pass2: on o35 time is 0.003906
pass2: on o100 time is 0.093750
...
pass2: on o91 time is 0.402344
pass2: on o93 time is 0.402344
pass2: on o95 time is 0.402344
pass2: on o97 time is 0.402344
pass2: on o9 time is 0.402344

Since I have not audited IOR closely enough to understand all of its
timekeeping, I have to wonder: is this algorithm (correcting for the
difference in startup times of the remote processes) incorrect in some
way?

The source for mpi_time.c:

#include <stdio.h>
#include <mpi.h>

/* MPI_CHECK() is the macro from IOR.c; GetTimeStamp() also comes from
 * IOR.c and returns seconds as a double. */
extern double GetTimeStamp(void);

int main(int argc, char **argv) {

    double initial_timestamp, timestamp;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int numTasksWorld = 0;
    int rank = 0;

    /* start the MPI code */
    MPI_CHECK(MPI_Init(&argc, &argv), "cannot initialize MPI");
    MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD, &numTasksWorld),
              "cannot get number of tasks");
    MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &rank), "cannot get rank");
    MPI_CHECK(MPI_Get_processor_name(processor_name, &namelen),
              "cannot get processor name");

    /* pass1: raw timestamps, skewed by the staggered process startup */
    MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD), "barrier error");
    initial_timestamp = GetTimeStamp();
    fprintf(stdout, "pass1: on %s time is %f\n", processor_name,
            initial_timestamp);

    /* pass2: each rank's timestamp corrected by its own initial reading */
    MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD), "barrier error");
    timestamp = GetTimeStamp();
    fprintf(stdout, "pass2: on %s time is %f\n", processor_name,
            timestamp - initial_timestamp);

    MPI_CHECK(MPI_Finalize(), "cannot finalize MPI");
    return 0;
}

(which uses MPI_CHECK() and GetTimeStamp() from IOR.c)

> Let me know if you have any comments.

The only thing I'd say is that your implementation of InitTimeStamp()
seems almost redundant given GetTimeStamp().  Why not pass the
correction as an argument to GetTimeStamp(), passing 0 to initialize
and init_timeval thereafter?  Too messy?  Perhaps.  The usage would be:

init_timeval = GetTimeStamp(0);    /* to initialize */
GetTimeStamp(init_timeval);        /* thereafter */
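A rough sketch of that variant (the one-argument signature is my own
invention here, and I am assuming gettimeofday() underneath, as in
IOR's GetTimeStamp()):

#include <stddef.h>
#include <sys/time.h>

/* Pass 0.0 to get the raw time (and seed init_timeval with it);
 * pass init_timeval afterwards to get the time relative to it. */
double GetTimeStamp(double correction)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0
           - correction;
}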

In fact, I wonder whether GetTimeStamp() couldn't do this
initialization itself and always account for init_timeval.
init_timeval could be stored as a static inside GetTimeStamp(), along
with an "initialized" static boolean, so that the first time through
it gets initialized.  It's tempting to use a 0 in init_timeval as the
"not initialized" flag, but I think it could legitimately be 0.
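Again just a sketch of what I mean (same gettimeofday() assumption as
above): the first call latches the current time and returns 0, and
every later call returns seconds relative to that first call.

#include <stddef.h>
#include <sys/time.h>

double GetTimeStamp(void)
{
    static double init_timeval;    /* latched on the first call */
    static int    initialized = 0; /* explicit flag, since 0.0 could be
                                    * a legitimate init_timeval */
    struct timeval tv;
    double         now;

    gettimeofday(&tv, NULL);
    now = (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
    if (!initialized) {
        init_timeval = now;
        initialized  = 1;
    }
    return now - init_timeval;
}

The cost is that every caller then gets times relative to the first
call, which changes the meaning of GetTimeStamp() for any existing
caller that wants the absolute time.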

b.

