Florin Isaila wrote:
I have a question about the PVFS2 write performance.
We did some measurements with BTIO over PVFS2 on lonestar at TACC
(http://www.tacc.utexas.edu/services/userguides/lonestar/)
and we get pretty bad write results with classes B and C:
http://www.arcos.inf.uc3m.es/~florin/btio.htm
We used 16 I/O servers, the default configuration parameters, and up to
100 processes. We realized that all I/O servers were also used as
metadata servers, but BTIO uses just one file.
The times are in seconds, contain only I/O time (no compute time) and
are aggregated per each BTIO run (BTIO performs several writes).
TroveSyncMeta was set to yes (the default). Could this cause the I/O to
be serialized? It looks as if the I/O were being serialized.
Or could the fact that all nodes were also launched as metadata
managers affect the performance?
I usually use only 1 metadata server; I've never tested configurations
where metaservercount == servercount.
However, I did see jumps in write performance before. In my case it was
caused by the network itself: I had some older IB drivers that were
losing packets under high load, which in turn would cause delays until
some timeout fired, leading to really bad performance.
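For reference, TroveSyncMeta is a per-filesystem storage hint in the
server config file; setting it to no avoids syncing on every metadata
update, at the cost of durability on a server crash. A fragment of the
relevant section could look like this (layout from memory, so please
verify it against the fs.conf shipped with your PVFS2 version):

```
<FileSystem>
    Name pvfs2-fs
    ID 9
    <StorageHints>
        # sync metadata on every operation (the default is yes)
        TroveSyncMeta yes
        # sync file data on every operation
        TroveSyncData no
    </StorageHints>
</FileSystem>
```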
I assume pvfs2 is configured to use native IB?
Do you see any timeouts/warnings in the kernel log?
Does the same problem appear using different write patterns?
(for example continuous writes?) The attached program makes each CPU
write into a different file... If you have any network congestion
problems, the same pattern should show up here.
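To check the kernel log point above, something along these lines can be
used (the driver-name patterns are only examples; adjust them to the
HCA/driver actually loaded on your nodes):

```shell
# Scan the kernel log for IB-related errors and timeouts.
# 'mthca' / 'ib_' are example driver prefixes -- adjust for your hardware.
dmesg | grep -iE 'mthca|ib_|timeout|retry' || echo "no matching messages"
```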
Dries
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>

/*
 * Benchmark in which each CPU writes to its own file.
 */

/* per-file size (MB) */
const int writesize = 64;
/* write buffer size (MB) */
const int bufsize = 4;
const int LOOPS = 3;

int checkError (int ret)
{
    assert (ret == MPI_SUCCESS);
    return ret;
}

int main (int argc, char ** args)
{
    int loop;
    int commrank;
    char filename[255];
    int commsize;
    MPI_File file;
    void * buf;
    double start, stop, time, min, max, total;
    unsigned long todo;

    MPI_Init (&argc, &args);
    MPI_Comm_size (MPI_COMM_WORLD, &commsize);
    MPI_Comm_rank (MPI_COMM_WORLD, &commrank);

    if (argc != 2)
    {
        fprintf (stderr, "Need filename\n");
        MPI_Abort (MPI_COMM_WORLD, 1);
    }

    buf = malloc (bufsize * 1024 * 1024);
    assert (buf != NULL);
    memset (buf, (char) (commrank + 1), bufsize * 1024 * 1024);

    /* each rank writes to its own file: <name>.<rank> */
    snprintf (filename, sizeof (filename), "%s.%u", args[1], commrank);

    for (loop = 0; loop < LOOPS; ++loop)
    {
        MPI_File_delete (filename, MPI_INFO_NULL);
        checkError (MPI_File_open (MPI_COMM_SELF, filename,
                                   MPI_MODE_WRONLY | MPI_MODE_CREATE,
                                   MPI_INFO_NULL, &file));
        todo = writesize;
        MPI_Barrier (MPI_COMM_WORLD);
        start = MPI_Wtime ();
        while (todo)
        {
            /* the last chunk may be smaller than a full buffer */
            unsigned int thiswrite = (todo > (unsigned long) bufsize ?
                                      (unsigned int) bufsize : (unsigned int) todo);
            todo -= thiswrite;
            thiswrite *= 1024 * 1024;
            MPI_Status status;
            int count;
            /* write thiswrite bytes, not always a full buffer */
            checkError (MPI_File_write (file, buf, thiswrite, MPI_BYTE, &status));
            MPI_Get_count (&status, MPI_BYTE, &count);
            assert (count == (int) thiswrite);
        }
        stop = MPI_Wtime ();
        MPI_Barrier (MPI_COMM_WORLD);
        total = MPI_Wtime ();
        checkError (MPI_File_close (&file));

        time = stop - start;
        MPI_Reduce (&time, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce (&time, &min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        if (!commrank)
        {
            /* min/max per-rank bandwidth and aggregate bandwidth (MB/s) */
            fprintf (stdout, "Min %f, max %f, total=%f (MB/s) [total time: %f]\n",
                     (double) writesize / max, (double) writesize / min,
                     ((double) writesize * commsize) / (total - start),
                     total - start);
        }
    }
    free (buf);
    MPI_Finalize ();
    return 0;
}
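The program builds and runs like any MPI program; the binary name,
process count, and path below are only illustrative (the output files
should of course live on the PVFS2 mount):

```shell
# Compile and run the per-process write benchmark (names are examples).
mpicc -O2 -o writebench writebench.c
mpirun -np 16 ./writebench /pvfs2/scratch/testfile
```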
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users