Florin Isaila wrote:
I have a question about the PVFS2 write performance.

We did some measurements with BTIO over PVFS2 on lonestar at TACC
(http://www.tacc.utexas.edu/services/userguides/lonestar/)

and we get pretty bad write results with classes B and C:

http://www.arcos.inf.uc3m.es/~florin/btio.htm

We used 16 I/O servers, the default configuration parameters and up to
100 processes. We realized that all I/O servers were also used as
metadata servers, but BTIO uses just one file.

The times are in seconds, include only I/O time (no compute time), and
are aggregated over each BTIO run (BTIO performs several writes).

TroveSyncMeta was set to yes (the default). Could this cause the I/O to
be serialized? It looks as if the writes were being serialized.

Or could the fact that all nodes were also launched as metadata
managers affect the performance?

I usually use only one metadata server; I've never tested configurations
where metaservercount == servercount.
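
If you want to rule out the metadata setup, one thing you could try is a
config with a single metadata server and with synchronous metadata writes
turned off, and see whether the serialization goes away. Very roughly, the
relevant parts of the server config look something like this (the hostname
and handle range are made up, and the exact option/section names may differ
between PVFS2 versions, so check them against what pvfs2-genconfig produces
for you):

	<MetaHandleRanges>
		Range io01 4-2147483650
	</MetaHandleRanges>

	<StorageHints>
		TroveSyncMeta no
	</StorageHints>

With only one entry under MetaHandleRanges, only that server acts as a
metadata server; TroveSyncMeta no avoids syncing metadata on every
operation (at the cost of durability if a server crashes), so it at least
tells you whether that's where the time goes.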

However, I have seen jumps in write performance before; in my case it was
caused by the network itself. I had some older IB drivers that were losing
packets under high load, which in turn caused delays until some timeout
expired, leading to really bad performance.

I assume pvfs2 is configured to use native IB?
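
(In the server config, native IB would show up as something like the lines
below -- this is from memory, so the exact spelling may be off; the point is
just that the BMI module and the server addresses should say ib rather
than tcp:

	BMIModules bmi_ib
	Alias io01 ib://io01:3334

If everything still says tcp://, the traffic is going over TCP -- IPoIB or
plain ethernet -- instead of native IB.)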

Do you see any timeouts/warnings in the kernel log?

Does the same problem appear with different write patterns (for example,
contiguous writes)? The attached program makes each CPU write into a
different file... If you have any network congestion problems, the same
pattern should show up here.
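
To build and run it, something like the following should work (the source
file name and the target path are just examples -- point it at a directory
on the PVFS2 volume):

	mpicc -O2 writetest.c -o writetest
	mpirun -np 64 ./writetest /pvfs2/scratch/testfile

Each rank then creates and writes its own file, /pvfs2/scratch/testfile.<rank>.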

  Dries
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>

/*
 * Benchmark in which each CPU writes to its own file
 */

/* amount of data written per file (MB) */
const int writesize = 64;
/* size of each individual write (MB) */
const int bufsize = 4;
/* number of times the whole test is repeated */
const int LOOPS = 3;

/* abort if an MPI call did not return MPI_SUCCESS */
int checkError (int ret)
{
	assert (ret == MPI_SUCCESS);
	return ret;
}

int main (int argc, char ** args)
{
	int loop; 
	int commrank;
	char filename[255]; 
	int commsize; 
	MPI_File file; 
	void * buf; 
	double start, stop, time, min, max, total; 
	unsigned long todo; 
	
	MPI_Init (&argc, &args); 

	MPI_Comm_size (MPI_COMM_WORLD, &commsize); 
	MPI_Comm_rank (MPI_COMM_WORLD, &commrank); 

	if (argc != 2)
	{
		fprintf (stderr, "Need filename\n"); 
		MPI_Abort (MPI_COMM_WORLD, 1); 
	}

	buf = malloc (bufsize * 1024 * 1024);
	/* fill the buffer with a rank-specific byte value */
	memset (buf, (char)(commrank + 1), bufsize * 1024 * 1024);

	/* each rank writes to its own file: <name>.<rank> */
	snprintf (filename, sizeof (filename), "%s.%d", args[1], commrank);


	for (loop=0; loop<LOOPS; ++loop)
	{
		/* remove any leftover file from a previous run; errors are ignored */
		MPI_File_delete (filename, MPI_INFO_NULL);

		checkError (MPI_File_open (MPI_COMM_SELF, filename, MPI_MODE_WRONLY|MPI_MODE_CREATE,
					MPI_INFO_NULL, &file));

		todo = writesize;   /* MB left to write to this file */

		MPI_Barrier (MPI_COMM_WORLD);
		start = MPI_Wtime ();

		while (todo)
		{
			/* write at most bufsize MB per call */
			unsigned int thiswrite = (todo > bufsize ? bufsize : todo);
			MPI_Status status;
			int count;

			todo -= thiswrite;
			thiswrite *= 1024*1024;   /* convert MB to bytes */

			checkError (MPI_File_write (file, buf, thiswrite, MPI_BYTE, &status));
			MPI_Get_count (&status, MPI_BYTE, &count);
			assert ((unsigned int) count == thiswrite);
		}


		stop = MPI_Wtime ();            /* local end of the write phase */
		MPI_Barrier (MPI_COMM_WORLD);
		total = MPI_Wtime ();           /* time at which every process has finished */

		checkError (MPI_File_close (&file));

		time = stop - start;

		MPI_Reduce (&time, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
		MPI_Reduce (&time, &min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

		if (!commrank)
		{
			fprintf (stdout, "Min %f, max %f, total=%f (MB/s) [total time: %f]\n",
					((double)writesize)/max, ((double)writesize)/min,
					((double)writesize*commsize)/(total-start),
					total-start);
		}
	}

	free (buf);

	MPI_Finalize ();

	return 0; 
}