Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6 GB of writing). When I ran the MPI job on one host (the hosts have 12 cores, and by default we pack ranks onto as few hosts as possible), things happened serially: each rank finished about 2 seconds after another. I'm told that the hosts can handle a lot of I/O, but it seems there are some issues with getting that to work well. I believe we get good performance with different ranks on one host reading from different files.

I'll look into tuning the MPI/HDF5 parameters now, with an eye toward designing my application to write from different hosts. My initial tests with MPI showed degraded performance when I used different hosts for the writing, but maybe there are some parameters that will help. I can try the OpenMPI forum at that point.
best,
David Schneider

________________________________________
From: Mohr Jr, Richard Frank (Rick Mohr) [[email protected]]
Sent: Tuesday, May 19, 2015 9:15 AM
To: Schneider, David A.
Cc: [email protected]
Subject: Re: [lustre-discuss] problem getting high performance output to single file

> On May 19, 2015, at 11:40 AM, Schneider, David A.
> <[email protected]> wrote:
>
> When working with HDF5 and MPI, I have seen a number of references about
> tuning parameters, but I haven't dug into this yet. I first want to make
> sure Lustre has the high output performance at a basic level. I tried to
> write a C program that uses simple POSIX calls (open and looping over
> writes), but I don't see much increase in performance (I've tried 8 and
> 19 OSTs, 1 MB and 4 MB chunks; I write a 6 GB file).
>
> Does anyone know if this should work? What is the simplest C program I
> could write to see an increase in output performance after I stripe? Do I
> need separate processes/threads with separate file handles?

If you are looking for a simple shared-file test, you could try something like this:

1) Create a file with a stripe size of 1 GB and a stripe count of 6.

2) Write an MPI program where each process writes 1 GB of sequential data. Each process should first seek to (mpi_rank)*(1GB) and then write 1 GB. This will ensure that all processes are writing to non-overlapping parts of the file.

3) Start the program running on 6 nodes (1 process per node).

In a scenario like that, you should effectively be getting file-per-process speeds even though you are writing to a shared file, because each process is writing to a different OST.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
