On 05/25/2016 05:09 AM, Angel de Vicente wrote:
Hi, our Fortran code works with double-precision data, but recently I added the possibility of saving the data to files in single precision. On our workstation there is no problem with this, and the writing time in either double or single precision is basically the same. But on a cluster that uses GPFS, writing in single precision slows down drastically.
When the type in memory differs from the type in the file, HDF5 has to break collective I/O. There are property lists you can interrogate to confirm this (H5Pget_mpio_no_collective_cause; it returns a bitfield you'll have to parse yourself, unless HDF5 provides a "flags to string" routine that I do not know about).
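Roughly, the check looks like this in Fortran (a minimal sketch, assuming your HDF5 version provides the Fortran wrapper h5pget_mpio_no_collective_cause_f and an H5D_MPIO_DATATYPE_CONVERSION_F constant mirroring the C flag; if not, call the C routine H5Pget_mpio_no_collective_cause directly). dxpl_id is the transfer property list you pass to h5dwrite_f:

    subroutine check_collective_cause(dxpl_id)
      use hdf5
      implicit none
      integer(hid_t), intent(in) :: dxpl_id   ! transfer plist used in h5dwrite_f
      integer :: local_cause, global_cause, hdferr

      ! Interrogate the transfer property list *after* the write has returned.
      ! A zero cause means collective I/O actually happened; anything non-zero
      ! is a bitfield of reasons it was broken.
      call h5pget_mpio_no_collective_cause_f(dxpl_id, local_cause, &
                                             global_cause, hdferr)

      if (iand(global_cause, H5D_MPIO_DATATYPE_CONVERSION_F) /= 0) then
         print *, 'collective I/O broken by a datatype conversion'
      else if (global_cause /= 0) then
         print *, 'collective I/O broken for another reason, bits =', global_cause
      end if
    end subroutine check_collective_cause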
In such cases you'll see better performance if you convert the data in memory yourself and then write with matching types.
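Something along these lines, for example (an illustrative sketch with made-up names, not the routine from your io_timing code): down-convert the buffer once and write it with a memory type that matches the single-precision dataset on disk.

    subroutine write_single_from_double(dset_id, dxpl_id, memspace, filespace, dims, buf_d)
      use hdf5
      implicit none
      integer(hid_t),   intent(in) :: dset_id, dxpl_id, memspace, filespace
      integer(hsize_t), intent(in) :: dims(3)
      real(kind=8),     intent(in) :: buf_d(:,:,:)

      real(kind=4), allocatable :: buf_s(:,:,:)
      integer :: hdferr

      ! Down-convert once in memory so the memory type (H5T_NATIVE_REAL)
      ! matches the single-precision dataset on disk; with no per-element
      ! datatype conversion left for HDF5 to do, the transfer can stay
      ! collective.
      allocate(buf_s(size(buf_d,1), size(buf_d,2), size(buf_d,3)))
      buf_s = real(buf_d, kind=4)

      call h5dwrite_f(dset_id, H5T_NATIVE_REAL, buf_s, dims, hdferr, &
                      file_space_id = filespace, mem_space_id = memspace, &
                      xfer_prp = dxpl_id)

      deallocate(buf_s)
    end subroutine write_single_from_double

The extra copy costs one temporary array per write, but the conversion then happens at memory speed instead of forcing HDF5 to fall back to independent I/O.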
In Parallel-NetCDF we make the memory copies and type conversions inside the library, but that can be expensive and might surprise the user, so I can understand HDF5's "never copy the buffer" stance.
==rob
To see the problem, I have written a test code which just creates a data cube of size 288x288x112, and I run it on 32 processors (each one ending up with a 72x72x56 cube) across 2 nodes of the given cluster. One function (write_full_var_double) writes the data in double precision, while the other (write_full_var_single) writes the data in single precision. Each function is called 10 times, and the reported times for the writes are completely different (0.24 s to write in double precision, 8.61 s to do it in single precision). Timing report:

    Timer                       Number Iterations   Mean real time
    --------------------------------------------------------------
    Write in double precision          10             0.2393E+00
    Write in single precision          10             0.8615E+01

I have also tried MPI-IO hints (see lines 71-83 in io.F90), since this helped with another PHDF5 problem we had in the past, but in this case it does not seem to help.

Do you have any idea why the writing degrades so much for single precision, and what I could do to alleviate the issue?

The code can be obtained from:

    git clone https://[email protected]/angelv/io_timing.git

The number of processes needed is hardcoded in io.F90 as the variables nblockx, nblocky and nblockz (number of processes = nblockx*nblocky*nblockz). As it is, the code is meant to be run on 1 processor. To run it on 32 processors as above, just change nblockx=4, nblocky=4, nblockz=2, and compile it as per the file make.sh.

Any pointers/suggestions welcome. Many thanks,
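P.S. In case it helps, in condensed form the single-precision path boils down to something like this (an illustrative sketch, not the actual io.F90 code; all names are made up): the dataset is created with a single-precision file type while the buffer handed to h5dwrite_f is still the double-precision cube.

    ! Illustrative sketch only.  The file datatype (H5T_NATIVE_REAL) differs
    ! from the memory datatype (H5T_NATIVE_DOUBLE) of the buffer, so HDF5 has
    ! to convert every element during the parallel write.  The double-precision
    ! variant simply uses H5T_NATIVE_DOUBLE in h5dcreate_f as well.
    integer(hid_t)            :: file_id, filespace, memspace, dset_id, dxpl_id
    integer(hsize_t)          :: dims(3)
    real(kind=8), allocatable :: cube_d(:,:,:)
    integer                   :: hdferr

    call h5dcreate_f(file_id, 'var', H5T_NATIVE_REAL, filespace, dset_id, hdferr)
    call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, cube_d, dims, hdferr, &
                    file_space_id = filespace, mem_space_id = memspace, &
                    xfer_prp = dxpl_id)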
