Hi Quincey,
My understanding is that parallel HDF5 depends on the availability of a parallel file system, e.g. GPFS. So, for instance, I am out of luck on Windows XP/7 or Windows Server 2008, right?
As for Linux (kernel > 2.4), according to
ftp://ftp.hdfgroup.org/HDF5/current/src/unpacked/release_docs/INSTALL_parallel
I should be able to use PHDF5 functionality even on a multi-core laptop.
Is this correct? Thanks a lot.
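In other words, assuming HDF5 is built against an MPI compiler with
--enable-parallel, I would expect a minimal program like the sketch
below to run on a plain laptop disk (the file name is just an example):

#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Open a file on all ranks through the MPI-IO VFD; no parallel
     * file system is assumed, just a working MPI installation. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("laptop_test.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);
    /* ... collective or independent H5Dwrite calls ... */
    H5Fclose(file);
    H5Pclose(fapl);

    MPI_Finalize();
    return 0;
}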

Best,
xunlei


On 5/13/2010 8:08 AM, Quincey Koziol wrote:
Hi Mark & Mark, :-)

On May 12, 2010, at 2:13 PM, Mark Miller wrote:

Hi Mark,


On Wed, 2010-05-12 at 12:01, Mark Howison wrote:
Hi Mark,

All dataspaces are 1D. Currently, the datasets are contiguous. The
size of each dataset is available before the writes occur.

There is a phase later where a large MPI communicator performs
parallel reads of the data, which is why we are using the parallel
version of the library. I think that the VFDs you are suggesting are
only available in the serial library, but I could be mistaken.
Well, for any given libhdf5.a, the other VFDs are generally
available. I think the direct and MPI-related VFDs are the only ones
that might be missing, depending on how HDF5 was configured before
installation. So, if they suit your needs, you should be able to use
those other VFDs, even from a parallel application.
        Yes, parallel HDF5 is a superset of serial HDF5 and all the VFDs are 
available.

        Is each individual file created in the first phase accessed in parallel 
later?  If so, it might be reasonable to use the core VFD for creating the 
files, then close all the files and re-open them with the MPI-IO VFD.
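        Something like this rough sketch, say (untested; the file name and
the dataset phase are placeholders):

#include "hdf5.h"
#include <mpi.h>

void write_then_reopen(const char *name, MPI_Comm comm)
{
    hid_t fapl, file;

    /* Phase 1: build the file through the core (in-memory) VFD.
     * backing_store = 1 flushes the whole image to disk in large
     * I/O requests when H5Fclose() is called. */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, 1 << 20 /* 1 MiB increment */, 1);
    file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... H5Dcreate/H5Dwrite the datasets here ... */
    H5Fclose(file);
    H5Pclose(fapl);

    /* Phase 2: re-open the same file with the MPI-IO VFD so the
     * larger communicator can read it in parallel. */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    file = H5Fopen(name, H5F_ACC_RDONLY, fapl);
    /* ... parallel H5Dread calls ... */
    H5Fclose(file);
    H5Pclose(fapl);
}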

        Quincey

Mark


Thanks,
Mark

On Tue, May 11, 2010 at 4:33 PM, Mark Miller <[email protected]> wrote:
Hi Mark,

Since you didn't explicitly describe the H5Dcreate/H5Dwrite calls, I'll
probably wind up asking some silly questions, but...

How big are the dataspaces being written in H5Dwrite?

Are the datasets being created with chunked or contiguous storage?

Why are you even bothering with MPI-IO in this case? Since each
processor is writing to its own file, why not use the sec2 VFD, or
maybe even the stdio or mpiposix VFD? Or, you could try the split VFD,
using the 'core' VFD for metadata and either the sec2, stdio, or
mpiposix VFD for raw data. That results in two actual 'files' on disk
for every 'file' a task creates, but if this is for out-of-core work,
you'll soon be deleting them anyway. Using the split VFD this way
means that all metadata is held in memory (in the core VFD) until the
file is closed, and then it gets written out in one large I/O request.
Raw data is handled as usual.
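Here's a rough sketch of that split setup (untested; the file name,
extensions, and increment are just illustrative):

#include "hdf5.h"

int main(void)
{
    hid_t meta_fapl, raw_fapl, fapl, file;

    /* Metadata side: core VFD, 1 MiB allocation increment,
     * backing_store = 1 so the metadata is written out at close. */
    meta_fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(meta_fapl, 1 << 20, 1);

    /* Raw-data side: plain sec2 (POSIX) VFD; stdio or mpiposix
     * would work here too. */
    raw_fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_sec2(raw_fapl);

    /* Combine them with the split VFD. This yields two files on
     * disk, e.g. scratch.h5-m.h5 and scratch.h5-r.h5. */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_split(fapl, "-m.h5", meta_fapl, "-r.h5", raw_fapl);

    file = H5Fcreate("scratch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create and write datasets as usual ... */
    H5Fclose(file);

    H5Pclose(fapl);
    H5Pclose(raw_fapl);
    H5Pclose(meta_fapl);
    return 0;
}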

Well, those are some options to try, at least.

Good luck.

Mark

What version of HDF5 is this?
On Tue, 2010-05-11 at 16:23 -0700, Mark Howison wrote:
Hi,

I'm helping a user at NERSC modify an out-of-core matrix calculation
code to use HDF5 for temporary storage. Each of his 30 MPI tasks is
writing to its own file using the MPI-IO VFD in independent mode with
the MPI_COMM_SELF communicator. He is creating about 20,000 datasets
and writing anywhere from 4KB to 32MB to each one. In IO profiles, we
are seeing a huge spike in <1KB writes (about 100,000). My questions
are:

* Are these small writes we are seeing associated with dataset metadata?

* Is there a "best practice" for handling this number of datasets? For
instance, is it better to pre-allocate the datasets before writing to
them?
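
For reference, here is a stripped-down sketch of what we are doing,
with the pre-allocation variant I am asking about (names and sizes are
illustrative, not our actual code):

#include "hdf5.h"
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char name[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One file per task, MPI-IO VFD in independent mode. */
    snprintf(name, sizeof name, "scratch_%04d.h5", rank);
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_SELF, MPI_INFO_NULL);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Pre-allocate space at dataset creation instead of at first
     * write; this is the variant I am asking about. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);

    hsize_t dims[1] = {1 << 20};  /* 1D, size known before the write */
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "block_00000", H5T_NATIVE_DOUBLE,
                             space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    /* ... H5Dwrite, then repeat for the remaining ~20,000 datasets ... */

    H5Dclose(dset);
    H5Sclose(space);
    H5Pclose(dcpl);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}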

Thanks
Mark


--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
[email protected]      urgent: [email protected]
T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511





_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
