Hi Xunlei,
On May 13, 2010, at 8:43 AM, Dr. X wrote:
> Hi Quincey,
> My understanding of parallel HDF5 is that it depends on the availability of a
> parallel file system, e.g. GPFS. For instance, I am out of luck whether I am
> using Windows XP/7 or Windows Server 2008, right?
Yes - we don't support the parallel I/O VFDs (MPI-IO and MPI-POSIX) on
Windows currently.
> As for Linux (kernel > 2.4), according to
> ftp://ftp.hdfgroup.org/HDF5/current/src/unpacked/release_docs/INSTALL_parallel
> even on a multi-core laptop, I should be able to access PHDF5 functionality.
> Is this correct? Thanks a lot.
Yes, I test parallel I/O on my MacBookPro all the time. :-)
Quincey
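For anyone who wants to try this on a laptop, here is a minimal sketch (not a definitive recipe) of a parallel HDF5 program, assuming HDF5 was configured with --enable-parallel and an MPI implementation is installed; the file name, dataset name, and sizes below are just placeholders:

    /* Build (roughly): h5pcc -o phdf5_test phdf5_test.c
       Run:             mpirun -np 4 ./phdf5_test */
    #include <mpi.h>
    #include "hdf5.h"

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* File access property list selecting the MPI-IO VFD on MPI_COMM_WORLD */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* All ranks collectively create one shared file */
        hid_t file = H5Fcreate("phdf5_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Each rank writes its rank number into one element of a shared dataset */
        hsize_t dims[1]  = { (hsize_t)nprocs };
        hid_t   space    = H5Screate_simple(1, dims, NULL);
        hid_t   dset     = H5Dcreate(file, "ranks", H5T_NATIVE_INT, space,
                                     H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t start[1] = { (hsize_t)rank }, count[1] = { 1 };
        hid_t   memspace = H5Screate_simple(1, count, NULL);
        H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_INT, memspace, space, H5P_DEFAULT, &rank);

        H5Sclose(memspace); H5Sclose(space); H5Dclose(dset);
        H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }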
> Best,
> xunlei
>
>
> On 5/13/2010 8:08 AM, Quincey Koziol wrote:
>> Hi Mark & Mark, :-)
>>
>> On May 12, 2010, at 2:13 PM, Mark Miller wrote:
>>
>>
>>> Hi Mark,
>>>
>>>
>>> On Wed, 2010-05-12 at 12:01, Mark Howison wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> All dataspaces are 1D. Currently, the datasets are contiguous. The
>>>> size of each dataset is available before the writes occur.
>>>>
>>>> There is a phase later where a large MPI communicator performs
>>>> parallel reads of the data, which is why we are using the parallel
>>>> version of the library. I think that the VFDs you are suggesting are
>>>> only available in the serial library, but I could be mistaken.
>>>>
>>> Well, for any given libhdf5.a, the other VFDs are generally
>>> available. I think the direct and MPI-related VFDs are the only ones that
>>> might not be available, depending on how HDF5 was configured prior to
>>> installation. So, if they are suitable for your needs, you should be
>>> able to use those other VFDs, even from a parallel application.
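To illustrate the point above, a rough sketch of an MPI program linked against parallel HDF5 in which each rank opens its own file through the serial sec2 VFD (file names are placeholders and error checking is omitted):

    #include <stdio.h>
    #include <mpi.h>
    #include "hdf5.h"

    /* Each MPI rank writes its own file through the (serial) sec2 VFD.
       No MPI communicator is handed to HDF5 at all. */
    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char name[64];
        snprintf(name, sizeof(name), "task_%04d.h5", rank);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_sec2(fapl);              /* plain POSIX read/write VFD */

        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... H5Dcreate/H5Dwrite calls as usual ... */

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }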
>>>
>> Yes, parallel HDF5 is a superset of serial HDF5 and all the VFDs are
>> available.
>>
>> Is each individual file created in the first phase accessed in parallel
>> later? If so, it might be reasonable to use the core VFD for creating the
>> files, then close all the files and re-open them with the MPI-IO VFD.
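A rough sketch of that two-phase approach (the function names and the 64 MB increment are just placeholders, and error checking is omitted):

    #include <mpi.h>
    #include "hdf5.h"

    /* Phase 1: create the file entirely in memory with the core VFD;
       backing_store = 1 writes it out to disk in one sweep at H5Fclose(). */
    hid_t create_with_core_vfd(const char *name)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_core(fapl, 64 * 1024 * 1024 /* 64 MB increments */, 1);
        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);
        return file;   /* ... create/write datasets, then H5Fclose(file) ... */
    }

    /* Phase 2: once the file is on disk, re-open it with the MPI-IO VFD
       on whatever communicator will perform the parallel reads. */
    hid_t reopen_with_mpio(const char *name, MPI_Comm comm)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
        hid_t file = H5Fopen(name, H5F_ACC_RDONLY, fapl);
        H5Pclose(fapl);
        return file;   /* ... parallel H5Dread calls ... */
    }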
>>
>> Quincey
>>
>>
>>> Mark
>>>
>>>
>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>> On Tue, May 11, 2010 at 4:33 PM, Mark Miller <[email protected]> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Since you didn't explicitly describe the H5Dcreate/H5Dwrite calls, I'll
>>>>> probably wind up asking some silly questions, but...
>>>>>
>>>>> How big are the dataspaces being written in H5Dwrite?
>>>>>
>>>>> Are the datasets being created with chunked or contiguous storage?
>>>>>
>>>>> Why are you even bothering with MPI-IO in this case? Since each
>>>>> processor is writing to its own file, why not use the sec2 VFD, or maybe
>>>>> even the stdio or mpiposix VFD? Or, you could try the split VFD and use
>>>>> the 'core' VFD for metadata and sec2, stdio, or mpiposix for raw data. That
>>>>> results in two actual 'files' on disk for every 'file' a task creates,
>>>>> but if this is for out-of-core storage, you'll soon be deleting them anyway.
>>>>> Using the split VFD in this way means that all metadata will be held in
>>>>> memory (in the core VFD) until the file is closed, and then it will be
>>>>> written in one large I/O request. Raw data is handled as usual.
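For reference, a minimal sketch of that split-VFD setup (the file name, extensions, and core-VFD increment are placeholders, and error handling is omitted):

    #include "hdf5.h"

    /* Split VFD: metadata goes through the core VFD (held in memory and
       flushed at close), raw data goes through the sec2 VFD. This produces
       two files on disk, e.g. task_0000.h5-m and task_0000.h5-r. */
    int main(void)
    {
        /* Property list for the metadata side: core VFD with backing store */
        hid_t meta_fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_core(meta_fapl, 1024 * 1024 /* 1 MB increments */, 1);

        /* Property list for the raw-data side: plain sec2 VFD */
        hid_t raw_fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_sec2(raw_fapl);

        /* Combine them with the split VFD */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_split(fapl, "-m", meta_fapl, "-r", raw_fapl);

        hid_t file = H5Fcreate("task_0000.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... H5Dcreate/H5Dwrite as usual; metadata stays in memory until
           H5Fclose(), then is written in one large request ... */

        H5Fclose(file);
        H5Pclose(fapl); H5Pclose(raw_fapl); H5Pclose(meta_fapl);
        return 0;
    }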
>>>>>
>>>>> Well, those are some options to try, at least.
>>>>>
>>>>> Good luck.
>>>>>
>>>>> Mark
>>>>>
>>>>> What version of HDF5 is this?
>>>>> On Tue, 2010-05-11 at 16:23 -0700, Mark Howison wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm helping a user at NERSC modify an out-of-core matrix calculation
>>>>>> code to use HDF5 for temporary storage. Each of his 30 MPI tasks is
>>>>>> writing to its own file using the MPI-IO VFD in independent mode with
>>>>>> the MPI_COMM_SELF communicator. He is creating about 20,000 datasets
>>>>>> and writing anywhere from 4 KB to 32 MB to each one. In I/O profiles, we
>>>>>> are seeing a huge spike in <1 KB writes (about 100,000). My questions
>>>>>> are:
>>>>>>
>>>>>> * Are these small writes we are seeing associated with dataset metadata?
>>>>>>
>>>>>> * Is there a "best practice" for handling this number of datasets? For
>>>>>> instance, is it better to pre-allocate the datasets before writing to
>>>>>> them?
>>>>>>
>>>>>> Thanks
>>>>>> Mark
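For context, here is a rough sketch of the file-per-task setup described above, together with one possible way to pre-allocate dataset storage at creation time via H5D_ALLOC_TIME_EARLY (the file and dataset names and sizes are placeholders, and this is only an illustration of the question, not a recommended fix):

    #include <stdio.h>
    #include <mpi.h>
    #include "hdf5.h"

    /* One file per MPI task via MPI_COMM_SELF, as in the setup above.
       H5D_ALLOC_TIME_EARLY asks HDF5 to allocate each (contiguous) dataset's
       space at H5Dcreate time rather than at first write. */
    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char name[64];
        snprintf(name, sizeof(name), "scratch_%02d.h5", rank);

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_SELF, MPI_INFO_NULL);
        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Dataset creation property list requesting early allocation */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);

        hsize_t dims[1] = { 1024 };   /* placeholder size */
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate(file, "block_00000", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* ... H5Dwrite later, repeated for the ~20,000 datasets ... */

        H5Dclose(dset); H5Sclose(space); H5Pclose(dcpl);
        H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }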
>>>>>>
>>>>>>
>>>>> --
>>>>> Mark C. Miller, Lawrence Livermore National Laboratory
>>>>> ================!!LLNL BUSINESS ONLY!!================
>>>>> [email protected] urgent: [email protected]
>>>>> T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511
>>>>>
>>>>>
>>> --
>>> Mark C. Miller, Lawrence Livermore National Laboratory
>>> ================!!LLNL BUSINESS ONLY!!================
>>> [email protected] urgent: [email protected]
>>> T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-8511
>>>
>>>
>
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org