Hi Mark,
On Feb 22, 2011, at 4:42 PM, Mark Miller wrote:
> On Tue, 2011-02-22 at 14:06, Quincey Koziol wrote:
>
>>
>> Well, as I say above, with this approach, you push the space
>> allocation problem to the dataset creation step (which has its own
>> set of problems),
>
> Yeah, but those 'problems' aren't new to parallel I/O issues. Anyone
> who is currently doing concurrent parallel I/O with HDF5 has already
> had to deal with this part of the problem -- space allocation at
> dataset creation -- right? The point is that the caller of HDF5 then
> knows how big the data will be after it's been compressed, and HDF5
> doesn't have to 'discover' that during H5Dwrite. Hmm, puzzling...
True, yes.
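For reference, here's a sketch of the piece that parallel applications
already handle today: allocating all the space for a chunked dataset up
front, at creation time, so every process agrees on the file layout
before any H5Dwrite. Everything in this sketch is existing API:

#include "hdf5.h"

hid_t create_with_early_alloc(hid_t file_id, hid_t space_id)
{
    hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = {1024};
    hid_t   dset;

    H5Pset_chunk(dcpl, 1, chunk);
    /* Allocate all chunk space at H5Dcreate time, collectively, rather
     * than incrementally during H5Dwrite. */
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);

    dset = H5Dcreate2(file_id, "data", H5T_NATIVE_DOUBLE, space_id,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    return dset;
}

Of course, this only works when the chunk sizes are known up front,
which is exactly what compression breaks.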
> I am recalling my suggestion of a '2-pass-planning' VFD where the caller
> executes a slew of HDF5 operations on a file TWICE. On the first pass,
> HDF5
> doesn't do any of the actual raw data I/O but just records all the
> information about it for a 'repeat performance' second pass. In the
> second pass, HDF5 knows everything about what is 'about to happen' and
> then can plan accordingly.
Ah, yes, that may be a good segue into this two-pass feature. I've
been thinking about this feature and wondering how to implement it.
Something that occurs to me would be to structure it like a "transaction":
the application opens a transaction, the HDF5 library just records the
operations performed with API routines, and then, when the application
closes the transaction, the recorded operations are replayed in two
passes: a first pass that computes the results of all the operations,
and a second pass that actually performs all the I/O. That would also
help to reduce the collective metadata modification overhead.
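Roughly, I picture it looking something like this from the
application's side -- the H5TXbegin()/H5TXcommit() names are just
invented for this sketch, nothing like them exists yet:

#include "hdf5.h"

void write_in_transaction(hid_t file_id, hid_t dset, hid_t mem_space,
                          hid_t file_space, const double *buf)
{
    /* Hypothetical: open a transaction; subsequent API calls on this
     * file are recorded rather than executed. */
    hid_t tx = H5TXbegin(file_id);

    /* Recorded, not executed: the library can apply the filters and
     * note the compressed chunk sizes without touching the file. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mem_space, file_space,
             H5P_DEFAULT, buf);

    /* Hypothetical: replay pass 1 (compute sizes, allocate space
     * collectively), then pass 2 (perform the raw data I/O). */
    H5TXcommit(tx);
}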
> What about maybe doing that on a dataset-at-a-time basis? I mean, what
> if you set dxpl props to indicate either 'pass 1' or 'pass 2' of a
> 2-pass H5Dwrite operation. On pass 1, between H5Dopen and H5Dclose,
> H5Dwrites don't do any of the raw data I/O but do apply filters and
> compute the sizes of the things they will eventually write. On
> H5Dclose of pass 1, all the information about chunk sizes is
> recorded. Caller then does
> everything again, a second time but sets 'pass' to 'pass 2' in dxpl for
> H5Dwrite calls and everything 'works' because all processors know
> everything they need to know.
Ah, I like this also!
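From the caller's side, I picture something like the sketch below --
the H5Pset_two_pass() property setter is invented for illustration;
everything else is existing API:

#include "hdf5.h"

void two_pass_write(hid_t file_id, hid_t space_id, const double *buf)
{
    hid_t   dxpl = H5Pcreate(H5P_DATASET_XFER);
    hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = {1024};
    hid_t   dset;

    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);

    /* Pass 1: apply filters and compute chunk sizes, no raw data I/O.
     * (The 'two pass' dxpl property is hypothetical.) */
    dset = H5Dcreate2(file_id, "data", H5T_NATIVE_DOUBLE, space_id,
                      H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pset_two_pass(dxpl, 1);                    /* hypothetical */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, buf);
    H5Dclose(dset);            /* chunk sizes recorded at close */

    /* Pass 2: every process now knows every chunk size, so space can
     * be allocated collectively and the raw data actually written. */
    dset = H5Dopen2(file_id, "data", H5P_DEFAULT);
    H5Pset_two_pass(dxpl, 2);                    /* hypothetical */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, buf);
    H5Dclose(dset);

    H5Pclose(dxpl);
    H5Pclose(dcpl);
}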
>> Maybe HDF5 could expose an API routine that the application could
>> call, to pre-compress the data by passing it through the I/O filters?
>
> I think that could be useful in any case. Just as it's now possible
> to apply type conversion to a buffer of bytes, it probably ought to
> be possible to apply any 'filter' to a buffer of bytes. The second
> half of this, though, would involve smartening HDF5 to 'pass through'
> pre-filtered data so the result is 'as if' HDF5 had done the filtering
> work itself during H5Dwrite. Not sure how easy that would be ;) But,
> you asked for
> comments/input.
Yes, that's the direction I was thinking about going.
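For example, an application could run the deflate filter itself with
zlib and then hand HDF5 the pre-filtered buffer. The
H5Pset_prefiltered() call below, which would make H5Dwrite pass the
buffer through untouched, is invented for this sketch; the zlib calls
are real:

#include <stdlib.h>
#include <zlib.h>
#include "hdf5.h"

herr_t write_prefiltered(hid_t dset, hid_t mem_space, hid_t file_space,
                         const unsigned char *raw, size_t raw_len)
{
    /* Compress the buffer ourselves, as H5Z_FILTER_DEFLATE would. */
    uLongf comp_len = compressBound(raw_len);
    Bytef *comp = malloc(comp_len);
    herr_t ret = -1;

    if (comp && compress(comp, &comp_len, raw, raw_len) == Z_OK) {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

        /* Hypothetical: tell the library this buffer is already
         * filtered, so the write proceeds 'as if' HDF5 had done the
         * filtering work itself. */
        H5Pset_prefiltered(dxpl, comp_len);      /* hypothetical */
        ret = H5Dwrite(dset, H5T_NATIVE_OPAQUE, mem_space, file_space,
                       dxpl, comp);
        H5Pclose(dxpl);
    }
    free(comp);
    return ret;
}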
I think the transaction idea I mentioned above might be the most
general approach and have the highest payoff. It could even be
implemented with poor man's parallel I/O when the transaction is
concluded.
Quincey