Neil,

Thanks for your insight. I've tried using the h5pset_cache_f function, and that certainly helps, but I'm still getting plenty of excess unused space. It's an order of magnitude better -- hundreds of megabytes instead of the gigabytes I was seeing before. If I buffer the data myself before writing, the file compresses to about 50 MB (from a baseline of around 140 MB uncompressed), so that is still my preferred strategy for now.

Perhaps I'm misusing the cache settings? I've tried a few options, but currently, I'm using something like this:

call h5pset_cache_f(fap_list, 0, int(19997, size_t), int(1024 * 1024 * 1024, size_t), 0.75, hdferr)

That should be far greater than necessary for my datasets.

(Side note: is there an H5D_CHUNK_CACHE_W0_DEFAULT defined for Fortran?)

Thanks for your help!
--Patrick

On 4/11/2016 1:28 PM, Neil Fortner wrote:
Patrick,

What you are seeing is that, because the chunk cache is not large enough to hold
a single chunk in memory, every write has to go directly to disk. Without
compression this works, but it causes one disk write for every write you make to
the dataset instead of a single write for the whole chunk. It can be even worse
if the slice through the chunk you are writing is not contiguous, which looks to
be the case here.

With compression, since each chunk is compressed and decompressed as a single
unit, every write call forces the library to read the chunk from disk,
decompress it, apply the write to the buffer, recompress the modified buffer,
and write it back to disk. Because the chunk can change size in the process,
the library may need to move it around the file, causing fragmentation and
unused space in the file.

To fix this, you should increase the chunk cache size (via H5Pset_cache or
H5Pset_chunk_cache) so that it can hold at least one full chunk, or more if you
are striping writes across multiple chunks or otherwise need to keep multiple
chunks in the cache. This allows the library to hold the chunk in memory
between write calls and avoid flushing it to disk until the chunk is complete.

Dan,

If the chunk cache is sized correctly, it should not flush the chunk
prematurely. Do you have an example program that shows this problem?

Thanks,
-Neil

________________________________________
From: Hdf-forum <[email protected]> on behalf of Daniel Tetlow 
<[email protected]>
Sent: Wednesday, March 23, 2016 11:04 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Deflate and partial chunk writes

Hi,

I've had a similar experience with this when writing streams of 2D data, and
I've also noticed that performance is much slower if I don't write whole chunks
at a time. I would have thought (assuming you've sized the chunk cache suitably)
that each 1000x200x1 write would gradually fill up a 1000x200x50 chunk, and that
some time later the whole chunk would be deflated once, when it's evicted from
the cache, and written to disk once. But based on the performance I see, I can
only guess it's not working like this, so I also just buffer whole chunks
myself.

Dan


-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of 
Patrick Vacek
Sent: 22 March 2016 20:55
To: [email protected]
Subject: [Hdf-forum] Deflate and partial chunk writes

Hello!

I've found an interesting situation that seems like something of a bug to me. 
I've figured out how to work around it, but I wanted to bring it up in case it 
comes up for anyone else.

I use the Fortran API, and I typically create HDF5 datasets with large,
multidimensional chunks, but I only write part of a chunk at any given time.
For example, I'll use a chunk size of 1000 x 200 x 50 but only write
1000 x 200 x 1 elements at a time. This seems to work fine, although on
networked filesystems I sometimes notice that my application is I/O-limited.
The solution there is to buffer the HDF5 writes locally and then write a full
chunk at a time.
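For context, here is a rough, self-contained sketch of that write pattern
(the file name, dataset name, and the fixed 50-slab extent are placeholders,
not my real code):

    program partial_chunk_writes
      use hdf5
      implicit none

      integer(hid_t)   :: file_id, dcpl_id, filespace, memspace, dset_id
      integer(hsize_t) :: dims(3), chunk(3), count(3), offset(3)
      integer          :: hdferr, k
      double precision :: slab(1000, 200, 1)

      call h5open_f(hdferr)
      call h5fcreate_f("example.h5", H5F_ACC_TRUNC_F, file_id, hdferr)

      ! Dataset and chunk are both 1000 x 200 x 50
      dims  = (/ 1000, 200, 50 /)
      chunk = (/ 1000, 200, 50 /)
      call h5screate_simple_f(3, dims, filespace, hdferr)
      call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, hdferr)
      call h5pset_chunk_f(dcpl_id, 3, chunk, hdferr)
      call h5dcreate_f(file_id, "data", H5T_NATIVE_DOUBLE, filespace, dset_id, &
                       hdferr, dcpl_id)

      ! Write one 1000 x 200 x 1 slab at a time into the single chunk
      count = (/ 1000, 200, 1 /)
      call h5screate_simple_f(3, count, memspace, hdferr)
      do k = 1, 50
         slab   = dble(k)
         offset = (/ 0, 0, k - 1 /)
         call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, hdferr)
         call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, slab, count, hdferr, &
                         memspace, filespace)
      end do

      call h5sclose_f(memspace, hdferr)
      call h5sclose_f(filespace, hdferr)
      call h5dclose_f(dset_id, hdferr)
      call h5pclose_f(dcpl_id, hdferr)
      call h5fclose_f(file_id, hdferr)
      call h5close_f(hdferr)
    end program partial_chunk_writes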

Recently, I decided to try out the deflate/zlib filter. I've noticed that when 
I buffer the data locally and write a full chunk at a time, it works 
beautifully and compresses nicely. But if I do not write a full chunk at a time 
(say just 1000 x 200 x 1 elements), then my HDF5 file explodes in size. When I 
examine it with h5stat, I see that the 'raw data' size is about what I'd expect 
(tens of megabytes), but the 'unaccounted space' size is a few gigabytes.
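In case it helps to reproduce, the only change from the plain chunked case is
enabling deflate on the dataset creation property list (assuming dcpl_id and
chunk from the sketch above; level 6 is just an arbitrary choice):

    call h5pset_chunk_f(dcpl_id, 3, chunk, hdferr)
    call h5pset_deflate_f(dcpl_id, 6, hdferr)   ! zlib/deflate, compression level 6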

From what I can tell, it looks like the deflate filter is applied to the full
chunk even though I haven't written the whole thing yet, and as I add more to
it, the library doesn't overwrite, remove, or re-optimize the parts it has
already written. It's as if it deflates a full chunk for each small-ish write.
I haven't seen anything in the documentation or the forum to confirm this, but
it seems like a problem. If it isn't something easily addressed, perhaps there
should be a warning about this inefficiency in the documentation for the
deflate filter.

Thanks!

--
Patrick Vacek
Engineering Scientist Associate
Applied Research Labs, University of Texas


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
