Re: [Hdf-forum] Performance hints for large dataset

Martin Sarajærvi Mon, 16 Jun 2014 05:55:24 -0700

Hi Ger,

Thanks for your reply.


I have played a bit with the cache size, but not tried your particular
suggestion.

Sure, I could optimize the chunk sizes for each of the slicing directions,
but this does not really solve my problem, as I can only have one chunk
setting for my dataset. Or am I missing something? If I optimize for
(*,y,*,*), including adjusting the cache setting, the (*,*,z,w) slicing
would still be slow.
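
To put rough numbers on this, here is a small sketch (plain Python, no HDF5
calls) that counts the chunks one slice touches in each direction and the
cache needed to hold them all. It assumes 4-byte floats and that edge chunks
occupy a full chunk, so the figures come out a little above a straight
element count. The function names and the "y-tuned" chunk shape are made up
for illustration.

```python
from math import ceil, prod

DIMS = (601, 482, 61, 1501)  # (x, y, z, w) sizes from the original post
ITEMSIZE = 4                 # assumption: 4-byte floats

def chunks_touched(dims, chunk, fixed):
    """Number of chunks a slice intersects when the axes in `fixed` are pinned to one index."""
    return prod(1 if axis in fixed else ceil(n / c)
                for axis, (n, c) in enumerate(zip(dims, chunk)))

def cache_bytes(dims, chunk, fixed, itemsize=ITEMSIZE):
    """Bytes needed to keep every touched chunk in the chunk cache at once."""
    return chunks_touched(dims, chunk, fixed) * prod(chunk) * itemsize

# Slicing directions: the set holds the axes pinned to a single index.
SLICES = {"(x,*,*,*)": {0}, "(*,y,*,*)": {1}, "(*,*,z,w)": {2, 3}}
# The second shape is hypothetical, chosen only to favour (*,y,*,*).
SHAPES = {"current": (8, 8, 8, 8), "y-tuned (hypothetical)": (16, 1, 16, 16)}

for label, chunk in SHAPES.items():
    for name, fixed in SLICES.items():
        n = chunks_touched(DIMS, chunk, fixed)
        mib = cache_bytes(DIMS, chunk, fixed) / 2**20
        print(f"{label:24s} {name}: {n:7d} chunks, {mib:8.1f} MiB")
```

Under these assumptions the y-tuned shape cuts the chunk count for (*,y,*,*)
by a factor of eight but needs roughly four times as many chunks for
(*,*,z,w), which is the trade-off described above.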

I would be interested in checking your program for chunk/cache size testing
(running Linux here).
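
For reference, the blocked z,w loop order you describe can be sketched like
this. It is plain Python with hypothetical helper names; it does not touch
HDF5 and only counts how often a traversal moves to a different z,w chunk
block, as a rough proxy for how often the cached chunks have to change.

```python
def blocked_zw(nz, nw, ncz, ncw):
    """Yield (z, w) so that every index pair in one z,w chunk block is visited before the next block."""
    for w0 in range(0, nw, ncw):                      # chunk blocks along w
        for z0 in range(0, nz, ncz):                  # chunk blocks along z
            for w in range(w0, min(w0 + ncw, nw)):    # indices inside the block
                for z in range(z0, min(z0 + ncz, nz)):
                    yield z, w

def block_switches(order, ncz, ncw):
    """Count consecutive (z, w) pairs that fall in different z,w chunk blocks."""
    blocks = [(z // ncz, w // ncw) for z, w in order]
    return sum(a != b for a, b in zip(blocks, blocks[1:]))

nz, nw = 61, 1501   # z and w lengths from the original post
ncz, ncw = 8, 8     # chunk sizes in z and w from the original post

blocked = list(blocked_zw(nz, nw, ncz, ncw))
naive = [(z, w) for w in range(nw) for z in range(nz)]  # w outer, z inner

print("blocked switches:", block_switches(blocked, ncz, ncw))
print("naive switches:  ", block_switches(naive, ncz, ncw))
```

Both orders visit the same (z, w) pairs; the blocked one changes z,w block
only once per block, so a cache holding the chunks of a single block (all of
x and y) should be enough.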

Best regards,
Martin


On Mon, Jun 16, 2014 at 8:26 AM, Ger van Diepen <[email protected]> wrote:

>  Hi Martin,
>
>  Have you set the chunk cache sufficiently large? Otherwise it will
> reread the same chunks again and again. Although the system file cache
> might hold all those data, I think it's better to size the cache correctly
> because of the lookups HDF5 is doing.
>
> E.g. in the case of (*,y,*,*) you'll need a cache of 601*8*61*1501 floats
> (1.64 GB). I assume you have sufficient memory; otherwise you could adjust
> the chunk size, especially in z,w.
>
> Your chunks are not particularly large (16384 bytes) leading to a lot of
> iops and a large B-tree to index the chunks. On the other hand, when
> enlarging the chunks, you'll need more memory for the chunk cache.
>
>  What is the pattern when accessing the data as *,*,z,w? First w, and
> thereafter all z? You'll need a much smaller cache when accessing it like
>
>     for w in 0:nw/ncw    (nw is length of w-axis; ncw is chunk-size in w)
>       for z in 0:nz/ncz
>         for w1 in 0:ncw
>           for z1 in 0:ncz
>
> In this way you handle a full z,w chunk before moving to the next one, so
> your cache size needs to be only 601*482*8*8 floats.
>
>  I have a program testing 3D data sets of arbitrary size and chunk size
> using a cache size depending on the chunk size and access pattern. If you
> like, I can send it.
>
>  Cheers,
>
> Ger
>
> >>> Matthieu Brucher <[email protected]> 6/12/2014 10:56 PM >>>
>
> Hi,
>
> Unfortunately, this is indeed the worst you can have. It's completely
> normal that you have the worst performance with slicing in these
> dimensions. Even with a parallel filesystem, you would need to read
> EVERYTHING from the dataset, and then the library would pick up the
> pieces you need.
> One solution would be to agglomerate several z,w in dimensions 5 and
> 6, so that you still get some performance, but it will be worse than for
> directions 1 or even 2.
>
> Cheers,
>
> Matthieu
>
>
> 2014-06-12 20:43 GMT+01:00 Martin Sarajærvi <[email protected]>:
> > Hi all,
> >
> > I'm working with floating point data building up a very large dataset,
> > typically >100 GB, of four dimensions (x, y, z, w).
> > Dimensions are of the size (x,y,z,w) = (601, 482, 61, 1501) in my
> > example.
> >
> > The aim is to slice (READING ONLY) this dataset in orthogonal directions:
> > 1) (x, *, *, *)
> > 2) (*, y, *, *)
> > 3) (*, *, z, w)
> >
> > When using a contiguous layout I naturally get good performance for
> > directions (1) and (2), however it is very poor for (3).
> > Using a chunking layout of (8,8,8,8) seems to give the best balance so far
> > for reasonable access times in all directions, but still not as fast as I
> > was hoping for. My tests also show that compression improves the read
> > performance slightly.
> >
> > I'm looking for advice on possible optimization techniques to use for
> > this problem other than what has been mentioned.
> > Otherwise, is my only option to move to some (expensive?) parallel
> > solution?
> >
> > Thanks!
> >
> > Regards,
> > Martin
> >
> > _______________________________________________
> > Hdf-forum is for HDF software users discussion.
> > [email protected]
> >
> > http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> > Twitter: https://twitter.com/hdf5
>
>
>
> --
> Information System Engineer, Ph.D.
> Blog: http://matt.eifelle.com
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
> Music band: http://liliejay.com/
>
