Many thanks for your comprehensive recommendations. I think HDF5 views are probably what I need to go with - will read up more and then ask.
What I mean by dimension is rank, really. The shape is always the same for all samples. One slice for storage, i.e., one sample, could be chunked as dset[:,:,i] or dset[:,:,:,:,i], but always of the form dset[:,...,:,i], depending on input to the code at runtime.

Thanks

On 13 September 2016 at 14:47, Erik Schnetter <[email protected]> wrote:

> On Tue, Sep 13, 2016 at 11:27 AM, sparrowhawker <[email protected]> wrote:
>
>> Hi,
>>
>> I'm new to Julia, and have been able to accomplish a lot of what I used
>> to do in Matlab/Fortran in very little time since I started using Julia
>> three months ago. Here's my newest stumbling block.
>>
>> I have a process which creates nsamples within a loop. Each sample takes
>> a long time to compute, say 1 to 10 seconds, because of expensive finite
>> difference operations. I have to store each of the nsamples, and I know
>> the size and dimensions of each of them (all samples have the same size
>> and dimensions). However, depending on the run time parameters, each
>> sample may be a 32x32 image or perhaps a 64x64x64 voxset with 3
>> attributes, i.e., a 64x64x64x3 hyper-rectangle. To be clear, each sample
>> can be a hyper-rectangle of arbitrary dimension, specified at run time.
>>
>> Obviously, since I don't want to lose computation and want to see
>> incremental progress, I'd like to do incremental saves of these samples
>> on disk, instead of waiting to collect all nsamples at the end. For
>> instance, if I had to store 1000 samples of size 64x64, I thought perhaps
>> I could chunk and save 64x64 slices to an HDF5 file 1000 times. Is this
>> the right approach?
>> If so, here's a prototype program to do so, but it depends on my knowing
>> the number of dimensions of the slice, which is not known until runtime:
>>
>> using HDF5
>>
>> filename = "test.h5"
>> # open file
>> fmode = "w"
>> # get a file object
>> fid = h5open(filename, fmode)
>> # matrix to write in chunks
>> B = rand(64,64,1000)
>> # figure out its dimensions
>> sizeTuple = size(B)
>> Ndims = length(sizeTuple)
>> # set up to write in chunks of sizeArray
>> sizeArray = ones(Int, Ndims)
>> [sizeArray[i] = sizeTuple[i] for i in 1:(Ndims-1)] # last entry of sizeArray stays 1, i.e. chunks are [:,...,:,1]
>> # create a dataset "models" within root
>> dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)), "chunk", sizeArray)
>> [dset[:,:,i] = slicedim(B, Ndims, i) for i in 1:size(B, Ndims)]
>> close(fid)
>>
>> This works, but the second last line, dset[:,:,i], requires syntax
>> specific to writing a slice of a dimension 3 array, and I don't know the
>> dimensions until run time. Of course I could just write to a flat binary
>> file incrementally, but HDF5.jl could make my life so much simpler!
>>
>
> HDF5 supports "extensible datasets", which were created for use cases such
> as this one. I don't recall the exact syntax, but if I recall correctly,
> you can specify one dimension (the first one in C, the last one in Julia)
> to be extensible, and then you can add more data as you go. You will
> probably need to specify a chunk size, which could be the size of the
> increment in your case. Given file system speeds, a chunk size smaller than
> a few megabytes probably doesn't make much sense (i.e. it will slow things
> down).
>
> If you want to monitor the HDF5 file as it is being written, look at the
> SWMR (single writer, multiple readers) feature. This requires HDF5 1.10;
> unfortunately, Julia will by default often still install version 1.8.
>
> If you want to protect against crashes of your code so that you don't lose
> progress, then HDF5 is probably not right for you.
> Once an HDF5 file is open for writing, the on-disk state might be
> inconsistent, so that you can lose all data when your code crashes. In
> this case, you might want to write data into different files, one per
> increment. HDF5 1.10 offers "views" (virtual datasets), which are umbrella
> files that stitch together datasets stored in other files.
>
> If you are looking for generic advice for setting up things with HDF5,
> then I recommend their documentation. If you are looking for how to access
> these features in Julia, or if you notice a feature that is not available
> in Julia, then we'll be happy to explain or correct things.
>
> What do you mean by "dimension only known at run time" -- do you mean what
> Julia calls "size" (shape) or what Julia calls "dim" (rank)?
>
> Do all datasets have the same size, or do they differ? If they differ,
> then putting them into the same dataset might not make sense; in this case,
> I would write them into different datasets.
>
> -erik
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
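
On the rank question raised in this thread (writing dset[:,...,:,i] when the number of colons is only known at run time), one possible approach is to build the index as a tuple and splat it. The sketch below is only an illustration, not tested against HDF5.jl: lastdim_index and store_samples are hypothetical names, and a plain Array stands in for the HDF5 dataset so that the example runs without any packages. Since the splatted tuple expands to the same arguments as the literal dset[:,:,i] in the prototype, the same indexing line should carry over to a dataset, but that is an assumption to check.

```julia
# Hypothetical helper (not part of HDF5.jl): build the index tuple
# (:, ..., :, i) with rank-1 colons followed by the sample number i.
lastdim_index(rank::Integer, i::Integer) = (ntuple(_ -> Colon(), rank - 1)..., i)

# Hypothetical driver: a plain Array stands in for the on-disk dataset.
function store_samples(sample_size::Tuple, nsamples::Integer)
    rank = length(sample_size) + 1            # sample rank plus the sample axis
    store = zeros(Float64, sample_size..., nsamples)
    for i in 1:nsamples
        # placeholder for the expensive finite difference computation
        sample = fill(Float64(i), sample_size)
        # same call as store[:, ..., :, i] = sample, for any rank
        store[lastdim_index(rank, i)...] = sample
    end
    return store
end

images = store_samples((32, 32), 5)        # 32x32 samples
voxels = store_samples((8, 8, 8, 3), 4)    # 4-dimensional samples
```

The same tuple can also be used for reading a slice back, e.g. images[lastdim_index(3, 2)...]. Note that slicedim from the prototype was renamed selectdim in Julia 1.0.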
