I have better luck with `inds = fill(:, 3)`.
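For context, a minimal sketch of what I mean, using a plain Julia array instead of an HDF5 dataset (the array `A` and the loop are just for illustration):

```Julia
A = rand(64, 64, 10)              # stand-in data; rank is only known at run time
inds = fill(:, ndims(A) - 1)      # a Vector of Colons, one per leading dimension
for i in 1:size(A, ndims(A))
    slice = A[inds..., i]         # splatting the colons selects A[:, :, i] here
    # ... write `slice` incrementally ...
end
```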
By the way, if anyone appropriate is watching, can we have a sticky post about how to format Julia code here? And is the comprehension form of a one-line "for" loop considered good style? I don't see it in the manual anywhere.

On Tuesday, September 13, 2016 at 9:36:58 PM UTC-4, sparrowhawker wrote:
>
> Cool! The colons approach makes sense to me, followed by splatting.
>
> I'm unfamiliar with the syntax here, but when I try to create a tuple in
> the REPL using
>
> inds = ((:) for i in 1:3)
>
> I get
>
> ERROR: syntax: missing separator in tuple
>
> On 13 September 2016 at 17:27, Erik Schnetter <[email protected]> wrote:
>
>> If you have a varying rank, then you should probably use something like
>> `CartesianIndex` and `CartesianRange` to represent the indices, or
>> possibly tuples of integers. You would then use the splatting operator to
>> create the indexing instructions:
>>
>> ```Julia
>> indrange = CartesianRange(xyz)
>> dset[indrange..., i] = slicedim
>> ```
>>
>> I don't know whether the expression `indrange...` works as-is, or whether
>> you have to manually create a tuple of `UnitRange`s.
>>
>> If you want to use colons, then you'd write
>>
>> ```Julia
>> inds = ((:) for i in 1:rank)
>> dset[inds..., i] = xyz
>> ```
>>
>> -erik
>>
>> On Tue, Sep 13, 2016 at 5:08 PM, Anandaroop Ray <[email protected]> wrote:
>>
>>> Many thanks for your comprehensive recommendations. I think HDF5 views
>>> are probably what I need to go with - will read up more and then ask.
>>>
>>> What I mean by dimension is rank, really. The shape is always the same
>>> for all samples. One slice for storage, i.e., one sample, could be
>>> chunked as dset[:,:,i] or dset[:,:,:,:,i], but always of the form
>>> dset[:, ..., :, i], depending on input to the code at run time.
>>>
>>> Thanks
>>>
>>> On 13 September 2016 at 14:47, Erik Schnetter <[email protected]> wrote:
>>>
>>>> On Tue, Sep 13, 2016 at 11:27 AM, sparrowhawker <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm new to Julia, and in the three months since I started using it I
>>>>> have been able to accomplish, in very little time, a lot of what I
>>>>> used to do in Matlab/Fortran. Here's my newest stumbling block.
>>>>>
>>>>> I have a process which creates nsamples within a loop. Each sample
>>>>> takes a long time to compute (say 1 to 10 seconds), as there are
>>>>> expensive finite-difference operations which ultimately lead to a
>>>>> sample. I have to store each of the nsamples, and I know the size and
>>>>> dimensions of each of the nsamples (all samples have the same size
>>>>> and dimensions). However, depending on the run-time parameters, each
>>>>> sample may be a 32x32 image or perhaps a 64x64x64 voxset with 3
>>>>> attributes, i.e., a 64x64x64x3 hyper-rectangle. To be clear, each
>>>>> sample can be a hyper-rectangle of arbitrary dimension, specified at
>>>>> run time.
>>>>>
>>>>> Obviously, since I don't want to lose computation and want to see
>>>>> incremental progress, I'd like to do incremental saves of these
>>>>> samples to disk, instead of waiting to collect all nsamples at the
>>>>> end. For instance, if I had to store 1000 samples of size 64x64, I
>>>>> thought perhaps I could chunk and save 64x64 slices to an HDF5 file
>>>>> 1000 times. Is this the right approach?
>>>>> If so, here's a prototype program to do so, but it depends on my
>>>>> knowing the number of dimensions of the slice, which is not known
>>>>> until run time:
>>>>>
>>>>> using HDF5
>>>>>
>>>>> filename = "test.h5"
>>>>> # open the file in write mode and get a file object
>>>>> fmode = "w"
>>>>> fid = h5open(filename, fmode)
>>>>> # matrix to write in chunks
>>>>> B = rand(64, 64, 1000)
>>>>> # figure out its dimensions
>>>>> sizeTuple = size(B)
>>>>> Ndims = length(sizeTuple)
>>>>> # set up to write in chunks of sizeArray; the last entry stays 1,
>>>>> # so each chunk is one slice along the last dimension
>>>>> sizeArray = ones(Int, Ndims)
>>>>> [sizeArray[i] = sizeTuple[i] for i in 1:(Ndims-1)]
>>>>> # create a dataset "models" within root
>>>>> dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)),
>>>>>                 "chunk", sizeArray)
>>>>> [dset[:,:,i] = slicedim(B, Ndims, i) for i in 1:size(B, Ndims)]
>>>>> close(fid)
>>>>>
>>>>> This works, but the second-to-last line, dset[:,:,i] = ..., uses
>>>>> syntax specific to writing a slice of a three-dimensional array, and
>>>>> I don't know the dimensions until run time. Of course I could just
>>>>> write to a flat binary file incrementally, but HDF5.jl could make my
>>>>> life so much simpler!
>>>>
>>>> HDF5 supports "extensible datasets", which were created for use cases
>>>> such as this one. I don't recall the exact syntax, but if I recall
>>>> correctly, you can specify one dimension (the first one in C, the last
>>>> one in Julia) to be extensible, and then you can add more data as you
>>>> go. You will probably need to specify a chunk size, which could be the
>>>> size of the increment in your case. Given file system speeds, a chunk
>>>> size smaller than a few megabytes probably doesn't make much sense
>>>> (i.e., it will slow things down).
>>>>
>>>> If you want to monitor the HDF5 file as it is being written, look at
>>>> the SWMR feature. This requires HDF5 1.10; unfortunately, Julia will
>>>> by default often still install version 1.8.
>>>>
>>>> If you want to protect against crashes of your code so that you don't
>>>> lose progress, then HDF5 is probably not right for you. Once an HDF5
>>>> file is open for writing, the on-disk state might be inconsistent, so
>>>> you can lose all data when your code crashes. In this case, you might
>>>> want to write data into different files, one per increment. HDF5 1.10
>>>> offers "views", which are umbrella files that stitch together datasets
>>>> stored in other files.
>>>>
>>>> If you are looking for generic advice on setting things up with HDF5,
>>>> then I recommend their documentation. If you are looking for how to
>>>> access these features in Julia, or if you notice a feature that is not
>>>> available in Julia, then we'll be happy to explain or correct things.
>>>>
>>>> What do you mean by "dimension only known at run time" -- do you mean
>>>> what Julia calls "size" (shape) or what Julia calls "ndims" (rank)?
>>>>
>>>> Do all datasets have the same size, or do they differ? If they differ,
>>>> then putting them into the same dataset might not make sense; in this
>>>> case, I would write them into different datasets.
>>>>
>>>> -erik
>>>>
>>>> --
>>>> Erik Schnetter <[email protected]>
>>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>
>>
>> --
>> Erik Schnetter <[email protected]>
>> http://www.perimeterinstitute.ca/personal/eschnetter/
>
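For the record, here is a sketch of how the quoted prototype could be made rank-agnostic with the `fill(:, N)` trick above. The `d_create` and `slicedim` calls are copied from the prototype (Julia 0.5-era HDF5.jl), and the filename and array sizes are just placeholders; treat it as untested:

```Julia
using HDF5

B = rand(64, 64, 1000)                     # stand-in data; any rank should work
fid = h5open("test.h5", "w")
Ndims = ndims(B)
# chunk by one slice along the last dimension
chunkSize = [size(B)[1:Ndims-1]..., 1]
dset = d_create(fid, "models", datatype(Float64), dataspace(size(B)),
                "chunk", chunkSize)
inds = fill(:, Ndims - 1)                  # one Colon per leading dimension
for i in 1:size(B, Ndims)
    dset[inds..., i] = slicedim(B, Ndims, i)   # same as dset[:,:,i] = ... for rank 3
end
close(fid)
```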
