Hi Leigh,
Sorry for the additional delay, I'm a little swamped with some
contractual stuff and SC-related issues today. I'll get something back to you
tomorrow.
Quincey
On Nov 9, 2012, at 11:36 AM, Leigh Orf <[email protected]> wrote:
> A major part of my I/O strategy for massively parallel supercomputers (such
> as the new Blue Waters Cray XE6 machine) is doing buffered file writes. It
> turns out that our cloud model only takes up a small fraction of the
> available memory on a node, so we can buffer dozens of files to memory before
> we have to hit the file system, dramatically reducing I/O wall-clock time.
>
> I am getting some strange behavior with the core driver, however. On some
> machines and with some compilers, it works great. One problem that I am
> having consistently on Blue Waters using the Cray compilers is that the
> amount of memory being chewed up at every h5dwrite is way, way larger than
> the actual size of the data arrays being written. Because I have limited
> access to the machine right now, I have not tested it with other compilers.
>
> Specific example of odd behavior:
>
> First, here is how the data is stored in each file. The output below only
> covers two time levels (there are many more in the file). Note: the group
> 00000 is for time = 0 seconds, the group 00020 is for time = 20 seconds, etc.
>
> h2ologin1:% h5ls -rv cm1out.00000_000000.cm1hdf5 | grep 3d
>
> /00000/3d Group
> /00000/3d/dbz Dataset {250/250, 60/60, 60/60}
> /00000/3d/dissten Dataset {250/250, 60/60, 60/60}
> /00000/3d/khh Dataset {250/250, 60/60, 60/60}
> [...]
> /00020/3d Group
> /00020/3d/dbz Dataset {250/250, 60/60, 60/60}
> /00020/3d/dissten Dataset {250/250, 60/60, 60/60}
> /00020/3d/khh Dataset {250/250, 60/60, 60/60}
> [...]
>
> and so on.
>
> Data is gathered to one core on each 16-core shared-memory module, so only
> one core per module buffers to memory and writes to disk. Time groups are
> created, data is written, groups are closed, new groups are created, and so
> on. This continues until I decide we've used up enough memory, at which
> point I close the final groups and finally the file itself with a call to
> h5fclose. Backing store is on, so when the file is closed, its contents are
> flushed to disk. As I understand it, once this is done, all the memory the
> file occupied should be freed.
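>
> In rough outline, one buffered write cycle looks like this (a simplified
> sketch, not the exact production code; only one of the 41 fields is shown):
>
> integer(hid_t)   :: tgroup_id, group_id, space_id, dset_id
> integer(hsize_t) :: dims(3)
> real             :: dbz(60,60,250)   ! local subdomain (h5ls lists it in C order: {250, 60, 60})
>
> dims = (/ 60, 60, 250 /)
>
> ! one group per time level (e.g. "00000"), with a "3d" group inside it
> call h5gcreate_f(file_3d_id, '00000', tgroup_id, ierror); check_err(ierror)
> call h5gcreate_f(tgroup_id, '3d', group_id, ierror); check_err(ierror)
>
> ! one dataset per 3d field; the other fields follow the same pattern
> call h5screate_simple_f(3, dims, space_id, ierror); check_err(ierror)
> call h5dcreate_f(group_id, 'dbz', H5T_NATIVE_REAL, space_id, dset_id, ierror)
> check_err(ierror)
> call h5dwrite_f(dset_id, H5T_NATIVE_REAL, dbz, dims, ierror); check_err(ierror)
> call h5dclose_f(dset_id, ierror); check_err(ierror)
> call h5sclose_f(space_id, ierror); check_err(ierror)
>
> ! close this time level and start the next one...
> call h5gclose_f(group_id, ierror); check_err(ierror)
> call h5gclose_f(tgroup_id, ierror); check_err(ierror)
>
> ! ...until the memory budget is reached; with backing_store = .true. this
> ! is the point where the in-core image actually hits the file system
> call h5fclose_f(file_3d_id, ierror); check_err(ierror)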
>
> The problem: in a recent simulation, I wrote 41 3d fields per time level,
> so each buffered time level should take up roughly
>
> 250*60*60*41*4 bytes = about 150 MB.
>
> As part of my code, I query /proc/meminfo (these machines run Linux) on each
> node to see how much memory is being used and how much is available, and I
> output the values after each buffer to memory. I keep track of what I call
> global_free, which is MemFree + Buffers + Cached, and do an MPI_REDUCE that
> picks the smallest value across nodes (there are small variations in
> available memory from node to node, but the results would be nearly
> identical if I just looked at any single node).
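>
> The bookkeeping is roughly the following (a stripped-down sketch of what the
> code does; the subroutine name is just for illustration):
>
> subroutine report_global_free(comm)
>   use mpi
>   implicit none
>   integer, intent(in) :: comm
>   integer(kind=8) :: memfree, buffers, cached, global_free, min_free
>   character(len=256) :: line
>   integer :: ios, ierr, rank
>
>   memfree = 0; buffers = 0; cached = 0
>   open(unit=77, file='/proc/meminfo', status='old', action='read')
>   do
>      read(77, '(a)', iostat=ios) line
>      if (ios /= 0) exit
>      if (line(1:8) == 'MemFree:') read(line(9:), *) memfree   ! kB
>      if (line(1:8) == 'Buffers:') read(line(9:), *) buffers   ! kB
>      if (line(1:7) == 'Cached:')  read(line(8:), *) cached    ! kB
>   end do
>   close(77)
>
>   ! smallest value across all writing ranks
>   global_free = memfree + buffers + cached
>   call MPI_Reduce(global_free, min_free, 1, MPI_INTEGER8, MPI_MIN, 0, comm, ierr)
>   call MPI_Comm_rank(comm, rank, ierr)
>   if (rank == 0) write(*,*) rank, 'global_free = ', min_free
> end subroutine report_global_free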
>
> With no compression and no chunking, I see the following value of global_free
> after each buffered write, which, remember, should be using up around 150 MB:
>
> 0 global_free = 60268020
> 0 global_free = 57186776
> 0 global_free = 53716128
> 0 global_free = 51117500
> 0 global_free = 48013960
> 0 global_free = 44306108
>
> etc. etc.
>
> Those values are in kB - so, for instance, we went from 60.2 GB to 57.1 GB
> (chewed up about 3GB) after writing 150 MB of data!
>
> I do not see this behavior on all machines, and I'm not sure it's an HDF5
> bug (it could be a Cray bug ... and we have submitted a bug report to Cray).
> But because I have seen flakiness with the core driver beyond this example,
> and there is precious little documentation on it, I wanted to ask whether
> anyone had any ideas on how to troubleshoot this problem. Note that this is
> with HDF5 1.8.8, the latest version installed on Blue Waters.
>
> Note that once the file is flushed to disk, its size is exactly what it
> should be given the array sizes, and the data itself is exactly what it
> should be.
>
> Finally, when I comment out only the h5dwrite call in the 3D write
> subroutine and leave everything else the same, memory usage is essentially
> flat, so it's not a memory leak on my part. I've experimented with and
> without chunking, and with and without compression. Turning gzip compression
> on (with chunking, of course) seems to take up a little less memory per
> buffered write, but still far more than it should.
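>
> For reference, "with chunking and compression" means the datasets get a
> creation property list along these lines (again a sketch; the chunk size
> and gzip level here are just illustrative):
>
> chunkdims = (/ 60, 60, 25 /)   ! illustrative chunk size
> call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierror); check_err(ierror)
> call h5pset_chunk_f(dcpl_id, 3, chunkdims, ierror); check_err(ierror)
> call h5pset_deflate_f(dcpl_id, 1, ierror); check_err(ierror)
> call h5dcreate_f(group_id, 'dbz', H5T_NATIVE_REAL, space_id, dset_id, ierror, dcpl_id)
> check_err(ierror)
> call h5pclose_f(dcpl_id, ierror); check_err(ierror)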
>
> Here is how I am initializing the files:
>
> backing_store = .true.
> blocksize = 4096
> call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, ierror); check_err(ierror)
> call h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror)
> check_err(ierror)
> call h5fcreate_f(trim(filename), H5F_ACC_TRUNC_F, file_3d_id, ierror, &
>                  access_prp=plist_id)
> check_err(ierror)
> call h5pclose_f(plist_id, ierror); check_err(ierror)
>
> I am not calling h5pset_alignment_f, and I cannot recall why I chose 4096
> bytes for the memory increment size.
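>
> (For what it's worth, that 4096 is the increment argument to
> h5pset_fapl_core_f, i.e. the amount by which the in-core file image grows
> each time it needs more room. A larger value would look like the following;
> I have not tested whether it changes the memory behavior:)
>
> blocksize = 16*1024*1024   ! grow the in-core image 16 MB at a time instead of 4 KB
> call h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror); check_err(ierror)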
>
> Thanks for any pointers.
>
> Leigh
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org