[also sent to [email protected]]

I have been making heavy use of the core driver for serial HDF5, lately on the Blue Waters supercomputer. I am having some perplexing performance issues that are causing me to seriously reconsider this approach, which allows HDF5 files to be buffered in memory and flushed to disk at a later time (avoiding hitting the filesystem frequently).
I am doing large-scale cloud modeling, saving (primarily) three-dimensional floating-point arrays at equally spaced time intervals (say, every 5 seconds of model integration). My model writes thousands of HDF5 files concurrently, with one core on each shared-memory node tasked with making the HDF5 calls. After, say, 50 buffered writes, the model closes the file (backing store is on) and the file is flushed to disk. The problem is not the actual flush-to-disk performance, but the buffered writes themselves.

Lately I have run into a perplexing problem. I have been saving data much more frequently than usual, as I need very high temporal resolution for my current study. What I am seeing: shortly after the model starts running, performance in the I/O section of the code, where each I/O core writes to memory through the core driver, is very good, what you would expect from I/O that stays entirely in memory. After the model has run a while and done a couple of flushes to disk, I notice two things.

First, the amount of memory used by HDF5 increases with time, even though we have ostensibly freed it after the files were written to disk. I keep tabs on /proc/meminfo on each node and watch available memory, active memory, buffered memory, etc. What I have found is that a great deal of memory is never completely freed after files are written to disk. I have also noticed a huge memory overhead with the core driver: I may be buffering, say, 4 GB worth of 3D floating-point arrays in memory, yet something like 5-10 times that much memory is being used by HDF5 (the model itself allocates all of its memory up front, so aside from, perhaps, MPI there is no allocation/deallocation going on other than what HDF5 is doing). Even though I see some memory recovered after flushing to disk, and the Linux kernel may be partly at fault here, I have run into OOM situations where the model is killed because the node runs out of memory, and this after buffering only about 4 GB of writes on a machine with 64 GB to play with. The only workaround I have found is to buffer far less data in memory than I really want to. That is the first major issue.

Now, the performance issue. I have written an unpublished technical document that describes my I/O strategy; page 4 gives pseudocode for the I/O cycle I use (see here: http://orf5.com/bw/cm1tools-March2013.pdf). Essentially, I create a new top-level group (a zero-padded integer representing the current model time), a subgroup (called 3D; right now there is only one subgroup at this level), and finally a subgroup below that named after the actual data being stored; there are usually 10 or so floating-point arrays (a rough sketch of one such write appears just below). Over time, of course, the number of groups grows, but to what I think are manageable numbers; we are talking hundreds of groups per file, not tens of thousands.

Over time, the time it takes to do the buffered writes (I assume this is happening in H5Dwrite) increases dramatically. I have not done profiling, but I have watched this happen, as I send unbuffered "heartbeat" information to standard out during the simulation. The I/O section takes, say, 4-5 times longer as the model progresses (and remember, this is the buffered write section; we are not actually writing to disk here). This is unacceptable, as time is a precious commodity on a supercomputer!
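For concreteness, one buffered write in that cycle looks roughly like the following. The file name, group names, the "w" variable, and the array dimensions are placeholders standing in for my actual code, and the chunking/compression settings are the ones described below:

  program core_io_sketch
    use hdf5
    implicit none

    integer(hid_t)   :: fapl_id, file_id, tgrp_id, g3d_id, vgrp_id
    integer(hid_t)   :: dcpl_id, space_id, dset_id
    integer(hsize_t) :: dims(3)  = (/160, 160, 100/)   ! full per-node array gathered to the I/O core
    integer(hsize_t) :: chunk(3) = (/40, 40, 100/)     ! per-core subdomain used as the chunk size
    integer(size_t)  :: blocksize
    real, allocatable :: buf(:,:,:)
    integer :: ierror

    call h5open_f(ierror)

    ! Core (in-memory) driver with backing store on, 100 MB increment
    blocksize = 1024*1024*100
    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierror)
    call h5pset_fapl_core_f(fapl_id, blocksize, .true., ierror)

    call h5fcreate_f('cm1out_000001.h5', H5F_ACC_TRUNC_F, file_id, ierror, &
                     access_prp=fapl_id)

    ! Chunked, gzip level-1 dataset creation property list
    call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierror)
    call h5pset_chunk_f(dcpl_id, 3, chunk, ierror)
    call h5pset_deflate_f(dcpl_id, 1, ierror)

    allocate(buf(dims(1), dims(2), dims(3)))
    buf = 0.0

    ! One buffered write: /000300/3D/w/w  (zero-padded model time, "3D", variable name)
    call h5gcreate_f(file_id, '000300', tgrp_id, ierror)
    call h5gcreate_f(tgrp_id, '3D', g3d_id, ierror)
    call h5gcreate_f(g3d_id, 'w', vgrp_id, ierror)

    call h5screate_simple_f(3, dims, space_id, ierror)
    call h5dcreate_f(vgrp_id, 'w', H5T_NATIVE_REAL, space_id, dset_id, ierror, dcpl_id)
    call h5dwrite_f(dset_id, H5T_NATIVE_REAL, buf, dims, ierror)   ! goes to memory, not disk

    call h5dclose_f(dset_id, ierror);  call h5sclose_f(space_id, ierror)
    call h5gclose_f(vgrp_id, ierror);  call h5gclose_f(g3d_id, ierror)
    call h5gclose_f(tgrp_id, ierror)

    ! ... repeated for the other variables and for ~50 time levels ...

    call h5fclose_f(file_id, ierror)   ! backing store on: entire file image flushed to disk here
    call h5pclose_f(dcpl_id, ierror);  call h5pclose_f(fapl_id, ierror)
    call h5close_f(ierror)
  end program core_io_sketch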
I do have chunking and gzip compression (level 1) turned on for my floating-point arrays, and I am choosing what I think are logical chunking parameters: I simply use as the chunk dimensions the dimensions of the original array that each core operates on. So, if I am writing an array on a node that is 160x160x100 and I have 16 cores (in a 4x4 decomposition), I collect the data on the I/O core and set the chunk dimensions to 40x40x100.

I am getting to the point where I am seriously considering doing my own buffering and not using the core driver. I would allocate all of the buffer space up front with a regular F95 ALLOCATE, and then at I/O time just loop through and blast everything to disk, using the exact same group structure I currently use (see the sketch in the P.S. below). Before I go to the trouble of doing this, however, I want to see whether there is a way around my problems (one that doesn't involve exhaustive profiling; I just don't have time to figure out how to profile HDF5 right now), or at least get some confidence that my problems won't continue once I stop using the core driver. The core driver is really neat and I like it, but these weird issues with memory bloat, and now strange performance issues that are a function of how long the model has been running, have me wishing to try something else.

FYI, I have tried h5_garbage_collect with no discernible change in performance. Also:

  blocksize = 1024*1024*100
  CALL h5pset_fapl_core_f(plist_id, blocksize, backing_store, ierror)

I have played with different block sizes before. Because I am using compression, and historically I have not always saved the same number of time levels to each file, I am not always sure how large my data will be, so I have chosen a block size of 100 MB, which seems like a good compromise between too large and too small. But I really don't completely understand what this setting does, and perhaps it has something to do with the memory issues (which is another reason I am leaning toward doing my own buffering, since this is not something that needs to be set with the standard driver). I did notice that if I chose a much larger block size, I ended up with a huge amount of padding tacked onto the end of the written HDF5 file.

Leigh

--
Leigh Orf
Chair, Department of Earth and Atmospheric Sciences
Central Michigan University
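P.S. In case it clarifies what I mean by "doing my own buffering," here is the rough shape of it, with only one variable shown. Names, sizes, and the number of buffered time levels are placeholders; the caller would have already called h5open_f and allocated the buffer up front:

  ! Buffer the 3D fields in plain Fortran arrays, then write straight to disk at
  ! flush time with the default driver (no core FAPL), keeping the same group layout.
  module my_buffering_sketch
    use hdf5
    implicit none
    integer, parameter :: nbuf = 50                ! buffered time levels per file
    real, allocatable  :: wbuf(:,:,:,:)            ! (nx,ny,nz,nbuf), allocated once up front
  contains
    subroutine flush_to_disk(fname, times)
      character(len=*), intent(in) :: fname
      integer, intent(in) :: times(nbuf)           ! model times used for the group names
      integer(hid_t)   :: file_id, tgrp, g3d, vgrp, dcpl, space, dset
      integer(hsize_t) :: dims(3), chunk(3) = (/40, 40, 100/)
      integer :: i, ierr
      character(len=6) :: tname

      dims = shape(wbuf(:,:,:,1))
      call h5fcreate_f(fname, H5F_ACC_TRUNC_F, file_id, ierr)   ! default driver
      call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, ierr)
      call h5pset_chunk_f(dcpl, 3, chunk, ierr)
      call h5pset_deflate_f(dcpl, 1, ierr)

      do i = 1, nbuf
        write(tname, '(i6.6)') times(i)                         ! zero-padded time group
        call h5gcreate_f(file_id, tname, tgrp, ierr)
        call h5gcreate_f(tgrp, '3D', g3d, ierr)
        call h5gcreate_f(g3d, 'w', vgrp, ierr)                  ! 'w' is a placeholder variable name
        call h5screate_simple_f(3, dims, space, ierr)
        call h5dcreate_f(vgrp, 'w', H5T_NATIVE_REAL, space, dset, ierr, dcpl)
        call h5dwrite_f(dset, H5T_NATIVE_REAL, wbuf(:,:,:,i), dims, ierr)
        call h5dclose_f(dset, ierr);  call h5sclose_f(space, ierr)
        call h5gclose_f(vgrp, ierr);  call h5gclose_f(g3d, ierr);  call h5gclose_f(tgrp, ierr)
      end do

      call h5pclose_f(dcpl, ierr)
      call h5fclose_f(file_id, ierr)
    end subroutine flush_to_disk
  end module my_buffering_sketch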
