Hi Rob,

On Friday 08 January 2010 16:10:01, Rob Latham wrote:
> On Thu, Jan 07, 2010 at 08:30:57PM +0100, Francesc Alted wrote:
> > What I want to stress during the workshop is the dependency of I/O
> > throughput on the chunksize for a certain dataset.  For making the
> > plots that I've got (attached), I have chosen a dataset of 2 GB
> > (2-dim, shape is (512, 65536) and datatype is double precision) so
> > that it can easily fit into my OS cache memory (my machine has 8 GB)
> > and make the effects clearer.  On the X axis, I represent the
> > chunksize for every dataset (from 1 KB up to 8 MB).  On the Y axis
> > there is the performance for reading the dataset sequentially.
>
> I'd appreciate a bit more explanation of your methodology.  You want
> to test *I/O throughput* but at the same time you want to make sure
> the data fits in memory cache.  Are you not then just testing memory
> bandwidth?
Nope.  I'd like to characterize a situation where I can maximize both
sequential and 'semi-random' access to a file.  By 'semi-random' I mean an
access mode that performs random accesses within a certain row and then
repeats the operation with other rows.  As the shape of my 2 GB dataset is
(512, 524288) -- the shape stated in my previous message was wrong, sorry --,
I need a chunk cache of at least 4 MB in order to get good access times in
this 'semi-random' mode.  I'm attaching a couple of plots showing how an
8 MB cache works much better than the default cache size of 1 MB.

Unfortunately, choosing 8 MB has an important impact on the sequential
access mode, as explained in the OP.  The thing is that I don't completely
understand why this is so (i.e. I don't understand well how the HDF5 chunk
cache works ;-).  I'm attaching the script that I'm using for this case, in
case it helps to clarify things.  It is written in Python/PyTables, but I
think it is simple enough to be understood, at least at a high level.

> If I were running this benchmark I would be purging the memory cache
> between every run: the chunk cache is designed to improve disk
> performance, right?

That's an interesting idea.  How can the HDF5 chunk cache be purged?
However, my latest profiles don't suggest this would help.  I sent those
profiles to this list, but as they are screenshots of the kcachegrind
graphical tool, the total size of the message exceeded what is allowed on
this forum.  I'll try to reduce the message size and send it again.

Thanks,

--
Francesc Alted
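For reference, a back-of-the-envelope sketch of where the 4 MB figure comes
from, assuming chunkshapes of the form (1, c) as in the script further down:

# Why the 'semi-random' pattern wants a chunk cache of at least 4 MB.
N, M = 512, 2**19              # dataset shape (rows, columns)
itemsize = 8                   # float64
row_bytes = M * itemsize       # 2**19 * 8 bytes = 4 MiB per row
print "one row =", row_bytes / 2**20, "MiB"

# With a chunkshape of (1, c), a single row spans ceil(M/c) chunks whose
# payload adds up to those same 4 MiB.  Random reads confined to one row
# revisit the same chunks over and over, so a cache of >= 4 MiB keeps them
# all resident after the first pass, whereas the default 1 MiB cache holds
# only a quarter of the row and keeps evicting chunks that are needed again.
c = 2**16                      # e.g. a 512 KB chunk (2**16 float64 elements)
chunks_per_row = (M + c - 1) // c
print chunks_per_row, "chunks/row x", (c * itemsize) / 2**10, "KB =", \
    (chunks_per_row * c * itemsize) / 2**20, "MiB to keep cached"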
Attachments:
  random-1MB.pdf  (plot: 'semi-random' reads with the default 1 MB chunk cache)
  random-8MB.pdf  (plot: 'semi-random' reads with an 8 MB chunk cache)
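In case it is useful, here is a minimal sketch of how such a 1 MB vs 8 MB
comparison could be driven from PyTables.  It assumes a PyTables version
whose tables.parameters module exposes CHUNK_CACHE_SIZE (the per-dataset
HDF5 raw chunk cache size) and that the value is honoured for files opened
afterwards; the file path is the one used in the script below:

import numpy, tables
from time import time

def random_read(filename, cache_bytes, n_rows=128, n_cols=256):
    """Time the 'semi-random' pattern with a given HDF5 chunk cache size."""
    # Assumption: tables.parameters.CHUNK_CACHE_SIZE is picked up when the
    # file (and hence the dataset) is opened after this point.
    tables.parameters.CHUNK_CACHE_SIZE = cache_bytes
    f = tables.openFile(filename, 'r')
    e = f.root.earray
    numpy.random.seed(1)
    i_index = numpy.random.randint(0, e.shape[0], n_rows)
    j_index = numpy.random.randint(0, e.shape[1], n_cols)
    t1 = time()
    for i in i_index:            # pick a row...
        for j in j_index:        # ...then read random elements within it
            t = e[i, j]
    elapsed = round(time() - t1, 3)
    f.close()
    return elapsed

filename = '/scratch2/faltet/data.nobackup/test.h5'
for cache in (1*1024*1024, 8*1024*1024):
    print "chunk cache = %d MB -> random read time: %s s" % \
        (cache / 2**20, random_read(filename, cache))

Setting the parameter before each openFile() call keeps the two runs
otherwise identical, so any difference in the timings should come from the
chunk cache size alone.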
"""Small benchmark on the effect of chunksizes and compression on HDF5 files. Francesc Alted 2007-11-25 """ import os, math, subprocess from time import time import numpy import tables # Size of dataset #N, M = 512, 2**16 # 256 MB #N, M = 512, 2**18 # 1 GB N, M = 512, 2**19 # 2 GB #N, M = 2000, 1000000 # 15 GB #N, M = 4000, 1000000 # 30 GB datom = tables.Float64Atom() # elements are double precision def quantize(data, least_significant_digit): """quantize data to improve compression. data is quantized using around(scale*data)/scale, where scale is 2**bits, and bits is determined from the least_significant_digit. For example, if least_significant_digit=1, bits will be 4.""" precision = 10.**-least_significant_digit exp = math.log(precision,10) if exp < 0: exp = int(math.floor(exp)) else: exp = int(math.ceil(exp)) bits = math.ceil(math.log(10.**-exp,2)) scale = 2.**bits return numpy.around(scale*data)/scale def get_db_size(filename): sout = subprocess.Popen("ls -sh %s" % filename, shell=True, stdout=subprocess.PIPE).stdout line = [l for l in sout][0] return line.split()[0] def bench(chunkshape, filters): numpy.random.seed(1) # to have reproductible results #filename = '/oldScr/ivilata/pytables/data.nobackup/test.h5' filename = '/scratch2/faltet/data.nobackup/test.h5' #filename = '/scratch1/faltet/test.h5' f = tables.openFile(filename, 'w') e = f.createEArray(f.root, 'earray', datom, shape=(0, M), filters = filters, chunkshape = chunkshape) # Fill the array t1 = time() for i in xrange(N): #e.append([numpy.random.rand(M)]) # use this for less compressibility e.append([quantize(numpy.random.rand(M), 6)]) #os.system("sync") # needs to be root print "Creation time:", round(time()-t1, 3), filesize = get_db_size(filename) filesize_bytes = os.stat(filename)[6] print "\t\tFile size: %d -- (%s)" % (filesize_bytes, filesize) # Read in sequential mode: e = f.root.earray t1 = time() # Flush everything to disk and flush caches #os.system("sync; echo 1 > /proc/sys/vm/drop_caches") # root for row in e: t = row print "Sequential read time:", round(time()-t1, 3), f.close() return # Read in random mode: i_index = numpy.random.randint(0, N, 128) j_index = numpy.random.randint(0, M, 256) # Flush everything to disk and flush caches #os.system("sync; echo 1 > /proc/sys/vm/drop_caches") # root t1 = time() for i in i_index: for j in j_index: t = e[i,j] print "\tRandom read time:", round(time()-t1, 3) f.close() # Benchmark with different chunksizes and filters #for complib in (None, 'zlib', 'lzo', 'blosc'): # needs 'lzo' and 'blosc' for complib in (None, 'zlib'): if complib: filters = tables.Filters(complevel=5, complib=complib) else: filters = tables.Filters(complevel=0) print "8<--"*20, "\nFilters:", filters, "\n"+"-"*80 for ecs in range(10, 24): chunksize = 2**ecs chunk1 = 1 chunk2 = chunksize/datom.itemsize if chunk2 > M: chunk1 = chunk2 / M chunk2 = M chunkshape = (chunk1, chunk2) cs_str = str(chunksize / 1024) + " KB" print "***** Chunksize:", cs_str, "/ Chunkshape:", chunkshape, "*****" bench(chunkshape, filters)