Hi Rob,

On Friday 08 January 2010 16:10:01, Rob Latham wrote:
> On Thu, Jan 07, 2010 at 08:30:57PM +0100, Francesc Alted wrote:
> > What I want to stress during the workshop is the dependency of I/O
> > throughput on the chunksize for a certain dataset.  For making the plots
> > that I've got (attached), I have chosen a dataset of 2 GB (2-dim, shape
> > is (512, 65536) and datatype is double precision) so that it can easily
> > fit into my OS cache memory (my machine has 8 GB) and make the effects
> > clearer.  In the X axis, I represent the chunksize for every dataset
> > (from 1 KB up to 8 MB).  In the Y axis there is the performance for
> > reading the dataset sequentially.
> 
> I'd appreciate a bit more explanation of your methodology.  You want
> to test *I/O throughput* but at the same time you want to make sure
> the data fits in memory cache.  Are you not then just testing memory
> bandwidth?

Nope.  I'd like to characterize a situation where I can maximize performance for both 
sequential and 'semi-random' access to a file.  By 'semi-random' I mean an 
access mode that performs random accesses within a certain row and then repeats the 
operation with other rows.  As the shape of my 2 GB dataset is (512, 524288) 
--the shape stated in my previous message was wrong, sorry--, each row takes 
524288 x 8 bytes = 4 MB, so I need a chunk cache of at least 4 MB in order to 
minimize the access time in such a 'semi-random' mode.  I'm attaching a couple of 
plots showing that the 8 MB cache works much better than the default cache size of 1 MB.
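
In case it helps to see what I mean by 'semi-random' access and by changing the 
cache size, here is a minimal sketch, assuming that the CHUNK_CACHE_SIZE parameter 
from tables/parameters.py can be passed as a keyword to openFile() to override the 
chunk cache per file (that is what I remember from PyTables 2.1 onwards); the path 
and the loop sizes are only illustrative:

import numpy
import tables
from time import time

filename = '/scratch2/faltet/data.nobackup/test.h5'   # file created by the script below

for cache_size in (1*1024**2, 8*1024**2):   # default 1 MB vs 8 MB chunk cache
    # CHUNK_CACHE_SIZE should map to the HDF5 per-dataset chunk cache size
    f = tables.openFile(filename, 'r', CHUNK_CACHE_SIZE=cache_size)
    e = f.root.earray
    numpy.random.seed(1)
    t1 = time()
    # 'semi-random': random columns inside one row, then move on to another row
    for i in numpy.random.randint(0, e.shape[0], 16):
        for j in numpy.random.randint(0, e.shape[1], 256):
            t = e[i, j]
    print "chunk cache %d MB --> %.3f s" % (cache_size // 1024**2, time() - t1)
    f.close()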

Unfortunately, choosing 8 MB has an important impact on the sequential 
access mode, as explained in the OP.  The thing is that I don't completely 
understand why this is so (i.e. I don't understand well how the HDF5 chunk 
cache works ;-).

I'm attaching the script that I'm using for this case, in case that helps to 
clarify things.  It is written in Python/PyTables, but I think it is simple 
enough to be understood, at least at a high level.

> If I were running this benchmark I would be purging the memory cache
> between every run: the chunk cache is designed to improve disk
> performance, right?

That's an interesting idea.  How can the HDF5 chunk cache be purged?  However, 
my latest profiles don't suggest this would help.  I've sent these profiles to 
this list, but as they are screenshots of the kcachegrind graphical tool, the 
total size of the message exceeds what is allowed on this forum.  I'll try to 
reduce the message size and send it again.
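
The only ways I know to start from a cold state are closing and re-opening the 
file (my understanding is that the chunk cache lives with the open file and is 
discarded on close) and, for the OS page cache, the drop_caches trick that is 
already commented out in the script.  A minimal sketch, assuming root privileges 
for the echo and with a helper name that is just illustrative:

import os
import tables

def reopen_cold(f, filename):
    """Re-open `filename` with cold caches: closing the file discards the
    HDF5 chunk cache, and dropping the OS page cache requires root."""
    f.close()
    os.system("sync; echo 1 > /proc/sys/vm/drop_caches")   # needs to be root
    return tables.openFile(filename, 'r')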

Thanks,

-- 
Francesc Alted

Attachment: random-1MB.pdf
Description: Adobe PDF document

Attachment: random-8MB.pdf
Description: Adobe PDF document

"""Small benchmark on the effect of chunksizes and compression on HDF5 files.

Francesc Alted
2007-11-25
"""

import os, math, subprocess
from time import time
import numpy
import tables

# Size of dataset
#N, M = 512, 2**16     # 256 MB
#N, M = 512, 2**18     # 1 GB
N, M = 512, 2**19     # 2 GB
#N, M = 2000, 1000000  # 15 GB
#N, M = 4000, 1000000  # 30 GB
datom = tables.Float64Atom()   # elements are double precision


def quantize(data, least_significant_digit):
    """quantize data to improve compression.

    data is quantized using around(scale*data)/scale, where scale is
    2**bits, and bits is determined from the least_significant_digit.
    For example, if least_significant_digit=1, bits will be 4."""

    precision = 10.**-least_significant_digit
    exp = math.log(precision,10)
    if exp < 0:
        exp = int(math.floor(exp))
    else:
        exp = int(math.ceil(exp))
    bits = math.ceil(math.log(10.**-exp,2))
    scale = 2.**bits
    return numpy.around(scale*data)/scale


def get_db_size(filename):
    sout = subprocess.Popen("ls -sh %s" % filename, shell=True,
                            stdout=subprocess.PIPE).stdout
    line = [l for l in sout][0]
    return line.split()[0]


def bench(chunkshape, filters):
    numpy.random.seed(1)   # to have reproducible results
    #filename = '/oldScr/ivilata/pytables/data.nobackup/test.h5'
    filename = '/scratch2/faltet/data.nobackup/test.h5'
    #filename = '/scratch1/faltet/test.h5'

    f = tables.openFile(filename, 'w')
    e = f.createEArray(f.root, 'earray', datom, shape=(0, M),
                       filters = filters,
                       chunkshape = chunkshape)
    # Fill the array
    t1 = time()
    for i in xrange(N):
        #e.append([numpy.random.rand(M)])  # use this for less compressibility
        e.append([quantize(numpy.random.rand(M), 6)])
    #os.system("sync")   # needs to be root
    print "Creation time:", round(time()-t1, 3),
    filesize = get_db_size(filename)
    filesize_bytes = os.stat(filename)[6]
    print "\t\tFile size: %d -- (%s)" % (filesize_bytes, filesize)

    # Read in sequential mode:
    e = f.root.earray
    t1 = time()
    # Flush everything to disk and flush caches
    #os.system("sync; echo 1 > /proc/sys/vm/drop_caches")  # root
    for row in e:
        t = row
    print "Sequential read time:", round(time()-t1, 3),

    f.close()
    return   # NOTE: early return skips the random-read benchmark below;
             # comment out these two lines to run it

    # Read in random mode:
    i_index = numpy.random.randint(0, N, 128)
    j_index = numpy.random.randint(0, M, 256)
    # Flush everything to disk and flush caches
    #os.system("sync; echo 1 > /proc/sys/vm/drop_caches")  # root

    t1 = time()
    for i in i_index:
        for j in j_index:
            t = e[i,j]
    print "\tRandom read time:", round(time()-t1, 3)

    f.close()

# Benchmark with different chunksizes and filters
#for complib in (None, 'zlib', 'lzo', 'blosc'):  # needs 'lzo' and 'blosc'
for complib in (None, 'zlib'):
    if complib:
        filters = tables.Filters(complevel=5, complib=complib)
    else:
        filters = tables.Filters(complevel=0)
    print "8<--"*20, "\nFilters:", filters, "\n"+"-"*80
    for ecs in range(10, 24):
        chunksize = 2**ecs                    # from 1 KB (2**10) up to 8 MB (2**23)
        chunk1 = 1
        chunk2 = chunksize / datom.itemsize   # number of float64 elements in the chunk
        if chunk2 > M:
            # chunk is wider than a row: make it span several complete rows
            chunk1 = chunk2 / M
            chunk2 = M
        chunkshape = (chunk1, chunk2)
        cs_str = str(chunksize / 1024) + " KB"
        print "***** Chunksize:", cs_str, "/ Chunkshape:", chunkshape, "*****"
        bench(chunkshape, filters)