Unfortunately I cannot do that since it is company data. I wrote a simple script that queries a file twice:
import sys
import time
import tables

f = tables.openFile(sys.argv[1])

start = time.time()
data = f.root.table.readWhere('field1==2912')
print 'time: %.1f' % (time.time() - start)
print 'nr items: %i' % (len(data))

# run the same query a second time (note the leading space in the condition string)
start = time.time()
data = f.root.table.readWhere(' field1==2912')
print 'time: %.1f' % (time.time() - start)
print 'nr items: %i' % (len(data))

I then created the same file with lzo, blosc, and zlib compression, each with two chunkshapes ("large" means chunkshape = (3971,), while "small" means chunkshape = (248,)). I ran the script on each file twice, to detect any operating system file buffering. Results:

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_small.hdf5
time: 31.8
nr items: 20678
time: 5.9
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_small.hdf5
time: 5.8
nr items: 20678
time: 5.9
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_large.hdf5
time: 25.2
nr items: 20678
time: 16.2
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_large.hdf5
time: 16.0
nr items: 20678
time: 16.5
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_small.hdf5
time: 46.2
nr items: 20678
time: 4.2
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_small.hdf5
time: 4.4
nr items: 20678
time: 4.3
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_large.hdf5
time: 47.9
nr items: 20678
time: 5.3
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_large.hdf5
time: 5.0
nr items: 20678
time: 5.7
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_small.hdf5
time: 11.7
nr items: 20678
time: 10.3
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_small.hdf5
time: 10.3
nr items: 20678
time: 9.9
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_large.hdf5
time: 24.5
nr items: 20678
time: 24.4
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_large.hdf5
time: 19.7
nr items: 20678
time: 24.7
nr items: 20678

So the small chunkshape is generally better. Blosc is the slowest on the very first query, but the fastest on every query after that. Could this be an operating system file-caching issue? The blosc files are much larger, so perhaps that plays a role?
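For reference, a minimal sketch of how one of these test files can be built with the PyTables 2.x API. The paths, the source table name, and the 100000-row block size are made-up placeholders; only the level-1 compressor and the (248,) chunkshape correspond to the runs above, and passing the source table's description object to createTable is assumed to work (table.copy() with a filters argument would be an alternative):

import tables

# Hypothetical paths -- the real data is company-internal.
SRC = 'c:/Devel/data/bigdata.hdf5'
DST = 'c:/Devel/data/bigdata_blosc_small.hdf5'

fin = tables.openFile(SRC, 'r')
fout = tables.openFile(DST, 'w')
src = fin.root.table

# Level-1 blosc with shuffle, and an explicit (small) chunkshape instead of
# the one PyTables would pick from expectedrows.
filters = tables.Filters(complevel=1, complib='blosc', shuffle=True)
dst = fout.createTable(fout.root, 'table', src.description,
                       filters=filters, expectedrows=src.nrows,
                       chunkshape=(248,))

# Copy the rows in blocks so memory use stays bounded.
step = 100000
for start in xrange(0, src.nrows, step):
    dst.append(src.read(start, min(start + step, src.nrows)))
dst.flush()

fin.close()
fout.close()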
The files for lzo and zlib are 180-240 MB, while the files for blosc are 1.1 GB.

Koert

-----Original Message-----
From: Francesc Alted [mailto:fal...@pytables.org]
Sent: September 27 2010 13:05
To: pytables-users@lists.sourceforge.net
Subject: Re: [Pytables-users] trying blosc instead of zlib for compression, but very slow

On Monday 27 September 2010 17:50:55, Koert Kuipers wrote:
> Hi all,
>
> I have a table that looks like this:
>
> /table (Table(17801755,), shuffle, zlib(1)) ''
>   description := {
>   "field1": UInt32Col(shape=(), dflt=0, pos=0),
>   "field2": UInt32Col(shape=(), dflt=0, pos=1),
>   "field3": Float64Col(shape=(), dflt=0.0, pos=2),
>   "field4": Float32Col(shape=(), dflt=0.0, pos=3),
>   "field5": Float32Col(shape=(), dflt=0.0, pos=4),
>   "field6": Float32Col(shape=(), dflt=0.0, pos=5),
>   "field7": Float32Col(shape=(), dflt=0.0, pos=6),
>   "field8": Float32Col(shape=(), dflt=0.0, pos=7),
>   "field9": Float32Col(shape=(), dflt=0.0, pos=8),
>   "field10": Float32Col(shape=(), dflt=0.0, pos=9),
>   "field11": UInt16Col(shape=(), dflt=0, pos=10),
>   "field12": UInt16Col(shape=(), dflt=0, pos=11),
>   "field13": UInt16Col(shape=(), dflt=0, pos=12),
>   "field14": UInt16Col(shape=(), dflt=0, pos=13),
>   "field15": Float64Col(shape=(), dflt=0.0, pos=14),
>   "field16": Float32Col(shape=(), dflt=0.0, pos=15),
>   "field17": UInt16Col(shape=(), dflt=0, pos=16)}
>   byteorder := 'little'
>   chunkshape := (248,)
>
> When I run a query on it, this is the result:
>
> >>> start=time.time(); data=f.root.table.readWhere('field1==2912');
> >>> print time.time()-start
> 11.0780000687
> >>> len(data)
> 20678
>
> I wanted to speed up this sort of querying, so I created a new table
> with blosc compression and copied the data. My old table had
> expectedrows = 1000000, but since reality turned out to be a lot more
> data, I also updated expectedrows to 10000000.
>
> /table1 (Table(17801755,), shuffle, blosc(1)) ''
>   description := {
>   "field1": UInt32Col(shape=(), dflt=0, pos=0),
>   "field2": UInt32Col(shape=(), dflt=0, pos=1),
>   "field3": Float64Col(shape=(), dflt=0.0, pos=2),
>   "field4": Float32Col(shape=(), dflt=0.0, pos=3),
>   "field5": Float32Col(shape=(), dflt=0.0, pos=4),
>   "field6": Float32Col(shape=(), dflt=0.0, pos=5),
>   "field7": Float32Col(shape=(), dflt=0.0, pos=6),
>   "field8": Float32Col(shape=(), dflt=0.0, pos=7),
>   "field9": Float32Col(shape=(), dflt=0.0, pos=8),
>   "field10": Float32Col(shape=(), dflt=0.0, pos=9),
>   "field11": UInt16Col(shape=(), dflt=0, pos=10),
>   "field12": UInt16Col(shape=(), dflt=0, pos=11),
>   "field13": UInt16Col(shape=(), dflt=0, pos=12),
>   "field14": UInt16Col(shape=(), dflt=0, pos=13),
>   "field15": Float64Col(shape=(), dflt=0.0, pos=14),
>   "field16": Float32Col(shape=(), dflt=0.0, pos=15),
>   "field17": UInt16Col(shape=(), dflt=0, pos=16)}
>   byteorder := 'little'
>   chunkshape := (3971,)
>
> >>> start=time.time(); data=f.root.table1.readWhere('field1==2912');
> >>> print time.time()-start
> 115.51699996
> >>> len(data)
> 20678
>
> Not exactly what I expected! I am obviously doing something wrong.
> Any suggestions?
>
> Thanks, Koert

Certainly surprising. Can you put your datafile on a public place so that I
can experiment with it?

Thanks,

--
Francesc Alted
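Since the real datafile cannot be shared, a synthetic file with the same layout could serve as a stand-in for experiments. A rough sketch follows: the column definitions are copied from the table dump above, while the filename, row count, and value ranges are arbitrary placeholders (and Description._v_dtype is assumed to provide the compound dtype for building the blocks):

import numpy as np
import tables

# Same column layout as the /table dump above; values are random stand-ins.
class Record(tables.IsDescription):
    field1  = tables.UInt32Col(pos=0)
    field2  = tables.UInt32Col(pos=1)
    field3  = tables.Float64Col(pos=2)
    field4  = tables.Float32Col(pos=3)
    field5  = tables.Float32Col(pos=4)
    field6  = tables.Float32Col(pos=5)
    field7  = tables.Float32Col(pos=6)
    field8  = tables.Float32Col(pos=7)
    field9  = tables.Float32Col(pos=8)
    field10 = tables.Float32Col(pos=9)
    field11 = tables.UInt16Col(pos=10)
    field12 = tables.UInt16Col(pos=11)
    field13 = tables.UInt16Col(pos=12)
    field14 = tables.UInt16Col(pos=13)
    field15 = tables.Float64Col(pos=14)
    field16 = tables.Float32Col(pos=15)
    field17 = tables.UInt16Col(pos=16)

NROWS = 2000000  # smaller than the real 17801755 rows, just to keep it shareable

f = tables.openFile('synthetic.hdf5', 'w')
t = f.createTable(f.root, 'table', Record,
                  filters=tables.Filters(complevel=1, complib='zlib', shuffle=True),
                  expectedrows=NROWS)

# Append in 100000-row blocks; field1 gets values in roughly the range of the
# query key, the other columns are left at their zero defaults.
step = 100000
for start in xrange(0, NROWS, step):
    block = np.zeros(step, dtype=t.description._v_dtype)
    block['field1'] = np.random.randint(0, 5000, step)
    block['field3'] = np.random.rand(step)
    t.append(block)
t.flush()

# Same kind of query as in the thread.
print len(t.readWhere('field1==2912'))
f.close()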