Unfortunately I cannot do that since it is company data. I wrote a simple script that queries a file twice:
import sys
import time
import tables

f = tables.openFile(sys.argv[1])

start = time.time()
data = f.root.table.readWhere('field1==2912')
print 'time: %.1f' % (time.time() - start)
print 'nr items: %i' % (len(data))

# run the same query a second time (note the leading space in the condition string)
start = time.time()
data = f.root.table.readWhere(' field1==2912')
print 'time: %.1f' % (time.time() - start)
print 'nr items: %i' % (len(data))

I then created the same file with lzo, blosc, and zlib compression, each with two chunkshapes ("large" means chunkshape = (3971,), while "small" means chunkshape = (248,)). I ran the script on each file twice, to detect any operating system file buffering. Results:

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_small.hdf5
time: 31.8
nr items: 20678
time: 5.9
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_small.hdf5
time: 5.8
nr items: 20678
time: 5.9
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_large.hdf5
time: 25.2
nr items: 20678
time: 16.2
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_large.hdf5
time: 16.0
nr items: 20678
time: 16.5
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_small.hdf5
time: 46.2
nr items: 20678
time: 4.2
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_small.hdf5
time: 4.4
nr items: 20678
time: 4.3
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_large.hdf5
time: 47.9
nr items: 20678
time: 5.3
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_large.hdf5
time: 5.0
nr items: 20678
time: 5.7
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_small.hdf5
time: 11.7
nr items: 20678
time: 10.3
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_small.hdf5
time: 10.3
nr items: 20678
time: 9.9
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_large.hdf5
time: 24.5
nr items: 20678
time: 24.4
nr items: 20678

C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_large.hdf5
time: 19.7
nr items: 20678
time: 24.7
nr items: 20678

So the small chunkshape is generally better. Blosc is the slowest on the very first query, but the fastest on every query after that. Could this be an operating system file-caching issue? The blosc files are much larger, so perhaps that plays a role?
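For reference, a minimal sketch of how one of these test files can be built with the PyTables 2.x API. The paths, the source table name, and the 100000-row block size are made-up placeholders; only the level-1 compressor and the (248,) chunkshape correspond to the runs above, and passing the source table's description object to createTable is assumed to work (table.copy() with a filters argument would be an alternative):

import tables

# Hypothetical paths -- the real data is company-internal.
SRC = 'c:/Devel/data/bigdata.hdf5'
DST = 'c:/Devel/data/bigdata_blosc_small.hdf5'

fin = tables.openFile(SRC, 'r')
fout = tables.openFile(DST, 'w')
src = fin.root.table

# Level-1 blosc with shuffle, and an explicit (small) chunkshape instead of
# the one PyTables would pick from expectedrows.
filters = tables.Filters(complevel=1, complib='blosc', shuffle=True)
dst = fout.createTable(fout.root, 'table', src.description,
                       filters=filters, expectedrows=src.nrows,
                       chunkshape=(248,))

# Copy the rows in blocks so memory use stays bounded.
step = 100000
for start in xrange(0, src.nrows, step):
    dst.append(src.read(start, min(start + step, src.nrows)))
dst.flush()

fin.close()
fout.close()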
The files for lzo and zlib are 180-240 MB, while the files for blosc are 1.1 GB.

Koert

-----Original Message-----
From: Francesc Alted [mailto:fal...@pytables.org]
Sent: September 27 2010 13:05
To: pytables-users@lists.sourceforge.net
Subject: Re: [Pytables-users] trying blosc instead of zlib for compression, but very slow

On Monday 27 September 2010 17:50:55, Koert Kuipers wrote:
> Hi all,
>
> I have a table that looks like this:
>
> /table (Table(17801755,), shuffle, zlib(1)) ''
>   description := {
>   "field1": UInt32Col(shape=(), dflt=0, pos=0),
>   "field2": UInt32Col(shape=(), dflt=0, pos=1),
>   "field3": Float64Col(shape=(), dflt=0.0, pos=2),
>   "field4": Float32Col(shape=(), dflt=0.0, pos=3),
>   "field5": Float32Col(shape=(), dflt=0.0, pos=4),
>   "field6": Float32Col(shape=(), dflt=0.0, pos=5),
>   "field7": Float32Col(shape=(), dflt=0.0, pos=6),
>   "field8": Float32Col(shape=(), dflt=0.0, pos=7),
>   "field9": Float32Col(shape=(), dflt=0.0, pos=8),
>   "field10": Float32Col(shape=(), dflt=0.0, pos=9),
>   "field11": UInt16Col(shape=(), dflt=0, pos=10),
>   "field12": UInt16Col(shape=(), dflt=0, pos=11),
>   "field13": UInt16Col(shape=(), dflt=0, pos=12),
>   "field14": UInt16Col(shape=(), dflt=0, pos=13),
>   "field15": Float64Col(shape=(), dflt=0.0, pos=14),
>   "field16": Float32Col(shape=(), dflt=0.0, pos=15),
>   "field17": UInt16Col(shape=(), dflt=0, pos=16)}
>   byteorder := 'little'
>   chunkshape := (248,)
>
> When I run a query on it, this is the result:
>
> >>> start=time.time(); data=f.root.table.readWhere('field1==2912');
> >>> print time.time()-start
> 11.0780000687
> >>> len(data)
> 20678
>
> I wanted to speed up this sort of querying, so I created a new table
> with blosc compression and copied the data. My old table had
> expectedrows = 1000000, but since reality turned out to be a lot more
> data, I also updated expectedrows to 10000000.
>
> /table1 (Table(17801755,), shuffle, blosc(1)) ''
>   description := {
>   "field1": UInt32Col(shape=(), dflt=0, pos=0),
>   "field2": UInt32Col(shape=(), dflt=0, pos=1),
>   "field3": Float64Col(shape=(), dflt=0.0, pos=2),
>   "field4": Float32Col(shape=(), dflt=0.0, pos=3),
>   "field5": Float32Col(shape=(), dflt=0.0, pos=4),
>   "field6": Float32Col(shape=(), dflt=0.0, pos=5),
>   "field7": Float32Col(shape=(), dflt=0.0, pos=6),
>   "field8": Float32Col(shape=(), dflt=0.0, pos=7),
>   "field9": Float32Col(shape=(), dflt=0.0, pos=8),
>   "field10": Float32Col(shape=(), dflt=0.0, pos=9),
>   "field11": UInt16Col(shape=(), dflt=0, pos=10),
>   "field12": UInt16Col(shape=(), dflt=0, pos=11),
>   "field13": UInt16Col(shape=(), dflt=0, pos=12),
>   "field14": UInt16Col(shape=(), dflt=0, pos=13),
>   "field15": Float64Col(shape=(), dflt=0.0, pos=14),
>   "field16": Float32Col(shape=(), dflt=0.0, pos=15),
>   "field17": UInt16Col(shape=(), dflt=0, pos=16)}
>   byteorder := 'little'
>   chunkshape := (3971,)
>
> >>> start=time.time(); data=f.root.table1.readWhere('field1==2912');
> >>> print time.time()-start
> 115.51699996
> >>> len(data)
> 20678
>
> Not exactly what I expected! I am obviously doing something wrong.
> Any suggestions?
>
> Thanks, Koert

Certainly surprising. Can you put your datafile on a public place so that I
can experiment with it?

Thanks,

--
Francesc Alted
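Since the real datafile cannot be shared, a synthetic file with the same layout could serve as a stand-in for experiments. A rough sketch follows: the column definitions are copied from the table dump above, while the filename, row count, and value ranges are arbitrary placeholders (and Description._v_dtype is assumed to provide the compound dtype for building the blocks):

import numpy as np
import tables

# Same column layout as the /table dump above; values are random stand-ins.
class Record(tables.IsDescription):
    field1  = tables.UInt32Col(pos=0)
    field2  = tables.UInt32Col(pos=1)
    field3  = tables.Float64Col(pos=2)
    field4  = tables.Float32Col(pos=3)
    field5  = tables.Float32Col(pos=4)
    field6  = tables.Float32Col(pos=5)
    field7  = tables.Float32Col(pos=6)
    field8  = tables.Float32Col(pos=7)
    field9  = tables.Float32Col(pos=8)
    field10 = tables.Float32Col(pos=9)
    field11 = tables.UInt16Col(pos=10)
    field12 = tables.UInt16Col(pos=11)
    field13 = tables.UInt16Col(pos=12)
    field14 = tables.UInt16Col(pos=13)
    field15 = tables.Float64Col(pos=14)
    field16 = tables.Float32Col(pos=15)
    field17 = tables.UInt16Col(pos=16)

NROWS = 2000000  # smaller than the real 17801755 rows, just to keep it shareable

f = tables.openFile('synthetic.hdf5', 'w')
t = f.createTable(f.root, 'table', Record,
                  filters=tables.Filters(complevel=1, complib='zlib', shuffle=True),
                  expectedrows=NROWS)

# Append in 100000-row blocks; field1 gets values in roughly the range of the
# query key, the other columns are left at their zero defaults.
step = 100000
for start in xrange(0, NROWS, step):
    block = np.zeros(step, dtype=t.description._v_dtype)
    block['field1'] = np.random.randint(0, 5000, step)
    block['field3'] = np.random.rand(step)
    t.append(block)
t.flush()

# Same kind of query as in the thread.
print len(t.readWhere('field1==2912'))
f.close()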