On Tuesday 12 May 2009 12:29:04 Armando Serrano Lombillo wrote:
> Compression: zlib, level 1.
> Size: 150 MB (compressed) but it could be even bigger, or it could be less
> than 1 MB. Anyway, even with small files, I find it slower than I would
> expect.
> Available memory: depends. I am now running it with 512 MB of RAM.
> Expectedrows: no, I didn't know about it.
> Other information: I first create the table, save the file and close it. At
> the time of creating the table I can't know how many rows there will be. I
> then reopen the file and extract the unique values.
> I'm running it on Windows XP, Python 2.5.
Well, 150 MB is not really a big deal for PyTables. I think the problem is
more the way you are accessing the columns. The following is much more
efficient:
# Build a dictionary for getting the different values per column
uniqvals = dict((name, set()) for name in t.colnames)
# Get the unique values for every column
for row in t:
    for column in uniqvals:
        uniqvals[column].add(row[column])
Here you have some timings for a really small table (400 KB):
Creating table with 10000 rows and 5 columns...
Getting unique values for each column (method 1)...
column: col0, unique values: set([0])
column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col2, unique values: set([0, 8, 2, 4, 6])
column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col4, unique values: set([0, 8, 2, 4, 6])
Time for finding unique values (method 1): 4.928
Getting unique values for each column (method 2)...
column: col0, unique values: set([0])
column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col2, unique values: set([0, 8, 2, 4, 6])
column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col4, unique values: set([0, 8, 2, 4, 6])
Time for finding unique values (method 2): 0.017
where method 1 is what you are using and method 2 is the one above. As you
can see, the speed-up is more than 250x: iterating the table row by row makes
a single buffered pass over the data, while pulling each column out separately
pays the full access overhead once per column.
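If the whole table fits in memory, you could also avoid the per-row Python
loop entirely with NumPy. A minimal sketch (just an idea to try, assuming the
table can be read in one shot; use t.read(start, stop) to work in slices
otherwise):

import numpy as np

def tuniq_numpy(t):
    # Read the whole table as a NumPy structured array in a single pass
    data = t.read()
    # np.unique on each field gives the distinct values per column
    return dict((name, set(np.unique(data[name]))) for name in t.colnames)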
Also, using a dictionary instead of a set could be worth a try:
# Build a dictionary for getting the different values per column
uniqvals = dict((name, {}) for name in t.colnames)
# Get the unique values for every column
for row in t:
    for column in uniqvals:
        uniqvals[column][row[column]] = None
This method (let's call it method 3) is a little faster:
Creating table with 1000000 rows and 10 columns...
Getting unique values for each column (method 2)...
Time for finding unique values (method 2): 2.563
Getting unique values for each column (method 3)...
Time for finding unique values (method 3): 2.032
This last run used a 40 MB table, so this operation can be done at around
20 MB/s (40 MB / 2.03 s), which is pretty good.
I'm attaching my small benchmark in case you want to experiment a little
more with different compressors or other parameters.
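For instance, you could plug in your zlib setup, or uncomment the expectedrows
hint so PyTables can choose a better chunk size; a minimal variation of the
createtable() call in the script (both are standard PyTables parameters):

t = f.createTable(f.root, 'table', tdesc,
                  expectedrows=NROWS,  # hint for choosing a sensible chunk size
                  filters=tb.Filters(complib='zlib', complevel=1))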
Hope this helps,
--
Francesc Alted
import tables as tb
from time import time

FNAME = "/tmp/prova.h5"
NROWS = int(1e6)
NCOLS = 10
MAXDIFF = 10

def createtable(filename):
    # Build a table descriptor as a dictionary
    tdesc = {}
    for i in range(NCOLS):
        tdesc["col%d" % i] = tb.IntCol()
    f = tb.openFile(filename, "w")
    t = f.createTable(f.root, 'table', tdesc,  #expectedrows=NROWS,
                      filters=tb.Filters(complib='lzo', complevel=1))
    # Fill the table with values
    row = t.row
    for j in range(NROWS):
        for i in range(NCOLS):
            row["col%d" % i] = j*i % MAXDIFF  # MAXDIFF different values
        row.append()
    f.close()

def tuniq1(filename):
    f = tb.openFile(filename, "r")
    t = f.root.table
    # Get the different values per column in one shot
    uniqvals = dict((name, set(t.cols._f_col(name))) for name in t.colnames)
    f.close()
    return uniqvals

def tuniq2(filename):
    f = tb.openFile(filename, "r")
    t = f.root.table
    # Build a dictionary for getting the different values per column
    uniqvals = dict((name, set()) for name in t.colnames)
    # Get the unique values for every column
    for row in t:
        for column in uniqvals:
            uniqvals[column].add(row[column])
    f.close()
    return uniqvals

def tuniq3(filename):
    f = tb.openFile(filename, "r")
    t = f.root.table
    # Build a dictionary for getting the different values per column
    uniqvals = dict((name, {}) for name in t.colnames)
    # Get the unique values for every column
    for row in t:
        for column in uniqvals:
            uniqvals[column][row[column]] = None
    f.close()
    return uniqvals

def print_uniq(uniqvals):
    cols = uniqvals.keys()
    cols.sort()
    for column in cols:
        print "column: %s, unique values: %s" % (column, uniqvals[column])

if __name__ == '__main__':
    print "Creating table with %s rows and %s columns..." % (NROWS, NCOLS)
    createtable(FNAME)
    #print "Getting unique values for each column (method 1)..."
    #t0 = time()
    #uniqvals = tuniq1(FNAME)
    #t1 = time() - t0
    #print_uniq(uniqvals)
    #print "Time for finding unique values (method 1):", round(t1, 3)
    print "Getting unique values for each column (method 2)..."
    t0 = time()
    uniqvals = tuniq2(FNAME)
    t1 = time() - t0
    #print_uniq(uniqvals)
    print "Time for finding unique values (method 2):", round(t1, 3)
    print "Getting unique values for each column (method 3)..."
    t0 = time()
    uniqvals = tuniq3(FNAME)
    t1 = time() - t0
    #print_uniq(uniqvals)
    print "Time for finding unique values (method 3):", round(t1, 3)