The problem in brief: Why does it take 20-40 seconds to extract a table column of 200000 integers? The code snippet in question is:
with pt.openFile(filename) as f:
    vlarrayrow = f.root.gp.cols.vlarrayrow[:]

== Background ==

I have an HDF5 file (let's call it file "P") containing parameter scenarios for simulations. It is about 1 GB in size, with about 200000 records in each of four tables. Another HDF5 file (file "O") holds 40 GB of simulation output, but not in the same order as in file P. The output row number for each parameter scenario is stored in a column of a table in file P. However, extracting just that column is disappointingly slow. I have profiled the following example script; the output is below [1].

== prof_pt.py ==

"""Run with: python -m cProfile prof_pt.py"""
from __future__ import with_statement
import tables as pt

filename = "myfile.py.h5"
with pt.openFile(filename) as f:
    vlarrayrow = f.root.gp.cols.vlarrayrow[:]

Most of the time seems to be spent in "method '_fillCol' of 'tableExtension.Row' objects", file.py and __init__.py (I don't know what the latter two are).

== Reproducible example ==

I've made an example script to replicate the problem without reference to the details of my actual use case. I hypothesized that cols.x[:] was slow because it had to skip (or perhaps even read, *shudder*) the remainder of every record in the table. However, it seems there is a critical file size at which the problem manifests. The example makes a table with "nrow" records of the following description, where "othersize" sets the amount of other data in the table:

class result(pt.IsDescription):
    otherdata = pt.IntCol(shape=othersize)
    i = pt.IntCol()

Then I time the statement:

i = f.root.t.cols.i[:]

The code below [2] produces the following output:

-bash-3.2$ python profcols.py
...
INFO:root:4.74983906746 seconds, (nrow,othersize=50000,2000)
INFO:root:0.610723972321 seconds, (nrow,othersize=52000,2000)
INFO:root:0.568428993225 seconds, (nrow,othersize=54000,2000)
INFO:root:4.76227211952 seconds, (nrow,othersize=56000,2000)
INFO:root:5.63922095299 seconds, (nrow,othersize=58000,2000)
INFO:root:20.6964008808 seconds, (nrow,othersize=50000,4000)
INFO:root:19.6613388062 seconds, (nrow,othersize=52000,4000)
INFO:root:18.8091700077 seconds, (nrow,othersize=54000,4000)
...

(I stopped it here; the file was getting to 800 MB.)

For comparable sizes of the table, getting "i" sometimes takes about half a second, sometimes ten times as much. Doubling the record size brings the access time up to forty times as much. (If you want to replicate this, you may have to fiddle a bit with nrow and othersize to find the threshold on your system; at least I did.)

Thank you in advance for any advice.

Best regards,
Jon Olav Vik

[1]

== Profile output by cumulative time (excerpt) ==

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1        0        0  108.265  108.265 <string>:1(<module>)
     1    0.029    0.029  108.265  108.265 {execfile}
     1    0.277    0.277  108.236  108.236 prof_pt.py:1(<module>)
     1    5.746    5.746   58.949   58.949 __init__.py:53(<module>)
     1        0        0   48.686   48.686 table.py:2822(__getitem__)
     1        0        0   48.685   48.685 table.py:1439(_read)
     1        0        0   48.685   48.685 table.py:1496(read)
     1   48.565   48.565   48.685   48.685 {method '_fillCol' of 'tableExtension.Row' objects}
     1    3.954    3.954   29.907   29.907 __init__.py:63(<module>)
     2    2.413    1.207   18.816    9.408 __init__.py:1(<module>)
     1    9.371    9.371   17.339   17.339 file.py:35(<module>)
     5    4.795    0.959   16.513    3.303 __init__.py:2(<module>)
     1    0.022    0.022   15.749   15.749 add_newdocs.py:9(<module>)
     1    0.486    0.486   10.239   10.239 type_check.py:3(<module>)

== Profile output by total time (excerpt) ==

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1     48.6   48.565   48.685   48.685 {method '_fillCol' of 'tableExtension.Row' objects}
     1      9.4    9.371   17.339   17.339 file.py:35(<module>)
     1      5.7    5.746   58.949   58.949 __init__.py:53(<module>)
     5      4.8    0.959   16.513    3.303 __init__.py:2(<module>)
     1      4.0    3.954   29.907   29.907 __init__.py:63(<module>)
     1      3.7    3.681    3.689    3.689 Numeric.py:85(<module>)
     2      2.6    1.291    2.589    1.294 __init__.py:8(<module>)
     2      2.4    1.207   18.816    9.408 __init__.py:1(<module>)
     1      2.0    1.983    1.983    1.983 weakref.py:6(<module>)
     1      1.9    1.939    1.939    1.939 arrayprint.py:4(<module>)
     1      1.7    1.664    3.679    3.679 numpytest.py:1(<module>)
     1      1.5    1.528    3.469    3.469 numeric.py:1(<module>)
     1      1.5    1.45     1.45     1.45  linalg.py:10(<module>)
     1      1.4    1.424    2.836    2.836 urllib2.py:74(<module>)
     1      1.3    1.312    2.78     2.78  utils.py:1(<module>)
     1      1.2    1.202    1.204    1.204 expressions.py:1(<module>)
     1      1.1    1.147    1.15     1.15  group.py:30(<module>)
     1      1.1    1.111    4.801    4.801 flavor.py:43(<module>)
     1      1.0    1.027    1.115    1.115 httplib.py:67(<module>)
     1      1.0    1.025    1.717    1.717 utils.py:3(<module>)

[2]

== profcols.py ==

from __future__ import with_statement
import tables as pt
import numpy as np
import os
import time
import logging

logging.basicConfig(level=logging.INFO)  # change to logging.DEBUG for details

filename = "profcols.h5"
chunksize = 25000

def descr(othersize):
    class result(pt.IsDescription):
        otherdata = pt.IntCol(shape=othersize)
        i = pt.IntCol()
    return result

def createfile(nrow, othersize):
    if os.path.exists(filename):
        os.remove(filename)
    with pt.openFile(filename, "w") as f:
        t = f.createTable(f.root, "t", descr(othersize), expectedrows=nrow)
        d, m = divmod(nrow, chunksize)
        for i in range(d):
            logging.debug("Appending %s records" % chunksize)
            t.append(np.ones(chunksize, t._v_dtype))
        logging.debug("Appending %s records" % m)
        t.append(np.ones(m, t._v_dtype))

def timecols(nrow=None, othersize=None):
    if nrow:
        createfile(nrow, othersize)
    logging.debug("Accessing column (nrow,othersize=%s,%s)" % (nrow, othersize))
    t0 = time.time()
    with pt.openFile(filename) as f:
        i = f.root.t.cols.i[:]
    t1 = time.time()
    logging.info("%s seconds, (nrow,othersize=%s,%s)" % (t1 - t0, nrow, othersize))
    return t1 - t0
print timecols(50000, 2000)
sec = [[timecols(nrow, othersize)
        for nrow in range(50000, 60000, 2000)]
       for othersize in range(2000, 2500, 200)]
print sec

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
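As an aside, the access pattern hypothesized above (skipping the remainder of every record to pick out one column) can be sketched with plain Python file seeks. This is only an illustration of the I/O pattern, not of how HDF5 actually lays out or chunks the table; NROW, OTHERSIZE, and the flat fixed-record binary layout are invented for the sketch:

```python
import struct
import tempfile

# Invented parameters for illustration; a real HDF5 table is chunked
# and compressed, not a flat sequence of fixed-size records.
NROW = 1000                      # number of records
OTHERSIZE = 50                   # ints of "otherdata" per record
RECSIZE = (OTHERSIZE + 1) * 4    # bytes per record: otherdata + column "i"

# Write NROW fixed-size records; column "i" is the last 4 bytes of each.
tmp = tempfile.NamedTemporaryFile(delete=False)
for row in range(NROW):
    tmp.write(b"\x00" * (OTHERSIZE * 4))   # the bulky "otherdata" payload
    tmp.write(struct.pack("<i", row))      # the column we actually want
tmp.close()

def read_column_by_seeking(path):
    """Extract column "i" by seeking past the rest of every record."""
    out = []
    with open(path, "rb") as fh:
        for row in range(NROW):
            fh.seek(row * RECSIZE + OTHERSIZE * 4)  # skip otherdata
            out.append(struct.unpack("<i", fh.read(4))[0])
    return out

col = read_column_by_seeking(tmp.name)
assert col == list(range(NROW))
```

One seek-plus-tiny-read per record is exactly the kind of access that gets dramatically slower once the records no longer fit in whatever cache sits between the reader and the disk, which is consistent with the threshold effect seen in the timings above.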