The problem in brief: Why does it take 20-40 seconds to extract a table column 
of 200000 integers? The code snippet in question is:

with pt.openFile(filename) as f:
    vlarrayrow = f.root.gp.cols.vlarrayrow[:]


== Background ==
I have an HDF5 file (let's call it file "P") containing parameter scenarios for 
simulations. It is about 1 GB in size, with about 200000 records in each of 
four tables. Another HDF5 file (file "O") holds 40 GB of simulation output, but 
not in the same order as in file P. The output row number for each parameter 
scenario is stored in a column of a table in file P. However, extracting just 
that column is disappointingly slow. I have profiled the following example 
script; the output is below [1].

== prof_pt.py ==
"""Run with:    python -m cProfile prof_pt.py"""
from __future__ import with_statement
import tables as pt
filename = "myfile.py.h5"
with pt.openFile(filename) as f:
    vlarrayrow = f.root.gp.cols.vlarrayrow[:]

A lot of the time seems to be in "method '_fillCol' of 'tableExtension.Row' 
objects". The other big entries, file.py and __init__.py, are <module> frames, 
so they look like one-time import overhead rather than part of the read itself.
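For context, the role of that row-number column is just an indirection between 
the two files. With hypothetical stand-in arrays (none of this is the actual 
data), it amounts to NumPy fancy indexing:

```python
import numpy as np

# Stand-ins (hypothetical data, not the real files):
# "rownum" plays the role of the vlarrayrow column in file P -- for each
# parameter scenario, the row number of its result in file O.
rownum = np.array([3, 0, 2, 1])
# "output" stands in for (one column of) the simulation output in file O.
output = np.array([10, 11, 12, 13])

# Fancy indexing reorders the output to match the scenario order in P.
output_in_scenario_order = output[rownum]
print(output_in_scenario_order.tolist())  # [13, 10, 12, 11]
```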


== Reproducible example ==
I've made an example script to replicate the problem without reference to the 
details of my actual use case. I hypothesized that cols.x[:] was slow because 
it had to skip (or perhaps even read, *shudder*) the remainder of every record 
in the table. However, it seems there is a critical file size at which the 
problem manifests.

The example makes a table with "nrow" records of the following description, 
where "othersize" sets the amount of other stuff in the table.

class result(pt.IsDescription):
    otherdata = pt.IntCol(shape=othersize)
    i = pt.IntCol()

Then I time the statement:

i = f.root.t.cols.i[:]
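Before looking at the timings, a back-of-envelope calculation of the table 
sizes involved (assuming IntCol defaults to 4-byte integers) shows how small 
the requested column is relative to the whole table:

```python
# Approximate on-disk table size for a given (nrow, othersize),
# assuming 4-byte integers (the IntCol default itemsize).
def table_bytes(nrow, othersize):
    rowsize = 4 * (othersize + 1)  # "otherdata" ints plus the single "i" int
    return nrow * rowsize

for othersize in (2000, 4000):
    total_mb = table_bytes(50000, othersize) / 1e6
    wanted_mb = 50000 * 4 / 1e6  # the "i" column alone is only 0.2 MB
    print("othersize=%d: table ~%.0f MB, wanted column %.1f MB"
          % (othersize, total_mb, wanted_mb))
```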

The code below [2] produces the following output:

-bash-3.2$ python profcols.py
...
INFO:root:4.74983906746 seconds, (nrow,othersize=50000,2000)
INFO:root:0.610723972321 seconds, (nrow,othersize=52000,2000)
INFO:root:0.568428993225 seconds, (nrow,othersize=54000,2000)
INFO:root:4.76227211952 seconds, (nrow,othersize=56000,2000)
INFO:root:5.63922095299 seconds, (nrow,othersize=58000,2000)
INFO:root:20.6964008808 seconds, (nrow,othersize=50000,4000)
INFO:root:19.6613388062 seconds, (nrow,othersize=52000,4000)
INFO:root:18.8091700077 seconds, (nrow,othersize=54000,4000)
... (I stopped it here; the file was getting to 800 MB)

For comparable table sizes, getting "i" sometimes takes about half a second 
and sometimes ten times as long. Doubling the record size pushes the access 
time up to forty times as long.
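As a rough consistency check (my own back-of-envelope, again assuming 4-byte 
ints), the slow timings are about what a full scan of the table would cost at 
ordinary disk speeds, as if cols.i[:] were reading every whole row rather than 
just the "i" field:

```python
# Implied throughput if the 20.7 s case (nrow=50000, othersize=4000)
# had to read the entire table, not just the "i" column.
total_bytes = 50000 * (4000 + 1) * 4  # assuming 4-byte ints
mb_per_s = total_bytes / 1e6 / 20.7
print("~%.0f MB scanned at ~%.0f MB/s" % (total_bytes / 1e6, mb_per_s))
```

That ~40 MB/s is a plausible sequential-disk rate, which would fit the 
hypothesis that the slow cases read (or at least seek through) every record.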

(If you want to replicate this, you may have to fiddle a bit with nrow and 
othersize to find the threshold on your system; at least I did.)

Thank you in advance for any advice.

Best regards,
Jon Olav Vik


[1]
== Profile output by cumulative time (excerpt) ==
ncalls  tottime percall cumtime percall filename:lineno(function)
1       0       0       108.265 108.265 <string>:1(<module>)
1       0.029   0.029   108.265 108.265 {execfile}
1       0.277   0.277   108.236 108.236 prof_pt.py:1(<module>)
1       5.746   5.746   58.949  58.949  __init__.py:53(<module>)
1       0       0       48.686  48.686  table.py:2822(__getitem__)
1       0       0       48.685  48.685  table.py:1439(_read)
1       0       0       48.685  48.685  table.py:1496(read)
1       48.565  48.565  48.685  48.685  {method '_fillCol' of 'tableExtension.Row' objects}
1       3.954   3.954   29.907  29.907  __init__.py:63(<module>)
2       2.413   1.207   18.816  9.408   __init__.py:1(<module>)
1       9.371   9.371   17.339  17.339  file.py:35(<module>)
5       4.795   0.959   16.513  3.303   __init__.py:2(<module>)
1       0.022   0.022   15.749  15.749  add_newdocs.py:9(<module>)
1       0.486   0.486   10.239  10.239  type_check.py:3(<module>)

== Profile output by total time (excerpt) ==
ncalls  tottime percall cumtime percall filename:lineno(function)
1       48.6    48.565  48.685  48.685  {method '_fillCol' of 'tableExtension.Row' objects}
1       9.4     9.371   17.339  17.339  file.py:35(<module>)
1       5.7     5.746   58.949  58.949  __init__.py:53(<module>)
5       4.8     0.959   16.513  3.303   __init__.py:2(<module>)
1       4.0     3.954   29.907  29.907  __init__.py:63(<module>)
1       3.7     3.681   3.689   3.689   Numeric.py:85(<module>)
2       2.6     1.291   2.589   1.294   __init__.py:8(<module>)
2       2.4     1.207   18.816  9.408   __init__.py:1(<module>)
1       2.0     1.983   1.983   1.983   weakref.py:6(<module>)
1       1.9     1.939   1.939   1.939   arrayprint.py:4(<module>)
1       1.7     1.664   3.679   3.679   numpytest.py:1(<module>)
1       1.5     1.528   3.469   3.469   numeric.py:1(<module>)
1       1.5     1.45    1.45    1.45    linalg.py:10(<module>)
1       1.4     1.424   2.836   2.836   urllib2.py:74(<module>)
1       1.3     1.312   2.78    2.78    utils.py:1(<module>)
1       1.2     1.202   1.204   1.204   expressions.py:1(<module>)
1       1.1     1.147   1.15    1.15    group.py:30(<module>)
1       1.1     1.111   4.801   4.801   flavor.py:43(<module>)
1       1.0     1.027   1.115   1.115   httplib.py:67(<module>)
1       1.0     1.025   1.717   1.717   utils.py:3(<module>)

[2]
== profcols.py ==
from __future__ import with_statement
import tables as pt
import numpy as np
import os
import time
import logging
logging.basicConfig(level=logging.INFO) # change to logging.DEBUG for details 

filename = "profcols.h5"
chunksize=25000

def descr(othersize):
    class result(pt.IsDescription):
        otherdata = pt.IntCol(shape=othersize)
        i = pt.IntCol()
    return result

def createfile(nrow, othersize):
    if os.path.exists(filename):
        os.remove(filename)
    with pt.openFile(filename, "w") as f:
        t = f.createTable(f.root, "t", descr(othersize), expectedrows=nrow)
        d, m = divmod(nrow, chunksize)
        for i in range(d):
            logging.debug("Appending %s records" % chunksize)
            t.append(np.ones(chunksize, t._v_dtype))
        logging.debug("Appending %s records" % m)
        t.append(np.ones(m, t._v_dtype))

def timecols(nrow=None, othersize=None):
    if nrow:
        createfile(nrow, othersize)
    logging.debug("Accessing column (nrow,othersize=%s,%s)" % (nrow, othersize))
    t0 = time.time()
    with pt.openFile(filename) as f:
        i = f.root.t.cols.i[:]
    t1 = time.time()
    logging.info("%s seconds, (nrow,othersize=%s,%s)" 
        % (t1-t0, nrow, othersize))
    return t1-t0

print timecols(50000, 2000)

sec = [[timecols(nrow, othersize) for nrow in range(50000, 60000, 2000)]
    for othersize in range(2000, 6000, 2000)]

print sec



_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
