Thanks again for providing PyTables, which is a big help in my research. 
I am now dealing with files of tens of gigabytes, which would probably be 
unthinkable to manage without PyTables. But using such large files 
brings some performance issues.

I have a 26 GiB PyTables file which has a hierarchy that looks somewhat 
like this:

root
. 256 Groups
.. approx. 64 Arrays (16328 arrays total)
... 2000 x 105 float64s

Before any caching takes place, File.walkNodes(classname="Array") is 
very slow, taking more than six minutes to yield the first result. This 
is despite the fact that I can get a result from File.getNode() or 
File.walkNodes(group, classname="Array") quickly. I wrote a quick script 
to benchmark this. The script and cProfile results are below.

If I may hazard a guess before delving into the source too much, it 
appears that File.walkNodes() is slow because it is a breadth-first 
iteration. So it loads each of the top-level Groups into memory to check 
whether they are Arrays before moving on to Nodes lower in the hierarchy. 
Is this right?

If so, is there a way to speed this up? It seems to me that it shouldn't 
be necessary to load all of these Groups into memory as you know they 
are not Arrays and therefore don't match the classname I specified. 
Could some additional lazy loading be added?

Additionally, what would you think about allowing the user to specify a 
depth-first iteration in the walk functions? That is how I will work 
around the problem unless you think a fix to this issue will be 
forthcoming very soon.
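
A depth-first walk of the kind proposed above could look roughly like the 
sketch below. It is not PyTables code: walk_depth_first, get_children and 
is_match are hypothetical names, and a nested dict stands in for the file 
so that the example runs on its own. The point is only that a group's 
children are listed when the walk reaches that group, so the first match 
arrives after listing one branch rather than every top-level group.

```python
def walk_depth_first(node, get_children, is_match):
    """Lazily yield matching nodes in depth-first (pre-order) order."""
    stack = [node]
    while stack:
        current = stack.pop()
        if is_match(current):
            yield current
        # Children are listed only when the walk reaches their parent.
        children = list(get_children(current))
        # Reverse so the first child is popped (visited) first.
        stack.extend(reversed(children))


# Stand-in hierarchy: dicts play the role of Groups, lists the role of Arrays.
tree = {
    "_00": {"a": [1.0], "b": [2.0]},
    "_01": {"c": [3.0]},
}
groups_listed = []  # one entry per "group" whose children were listed


def get_children(node):
    if isinstance(node, dict):
        groups_listed.append(1)
        return node.values()
    return []  # leaves have no children


def is_array(node):
    return isinstance(node, list)


walker = walk_depth_first(tree, get_children, is_array)
first = next(walker)  # only the root and "_00" have been listed so far
```

With a real file, get_children would presumably return a Group's child 
groups and leaves (for instance from its _v_groups and _v_leaves 
attributes); that part is an assumption and is untested here.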

====

The script:

"""
import cProfile, tables

big = tables.openFile("big.h5")

cProfile.run('big.getNode("/_00", "ENST00000260061")', "leaf.prof")
cProfile.run('big.walkNodes("/_38", classname="Array").next()', "group.prof")
cProfile.run('big.walkNodes("/", classname="Array").next()', "root.prof")

big.close()
"""

The results:

"""
$ for FILE in *.prof; do python -c "import pstats; 
pstats.Stats(\"$FILE\").strip_dirs().sort_stats('time').print_stats(5)"; 
done
Sun Sep 23 22:47:11 2007    group.prof

          792 function calls (781 primitive calls) in 1.641 CPU seconds

    Ordered by: internal time
    List reduced from 107 to 5 due to restriction <5>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    1.636    1.636    1.636    1.636 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
         1    0.001    0.001    1.638    1.638 group.py:389(_g_addChildrenNames)
        56    0.000    0.000    0.001    0.000 file.py:563(_ptNameFromH5Name)
       112    0.000    0.000    0.000    0.000 proxydict.py:33(__setitem__)
        56    0.000    0.000    0.000    0.000 path.py:166(isVisibleName)


Sun Sep 23 22:47:09 2007    leaf.prof

          676 function calls (665 primitive calls) in 1.307 CPU seconds

    Ordered by: internal time
    List reduced from 95 to 5 due to restriction <5>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    1.303    1.303    1.303    1.303 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
         1    0.001    0.001    1.304    1.304 group.py:389(_g_addChildrenNames)
        24    0.001    0.000    0.001    0.000 group.py:883(__setattr__)
        46    0.000    0.000    0.000    0.000 file.py:563(_ptNameFromH5Name)
        92    0.000    0.000    0.000    0.000 proxydict.py:33(__setitem__)


Sun Sep 23 22:53:25 2007    root.prof

          165265 function calls (164492 primitive calls) in 374.321 CPU seconds

    Ordered by: internal time
    List reduced from 124 to 5 due to restriction <5>

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       254  373.471    1.470  373.471    1.470 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
       254    0.187    0.001  374.042    1.473 group.py:389(_g_addChildrenNames)
     16226    0.098    0.000    0.157    0.000 file.py:563(_ptNameFromH5Name)
     32452    0.095    0.000    0.095    0.000 proxydict.py:33(__setitem__)
     16226    0.059    0.000    0.099    0.000 path.py:166(isVisibleName)
"""
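
As a quick sanity check on the root.prof figures, the arithmetic below 
uses only numbers copied from that listing. It suggests that nearly all 
of the time goes into the 254 calls to _g_listGroup, at about the same 
per-call cost as the single call in the other two profiles. The variable 
names are just for illustration.

```python
# Figures copied from the root.prof listing above.
ncalls = 254        # calls to _g_listGroup
tottime = 373.471   # seconds spent inside _g_listGroup
total = 374.321     # total CPU seconds for the profiled call

per_call = tottime / ncalls  # seconds per group listing
share = tottime / total      # fraction of the run spent listing groups

print("%.3f s per call, %.1f%% of total" % (per_call, 100 * share))
```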


_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
