Thanks again for providing PyTables, which is a big help in my research. I am now dealing with files of tens of gigabytes, which would probably be unthinkable to manage without PyTables. But files this large bring some performance issues.
I have a 26 GiB PyTables file which has a hierarchy that looks somewhat like this:

root
. 256 Groups
.. approx. 64 Arrays (16328 arrays total)
... 2000 x 105 float64s

Before any caching takes place, File.walkNodes(classname="Array") is very slow, taking more than six minutes to yield the first result. This is despite the fact that I can get a result from File.getNode() or File.walkNodes(group, classname="Array") quickly. I wrote a quick script to benchmark this. The script and cProfile results are below.

If I may hazard a guess before delving into the source too much, it appears that File.walkNodes() is slow because it is a breadth-first iteration. So it loads each of the top-level Groups into memory to check whether they are Arrays before moving on to Nodes lower in the hierarchy. Is this right?

If so, is there a way to speed this up? It seems to me that it shouldn't be necessary to load all of these Groups into memory, as you know they are not Arrays and therefore don't match the classname I specified. Could some additional lazy loading be added?

Additionally, what would you think about allowing the user to specify a depth-first iteration in the walk functions? That is how I will work around the problem unless you think a fix to this issue will be forthcoming very soon.
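To make the depth-first idea concrete, here is a toy model of the two iteration orders. It does not use PyTables at all: the nested dicts, make_tree(), load_group() and both walkers are hypothetical stand-ins, and the only assumption carried over from the profile is that loading a Group costs one expensive listing. Whether File.walkNodes() really proceeds level by level is still my guess, not something this sketch shows.

```python
# Toy model only: plain dicts stand in for Groups, None stands in for an Array.
# None of these names are PyTables API; they are hypothetical.

def make_tree(n_groups, n_arrays):
    """Build a root dict holding n_groups groups of n_arrays arrays each."""
    return {
        "g%03d" % i: {"a%03d" % j: None for j in range(n_arrays)}
        for i in range(n_groups)
    }

def load_group(node, counter):
    """Model the expensive step: loading a group lists its children once."""
    counter["loads"] += 1
    return sorted(node.items())

def walk_level_by_level(tree, counter):
    """Load every group on a level before yielding any array from that level."""
    level = [("", tree)]
    while level:
        loaded = [(path, load_group(node, counter)) for path, node in level]
        next_level = []
        for path, children in loaded:
            for name, child in children:
                if child is None:
                    yield path + "/" + name
                else:
                    next_level.append((path + "/" + name, child))
        level = next_level

def walk_depth_first(tree, counter):
    """Load a group only when about to descend into it."""
    stack = [("", tree)]
    while stack:
        path, node = stack.pop()
        subgroups = []
        for name, child in load_group(node, counter):
            if child is None:
                yield path + "/" + name
            else:
                subgroups.append((path + "/" + name, child))
        # Reverse so the first subgroup is popped (and visited) first.
        stack.extend(reversed(subgroups))

def loads_before_first(walker, tree):
    """Return the first array path and how many groups were loaded to get it."""
    counter = {"loads": 0}
    first = next(walker(tree, counter))
    return first, counter["loads"]
```

With a toy tree of 256 groups of 64 arrays, loads_before_first() reports 257 group loads for the level-by-level walker and 2 for the depth-first one before the first array comes back. Both walkers load every group exactly once over a full traversal, so the difference is only in the time to the first result.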
====

The script:

"""
import cProfile, tables

big = tables.openFile("big.h5")
cProfile.run('big.getNode("/_00", "ENST00000260061")', "leaf.prof")
cProfile.run('big.walkNodes("/_38", classname="Array").next()', "group.prof")
cProfile.run('big.walkNodes("/", classname="Array").next()', "root.prof")
big.close()
"""

The results:

"""
$ for FILE in *.prof; do python -c "import pstats; pstats.Stats(\"$FILE\").strip_dirs().sort_stats('time').print_stats(5)"; done

Sun Sep 23 22:47:11 2007    group.prof

         792 function calls (781 primitive calls) in 1.641 CPU seconds

   Ordered by: internal time
   List reduced from 107 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.636    1.636    1.636    1.636 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
        1    0.001    0.001    1.638    1.638 group.py:389(_g_addChildrenNames)
       56    0.000    0.000    0.001    0.000 file.py:563(_ptNameFromH5Name)
      112    0.000    0.000    0.000    0.000 proxydict.py:33(__setitem__)
       56    0.000    0.000    0.000    0.000 path.py:166(isVisibleName)

Sun Sep 23 22:47:09 2007    leaf.prof

         676 function calls (665 primitive calls) in 1.307 CPU seconds

   Ordered by: internal time
   List reduced from 95 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.303    1.303    1.303    1.303 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
        1    0.001    0.001    1.304    1.304 group.py:389(_g_addChildrenNames)
       24    0.001    0.000    0.001    0.000 group.py:883(__setattr__)
       46    0.000    0.000    0.000    0.000 file.py:563(_ptNameFromH5Name)
       92    0.000    0.000    0.000    0.000 proxydict.py:33(__setitem__)

Sun Sep 23 22:53:25 2007    root.prof

         165265 function calls (164492 primitive calls) in 374.321 CPU seconds

   Ordered by: internal time
   List reduced from 124 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      254  373.471    1.470  373.471    1.470 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
      254    0.187    0.001  374.042    1.473 group.py:389(_g_addChildrenNames)
    16226    0.098    0.000    0.157    0.000 file.py:563(_ptNameFromH5Name)
    32452    0.095    0.000    0.095    0.000 proxydict.py:33(__setitem__)
    16226    0.059    0.000    0.099    0.000 path.py:166(isVisibleName)
"""