Hello Michael,

On Monday 24 September 2007, Michael Hoffman wrote:
> Thanks again for providing PyTables, which is a big help in my
> research. I am now dealing with files of tens of gigabytes which
> would probably be unthinkable to manage without PyTables. But using
> such large files brings some performance issues.
>
> I have a 26 GiB PyTables file which has a hierarchy that looks
> somewhat like this:
>
> root
> . 256 Groups
> .. approx. 64 Arrays (16328 arrays total)
> ... 2000 x 105 float64s
>
> Before any caching takes place, File.walkNodes(classname="Array") is
> very slow, taking more than six minutes to yield the first result.
> This is despite the fact that I can get a result from File.getNode()
> or File.walkNodes(group, classname="Array") quickly. I wrote a quick
> script to benchmark this. The script and cProfile results are below.
>
> If I may hazard a guess before delving into the source too much, it
> appears that File.walkNodes() is slow because it is a breadth-first
> iteration. So it loads each of the top-level groups into memory to
> check if they are Arrays before moving on to Nodes lower in the
> hierarchy. Is this right?
I don't think so. After considering your scenario, my opinion is that
File.walkNodes() is 'slow' simply because it has to retrieve many nodes
that are scattered across the disk. Your figures say that
File.walkNodes() is retrieving nodes at approx. 50 nodes/sec, while
report [1] shows that PyTables/HDF5 can retrieve nodes at about
1000 nodes/sec. Why the difference? Well, I think it is because in [1]
all the nodes were small and the whole file fit comfortably in the OS
filesystem cache (the benchmarks in [1] were repeated until the whole
file was in-cache), so accessing the metainfo about the nodes is just a
matter of following pointers in memory, which is fast.

However, your file is 26 GB, and most probably the node metainfo is
spread across this space on the disk. The first time you open the file,
File.walkNodes() has to gather all this on-disk metainfo from locations
very far from each other (there is likely a 'gap' of 1.6 MB, the size of
each Array, between the metainfo of consecutive nodes), so the latency
of the disk in reaching the interesting data has to be taken into
account. The access time of the 7200 RPM disks commonly available
nowadays is around 13~17 ms, and you are seeing around 20 ms/node in
your setup, which are pretty similar figures. So, IMO, what you are
experiencing is the latency of your own disk.

> If so, is there a way to speed this up? It seems to me that it
> shouldn't be necessary to load all of these Groups into memory as you
> know they are not Arrays and therefore don't match the classname I
> specified. Could some additional lazy loading be added?
>
> Additionally, what would you think about allowing the user to specify
> a depth-first iteration in the walk functions? That is how I will
> work around the problem unless you think a fix to this issue will be
> forthcoming very soon.
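Before getting to those questions, here is a quick numeric check of the latency argument above (plain Python; all figures are taken from this thread, and the per-node counts assume roughly one seek per node, which is of course a simplification):

```python
# Back-of-envelope check: does disk seek latency alone explain the
# six-minute walk? Figures come from the discussion above.
n_arrays = 16328                      # total Array nodes in the file
seek_low, seek_high = 0.013, 0.017    # 7200 RPM access time, seconds
observed = 0.020                      # ~20 ms/node from the benchmark

# Assuming roughly one seek per node, the full walk would take:
est_low = n_arrays * seek_low / 60    # ~3.5 minutes
est_high = n_arrays * seek_high / 60  # ~4.6 minutes

# The observed 20 ms/node rate lands at ~5.4 minutes, which is in the
# same ballpark as the "more than six minutes" reported once the extra
# reads for Group metadata are added on top.
est_observed = n_arrays * observed / 60

print(round(est_low, 1), round(est_high, 1), round(est_observed, 1))
```

So pure seek latency already accounts for nearly all of the measured time, which is why I don't expect software-side changes to buy much here.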
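Regarding the depth-first idea: for illustration, here is a minimal sketch of such a traversal. It uses tiny stand-in classes (my own, just to make the visiting order easy to see and test); against a real file the isinstance checks would be against tables.Group / tables.Array and the children would come from File.iterNodes():

```python
# Stand-in node classes modelling a PyTables-like hierarchy; in real
# code these roles are played by tables.Group and tables.Array.
class Group:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

class Array:
    def __init__(self, name):
        self.name = name

def walk_arrays_df(node):
    """Yield Array leaves depth-first: descend fully into each subgroup
    before moving on to its siblings, so the first result arrives after
    touching only one chain of groups rather than every toplevel one."""
    for child in node.children:
        if isinstance(child, Group):
            for leaf in walk_arrays_df(child):
                yield leaf
        else:
            yield child

root = Group('/', [
    Group('g0', [Array('a0'), Array('a1')]),
    Group('g1', [Array('a2')]),
])
print([a.name for a in walk_arrays_df(root)])  # prints ['a0', 'a1', 'a2']
```

Note that this changes only the order in which results appear (the first Array comes back quickly); it does not reduce the total number of on-disk seeks, so the overall walk time stays roughly the same.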
Well, unfortunately, and for the reasons exposed above, I don't think
you can significantly accelerate File.walkNodes() through software
improvements alone. One path you can follow is to reduce the number of
nodes by making each array larger. But if you don't want to do that and
you desperately need more speed, you could still try a hardware
solution. Relatively cheap solid-state disks based on flash memory are
appearing nowadays, with very low latency (around 0.1 ms, see [2]),
which would greatly accelerate your access pattern (probably by as much
as 100x).

[1] http://www.pytables.org/docs/NewObjectTreeCache.pdf
[2] http://www.tomshardware.com/2007/08/13/flash_based_hard_drives_cometh/page7.html#access_time

Hope that helps,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users