Hello Michael,

On Monday 24 September 2007, Michael Hoffman wrote:
> Thanks again for providing PyTables, which is a big help in my
> research. I am now dealing with files of tens of gigabytes which
> would probably be unthinkable to manage without PyTables. But using
> such large files brings some performance issues.
>
> I have a 26 GiB PyTables file which has a hierarchy that looks
> somewhat like this:
>
> root
> . 256 Groups
> .. approx. 64 Arrays (16328 arrays total)
> ... 2000 x 105 float64s
>
> Before any caching takes place, File.walkNodes(classname="Array") is
> very slow, taking more than six minutes to yield the first result.
> This is despite the fact that I can get a result from File.getNode()
> or File.walkNodes(group, classname="Array") quickly. I wrote a quick
> script to benchmark this. The script and cProfile results are below.
>
> If I may hazard a guess before delving into the source too much, it
> appears that File.walkNodes() is slow because it is a breadth-first
> iteration. So it loads each of the toplevel groups into memory to
> check if they are Arrays before moving on to Nodes lower in the
> hierarchy. Is this right?

I don't think so.  After considering your scenario, my opinion is that 
File.walkNodes() is 'slow' simply because it has to retrieve many nodes 
that are scattered across the disk.

Your figures show that File.walkNodes() is retrieving nodes at 
approximately 50 nodes/sec, while report [1] shows that PyTables/HDF5 
can retrieve nodes at about 1000 nodes/sec.  Why the difference?  I 
think it is because in [1] all the nodes were small and the whole file 
fit comfortably in the OS filesystem cache (the benchmarks in [1] were 
repeated until the entire file was in-cache), so accessing the node 
metadata is just a matter of following pointers in memory, which is 
fast.
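
If you want to check the effect of the filesystem cache yourself, a 
sketch along these lines (the file name is just a placeholder) should 
show the second pass of the walk running much faster than the first, 
once all the node metadata is already in memory:

import time
import tables

# Illustrative sketch: walking the same file twice exposes the effect
# of the OS filesystem cache -- the second (warm) pass should be much
# faster than the first (cold) one.
f = tables.openFile("data.h5", mode="r")   # placeholder file name
for label in ("cold cache", "warm cache"):
    start = time.time()
    n = sum(1 for node in f.walkNodes("/", classname="Array"))
    print("%s: %d arrays in %.1f s" % (label, n, time.time() - start))
f.close()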

However, your input file is 26 GB, and the node metadata is most 
probably spread across that space on the disk.  The first time you open 
the file, File.walkNodes() has to gather all this metadata from on-disk 
locations that are far apart from each other (there is likely a 'gap' 
of about 1.6 MB, the size of each Array, between the metadata of 
consecutive nodes), so the latency of the disk in reaching the 
interesting data has to be taken into account.  Access times for the 
7200 RPM disks commonly available nowadays are around 13~17 ms, and you 
are seeing around 20 ms/node in your setup, which are pretty similar 
figures.  So, IMO, what you are experiencing is simply the latency of 
your own disk.
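
As a quick sanity check of that argument, the back-of-the-envelope 
arithmetic (using the figures quoted above; an estimate, not a 
measurement) comes out in the same ballpark as what you observed:

n_arrays = 16328         # arrays reported in your file
ms_per_node = 20.0       # observed per-node cost, close to a 7200 RPM seek
minutes = n_arrays * ms_per_node / 1000.0 / 60.0
print("estimated walk time: %.1f minutes" % minutes)   # ~5.4 minutes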

> If so, is there a way to speed this up? It seems to me that it
> shouldn't be necessary to load all of these Groups into memory as you
> know they are not Arrays and therefore don't match the classname I
> specified. Could some additional lazy loading be added?
>
> Additionally, what would you think about allowing the user to specify
> a depth-first iteration in the walk functions? That is how I will
> work around the problem unless you think a fix to this issue will be
> forthcoming very soon.

Well, unfortunately, and for the reasons I explained above, I don't 
think you can significantly accelerate File.walkNodes() through 
software improvements alone.  One path you could follow is to reduce 
the number of nodes by making each array larger.  But if you don't want 
to do that and you desperately need more speed, you could still try a 
hardware solution.  Relatively cheap solid-state disks based on flash 
memory are starting to appear, with very low latencies (around 0.1 ms, 
see [2]); one of these would greatly accelerate your access pattern 
(probably by as much as 100x).
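
For what it's worth, if you still want to try the depth-first style 
traversal you mention, it can be approximated with the existing API 
along these lines (a rough sketch; the file name is a placeholder, and 
for the reasons above I would not expect a big win on a cold cache):

import tables

# Rough sketch of a group-by-group (depth-first style) traversal;
# "data.h5" is a placeholder name.
f = tables.openFile("data.h5", mode="r")
try:
    for group in f.walkGroups("/"):
        # listNodes() only touches the nodes hanging from this group, so
        # Arrays start to come out as soon as their group has been read.
        for array in f.listNodes(group, classname="Array"):
            print(array._v_pathname)
finally:
    f.close()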

[1] http://www.pytables.org/docs/NewObjectTreeCache.pdf
[2] http://www.tomshardware.com/2007/08/13/flash_based_hard_drives_cometh/page7.html#access_time

Hope that helps,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
