Hi Abiel,

On Saturday 26 September 2009 05:07:34, Abiel Reinhart wrote:
> I am attempting to store a large number of moderately-sized
> variable-length numpy arrays in a PyTables database, where each array
> can be referred to by a string key. Looking through the mailing list
> archives, it seems that one possible solution to this problem is to
> simply create a large number of Array objects. However, I have found
> write times to be highly variable when working with a large number of
> arrays (100,000 for example). For example, consider the code below:
>
> a1 = np.arange(1000)
> h5f = tables.openFile("test.h5f", mode="w")
> for i in range(10000):
>     h5f.createArray("/", "test"+str(i), a1)
>
> In this simple example, I take a numpy array with 1,000 integers and
> write it to a database 10,000 times. This typically takes about 7
> seconds (PyTables 2.1.2 with Python 2.6 on Windows Vista). If I then
> increase the number of writes to 100,000, however, the performance can
> become quite nonlinear. I have had the operation complete in anywhere
> from about a minute and a half to seven minutes. Moreover, it is
> sometimes the case that when I then go back to writing 10,000 arrays,
> the operation no longer takes 7 seconds but rather close to 40
> seconds. Keep in mind that mode="w", so the database should be
> starting fresh each time. When this happens, the only way I can seem
> to get the write time back down to 7 seconds is to manually delete the
> database file, which is surprising because it seems that this should
> happen anyway when mode="w".
Well, working with many datasets (many >= 10,000) is not recommended,
because it requires too much metadata to be loaded, mainly into the *HDF5*
caches, and this is exactly why a warning is issued when too many nodes
are used.

> One thing I was a bit confused about at first was whether my
> performance problems going from 10,000 writes to 100,000 had something
> to do with creating too many arrays under a single group. After all, I
> do receive a PerformanceWarning when I exceed 4096 nodes, although it
> is not clear to me whether this is a legacy warning that only applied
> when PyTables had to load all nodes when a database was opened. In any
> case, I tried splitting up my 100,000 writes by creating 100 groups
> and placing 1,000 arrays in each. This did not seem to resolve the
> issue.

Yeah. PyTables can warn when you have many nodes in one single group, or
when the level of nested groups is too high, because it keeps counters for
such things. However, it does not have a counter for the *total* number of
nodes in the file, and this is why no warning is issued in this case.

> My question then is, am I doing something wrong in my code, and what
> is the best way to handle situations in which you need to work with
> databases that need to have a large number of numpy arrays stored in
> them and accessible with a textual key?

No, there is nothing wrong in your code, except that PyTables (or, more
exactly, HDF5) is not designed for handling very large numbers of nodes,
but rather for a few to a moderate number of nodes that can each be
extremely large. David's suggestion of VLArrays may be the way to go (a
rough sketch of that approach follows below), but you are right that
deleting a row in a VLArray will imply rewriting the entire dataset, and
this is not efficient. Mmh, I'm sorry, but perhaps you should try other
tools for doing what you want to do.

HTH,

--
Francesc Alted
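A minimal sketch of the VLArray route mentioned above (this code is not from
the thread): all arrays are stored as rows of a single VLArray, and a small
Table maps each string key to its row number, so the file holds two nodes no
matter how many arrays you store. It reuses the PyTables 2.x camelCase API
from the snippet in the question; the node names, the KeyMap description, the
64-byte key size and the Int32Atom are illustrative choices, and, as noted
above, deleting an entry would still mean rewriting the VLArray.

import numpy as np
import tables

# Map each textual key to a row number in the VLArray.
class KeyMap(tables.IsDescription):
    key = tables.StringCol(64)     # textual key (itemsize is illustrative)
    vlrow = tables.Int64Col()      # row number inside the VLArray

h5f = tables.openFile("test_vlarray.h5", mode="w")
vla = h5f.createVLArray(h5f.root, "data", tables.Int32Atom(),
                        "variable-length integer arrays")
idx = h5f.createTable(h5f.root, "keymap", KeyMap, "key -> VLArray row")

a1 = np.arange(1000, dtype=np.int32)
newrow = idx.row
for i in range(10000):
    vla.append(a1)                 # one VLArray row per stored array
    newrow['key'] = "test" + str(i)
    newrow['vlrow'] = vla.nrows - 1
    newrow.append()
idx.flush()

# Retrieve an array by key with an in-kernel query on the mapping table.
rownums = [r['vlrow'] for r in idx.where('key == "test42"')]
arr = vla[rownums[0]]

h5f.close()

The point of the design is that the HDF5 metadata stays tiny regardless of
the number of stored arrays, while key lookups only have to scan the small
mapping table rather than walk a group with 100,000 children.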