I am attempting to store a large number of moderately-sized, variable-length numpy arrays in a PyTables database, where each array can be referred to by a string key. Looking through the mailing list archives, it seems that one possible solution to this problem is simply to create a large number of Array objects. However, I have found write times to be highly variable when working with a large number of arrays (100,000, for example). Consider the code below:
    import numpy as np
    import tables

    a1 = np.arange(1000)
    h5f = tables.openFile("test.h5f", mode="w")
    for i in range(10000):
        h5f.createArray("/", "test" + str(i), a1)
    h5f.close()

In this simple example, I take a numpy array with 1,000 integers and write it to a database 10,000 times. This typically takes about 7 seconds (PyTables 2.1.2 with Python 2.6 on Windows Vista). If I increase the number of writes to 100,000, however, the performance can become quite nonlinear: I have had the operation complete in anywhere from about a minute and a half to seven minutes. Moreover, it is sometimes the case that when I then go back to writing 10,000 arrays, the operation no longer takes 7 seconds but rather close to 40 seconds. Keep in mind that mode="w", so the database should be starting fresh each time. When this happens, the only way I can seem to get the write time back down to 7 seconds is to delete the database file manually, which is surprising, because opening with mode="w" should have that effect anyway.

One thing I was a bit confused about at first was whether my performance problems in going from 10,000 writes to 100,000 had something to do with creating too many arrays under a single group. After all, I do receive a PerformanceWarning when I exceed 4096 nodes, although it is not clear to me whether this is a legacy warning that applied only when PyTables had to load all nodes as soon as a database was opened. In any case, I tried splitting up my 100,000 writes by creating 100 groups and placing 1,000 arrays in each. This did not seem to resolve the issue.

My question, then, is: am I doing something wrong in my code, and what is the best way to handle situations in which a database needs to hold a large number of numpy arrays, each accessible by a textual key? Thank you.

Abiel Reinhart
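P.S. For concreteness, here is roughly what my grouped attempt looked like, scaled down to 10 groups of 100 arrays each (the real run used 100 groups of 1,000; the group and array names are just illustrative, and on newer PyTables the calls are spelled open_file/create_group/create_array instead of openFile/createGroup/createArray):

```python
import numpy as np
import tables

a1 = np.arange(1000)

# Split the arrays across groups instead of putting them all under "/",
# to stay under the 4096-children PerformanceWarning threshold per group.
h5f = tables.open_file("grouped.h5", mode="w")   # openFile() on PyTables 2.x
for g in range(10):                               # real run: 100 groups
    grp = h5f.create_group("/", "group%03d" % g)  # createGroup() on 2.x
    for i in range(100):                          # real run: 1,000 arrays each
        h5f.create_array(grp, "test%d" % i, a1)   # createArray() on 2.x
h5f.close()
```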