I am attempting to store a large number of moderately-sized
variable-length numpy arrays in a PyTables database, where each array
can be referred to by a string key. Looking through the mailing list
archives, it seems that one possible solution to this problem is to
simply create a large number of Array objects. However, I have found
write times to be highly variable when working with a large number of
arrays (100,000, for example). Consider the code below:

import numpy as np
import tables

a1 = np.arange(1000)
h5f = tables.openFile("test.h5f", mode="w")
for i in range(10000):
    h5f.createArray("/", "test" + str(i), a1)
h5f.close()

In this simple example, I take a numpy array with 1,000 integers and
write it to a database 10,000 times. This typically takes about 7
seconds (PyTables 2.1.2 with Python 2.6 on Windows Vista). If I then
increase the number of writes to 100,000, however, the performance can
become quite nonlinear. I have had the operation complete in anywhere
from about a minute and a half to seven minutes. Moreover, it is
sometimes the case that when I then go back to writing 10,000 arrays,
the operation no longer takes 7 seconds but rather close to 40
seconds. Keep in mind that I open with mode="w", so the database should
be starting fresh each time. When this happens, the only way I seem to
be able to get the write time back down to 7 seconds is to manually
delete the database file, which is surprising, since mode="w" should
have that effect anyway.
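
For completeness, here is roughly how I produce the timings above, with
the manual delete included (just a sketch; the timing is simple
wall-clock time):

import os, time
import numpy as np
import tables

# Remove any stale file by hand before timing, since mode="w"
# alone does not seem to restore the original performance.
if os.path.exists("test.h5f"):
    os.remove("test.h5f")

a1 = np.arange(1000)
start = time.time()
h5f = tables.openFile("test.h5f", mode="w")
for i in range(10000):
    h5f.createArray("/", "test" + str(i), a1)
h5f.close()
print "10,000 writes: %.1f seconds" % (time.time() - start)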

One thing I was a bit confused about at first was whether my
performance problems going from 10,000 writes to 100,000 had something
to do with creating too many arrays under a single group. After all, I
do receive a PerformanceWarning when I exceed 4096 nodes, although it
is not clear to me whether this is a legacy warning that only applied
when PyTables had to load all nodes when a database was opened. In any
case, I tried splitting up my 100,000 writes by creating 100 groups
and placing 1,000 arrays in each. This did not seem to resolve the
issue.
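
Concretely, the split looked roughly like this (a sketch, reusing a1
from above; the group and array names are just illustrative):

h5f = tables.openFile("test.h5f", mode="w")
for g in range(100):
    # 100 groups with 1,000 arrays each, instead of
    # 100,000 arrays under the root group
    group = h5f.createGroup("/", "group" + str(g))
    for i in range(1000):
        h5f.createArray(group, "test" + str(i), a1)
h5f.close()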

My question, then, is: am I doing something wrong in my code, and what
is the best way to handle situations in which a database needs to store
a large number of numpy arrays, each accessible by a textual key?
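
One alternative I have wondered about, but not benchmarked, is packing
all of the arrays into a single VLArray and keeping a separate
key-to-row mapping instead of one node per array. A rough sketch,
assuming 64-bit integer data and leaving aside how the mapping itself
would be persisted (the file name test_vl.h5f is just illustrative):

h5f = tables.openFile("test_vl.h5f", mode="w")
vla = h5f.createVLArray("/", "arrays", tables.Int64Atom())

key_to_row = {}                       # key -> row index, kept in memory here
for i in range(100000):
    key_to_row["test" + str(i)] = vla.nrows
    vla.append(np.arange(1000, dtype="int64"))

# Lookup by key: fetch the row number, then read that row.
a = vla[key_to_row["test500"]]
h5f.close()

Would something along these lines be expected to scale better than
100,000 Array nodes?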

Thank you.

Abiel Reinhart
