On Saturday 07 April 2007 13:55, Michael Hoffman wrote:
> PyTables has been a great help to my research. I was wondering if I
> could make my use somewhat more efficient.
>
> For a particular project, I produce about 22000 tables of 2000 rows
> each. These are initially produced by a distributing computing farm into
> 2500 files, but I concatenate them so that (a) I won't have so many
> files lying around, which our system administrators hate, and (b) I can
> randomly access tables by name easily.
>
> Of course, having about 4 GiB and 22000 tables in one file slows things
> down a bit, especially since it is stored on a remote Lustre file
> system. One thing I thought of was to find some middle ground and
> concatenate the original file set into a small number of files, but not
> just one. Then I could make a separate file for an index to provide
> random access.
>
> Is this a good idea? Any suggestions as to a target number of datasets
> (I know 4096 was once suggested as the max) or data size per file? Are
> there any facilities within PyTables or elsewhere to make this easier?

Well, it largely depends on your requirements. Now that disks offer huge
capacities at reasonable prices, I'd advocate creating a monolithic file
containing all your data.  You can use compression to reduce the size of the
file.  Reducing the number of tables by enlarging them also helps keep the
demand for space low and, as a side effect, reduces the number of objects in
the hierarchy.  In principle, accessing a large number of objects in the tree
shouldn't exhibit a very noticeable slowdown, especially once the LRU cache
of objects starts to warm up.
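As an illustration of the compression route, here is a minimal sketch of
creating a compressed table (the table description, file name, and the
zlib/level-5 filter settings are arbitrary assumptions, not your actual
layout):

```python
import tables

# A toy two-column row layout; the real description would of course differ.
class Row(tables.IsDescription):
    name  = tables.StringCol(16)
    value = tables.Float64Col()

# zlib at level 5 with the shuffle filter is a reasonable general choice.
filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)

with tables.open_file('data.h5', mode='w') as f:
    table = f.create_table('/', 'results', Row, filters=filters)
    row = table.row
    for i in range(2000):
        row['name'] = 'item%05d' % i
        row['value'] = i * 0.5
        row.append()
    table.flush()
```

Filters set on a table apply transparently on every read and write, so the
rest of your code does not change at all.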

Regarding the limit of 4096 nodes per group, it was removed some time ago
(PyTables 1.2?), but it is true that keeping this number low makes object
browsing faster, so if you have too many objects, you can always arrange
them in a deeper tree.
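For instance, one hypothetical way to fan thousands of flat table names out
into a two-level hierarchy is to hash each name into a fixed number of
groups (the bucket count of 256 below is an arbitrary assumption; a stable
hash such as CRC32 keeps the mapping reproducible across runs):

```python
import zlib

def nested_path(table_name, nbuckets=256):
    """Map a flat table name to a two-level HDF5 path, so that no single
    group holds more than roughly (total_tables / nbuckets) nodes."""
    bucket = zlib.crc32(table_name.encode()) % nbuckets
    return '/g%03d/%s' % (bucket, table_name)

# The same name always lands in the same group, so lookups stay cheap.
print(nested_path('chr21_0042'))
```

With 22000 tables and 256 buckets, each group ends up holding fewer than a
hundred nodes on average, well under the old 4096 figure.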

If that is definitely not feasible for you, then I'd split the file up to
the point where each piece is manageable, and build a separate index to
access them, as you suggest.  Another possibility would be for PyTables to
support mounting external files at a given place in the hierarchy (much
like mounting one filesystem inside another). Unfortunately, despite the
fact that HDF5 already supports this, it has not been implemented in
PyTables yet (and, to tell the truth, we have no plans to implement it in
the near term).
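Absent built-in mounting, a small hand-rolled index can do the job: a plain
mapping from table name to the file that holds it, persisted once and
consulted before each open. A sketch (the file and table names below are
made up for illustration):

```python
import pickle

def build_index(layout):
    """layout: {filename: [table names it holds]} -> {name: filename}."""
    index = {}
    for fname, names in layout.items():
        for name in names:
            index[name] = fname
    return index

index = build_index({
    'part0.h5': ['tbl_a', 'tbl_b'],
    'part1.h5': ['tbl_c'],
})

# Persist the index next to the data files...
with open('table_index.pkl', 'wb') as f:
    pickle.dump(index, f)

# ...and later, a lookup tells you which file to pass to tables.open_file().
with open('table_index.pkl', 'rb') as f:
    assert pickle.load(f)['tbl_c'] == 'part1.h5'
```

The index file is tiny (one entry per table), so rebuilding it whenever you
reshuffle the data files costs next to nothing.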

Hope that helps,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
