On Sunday 11 May 2008, Francesc Alted wrote:
> On Sunday 11 May 2008, Ivan Vilata i Balaguer wrote:
> > Dinesh B Vadhia (on 2008-05-10 at 10:10:29 -0700) said::
> > > I'm using the OS filesystem to store 32,000 image files.  I'm
> > > now going to move them into a datastore and the choices are
> > > pysqlite or MySQL or PyTables.  The number of images will grow
> > > rapidly (to the millions and more) and hence performance is
> > > critical.  Multiple images will be accessed from the data store
> > > at a time.  There are no write operations, just read only.
> > >
> > > The data schema is: image index (on the image filename), image
> > > filename, image (jpg initially but will be other formats in the
> > > future).
> > >
> > > Any and all suggestions would be appreciated.
> >
> > Well, I don't quite understand the data schema (are you describing
> > a row of three fields in a table?), but you may have a look at the
> > ``tables.nodes.filenode`` module, which contains a ``FileNode``
> > class that offers a Python file-like interface to a PyTables
> > dataset (a one-dimensional ``EArray``) holding the bytes of the
> > file.  It should be especially useful if you keep images stored in
> > a file format like JPEG, PNG and the like.
> >
> > Also, I'd recommend not cramming all images under a single group to
> > avoid performance problems when opening the group, but to pack them
> > in groups of at most 4096 (see
> > ``tables.parameters.MAX_GROUP_WIDTH``) images per group.
>
> Yeah, I think Ivan is basically right in his assessment.  However,
> I'd use a regular Array object for saving the images themselves
> instead of a FileNode.  A FileNode is meant more to deal with text
> where you can add and delete lines, but this is not the case for
> images.  For cases where you don't need the append/remove features,
> an Array is probably much more efficient.  And if you need
> compression, you may want to use a CArray instead.  The image index
> and filename may be saved as HDF5 attributes of the *Array objects.

Sorry, as Ivan told me privately, the FileNode serves to keep binary
data as well as text.  However, since it inherits from EArray, you need
to pass the `expectedrows` parameter so that PyTables can make a good
guess at the chunk sizes of the datasets (which is important for
achieving maximum I/O throughput).  This is not necessary when using
Array objects (no chunking) or CArray (where the dataset size is
implicitly defined by the `shape` parameter).
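
To make the three options concrete, here is a minimal sketch of how one
image could be stored each way, with the index and filename kept as
attributes.  It assumes the snake_case method names of PyTables 3.x
(`create_array`, `create_carray`, `create_earray`), which postdate this
thread (releases of that time spelled them `createArray` and so on).
The helpers `save_image`, `load_image` and `roundtrip_demo`, the
attribute names and the zlib compression level are all made up for
illustration, and the payload is a stand-in for real JPEG bytes.

```python
import os
import tempfile

import numpy as np
import tables

KINDS = ("array", "carray", "earray")


def save_image(h5file, where, name, data, filename, index, kind="array"):
    """Store raw image bytes as a 1-D uint8 dataset (hypothetical helper)."""
    buf = np.frombuffer(data, dtype=np.uint8).copy()
    if kind == "array":
        # Plain Array: no chunking, no compression.
        node = h5file.create_array(where, name, obj=buf)
    elif kind == "carray":
        # CArray: chunked and compressible; its size is fixed by `shape`.
        node = h5file.create_carray(
            where, name, atom=tables.UInt8Atom(), shape=(len(buf),),
            filters=tables.Filters(complevel=5, complib="zlib"))
        node[:] = buf
    elif kind == "earray":
        # EArray: extendable, so hint the final size with `expectedrows`.
        node = h5file.create_earray(
            where, name, atom=tables.UInt8Atom(), shape=(0,),
            expectedrows=len(buf))
        node.append(buf)
    else:
        raise ValueError("unknown kind: %r" % (kind,))
    # Index and filename travel with the dataset as HDF5 attributes.
    node.attrs.image_filename = filename
    node.attrs.image_index = index
    return node


def load_image(h5file, path):
    """Return (bytes, filename, index) for the dataset at `path`."""
    node = h5file.get_node(path)
    return (node.read().tobytes(),
            str(node.attrs.image_filename),
            int(node.attrs.image_index))


def roundtrip_demo():
    """Write one fake image per storage kind, then read them all back."""
    payload = bytes(range(256)) * 64  # stand-in for real JPEG bytes
    path = os.path.join(tempfile.mkdtemp(), "images.h5")
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "images")
        for i, kind in enumerate(KINDS):
            save_image(h5, group, "img_%s" % kind, payload,
                       "photo%d.jpg" % i, i, kind=kind)
    results = {}
    with tables.open_file(path, mode="r") as h5:
        for kind in KINDS:
            results[kind] = load_image(h5, "/images/img_%s" % kind)
    return payload, results
```

Note that already-compressed formats such as JPEG will probably gain
little from the CArray filters, so that variant is mostly of interest
for uncompressed image formats.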

At any rate, in cases where you are going to save a large number of
datasets, it is always nice to experiment with the different
possibilities and choose the best one for your case.
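
One layout worth including in those experiments is Ivan's advice quoted
above, i.e. at most 4096 images per group.  A rough sketch of mapping a
running image number to a group path follows; the `/images/gNNNNNN`
naming scheme and the `group_path_for` helper are only assumptions for
illustration, not anything PyTables prescribes.

```python
MAX_IMAGES_PER_GROUP = 4096  # the per-group limit Ivan suggests


def group_path_for(image_number, per_group=MAX_IMAGES_PER_GROUP):
    """Return the (hypothetical) group path holding the given image."""
    if image_number < 0:
        raise ValueError("image_number must be non-negative")
    # Integer division buckets consecutive images into the same group.
    return "/images/g%06d" % (image_number // per_group)
```

With this scheme the 32,000 images from the original question would
spread over eight groups, `g000000` through `g000007`.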

Cheers,

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
