Hello Balaji,

> On Thursday 14 June 2007 14:39, you wrote:
> > Hi,
> >
> > I'm new to pytables and would like to work with tables which have  
> > multiple columns with variable length strings. This seems like a  
> > common use case, but as most of the examples in the manual deal with  
> > fixed-length strings I was wondering what the best practices are for  
> > this situation. There are a few possibilities which occurred to me:
> >
> > 1) Make a group in which fixed width columns are kept in one table,  
> > with separate VLArrays for each of the variable length columns using  
> > the VLStringAtom(). Disadvantage: now the data is spread over the  
> > place in multiple variables and the number of rows in the main table  
> > and the individual VLArrays are not constrained to be the same.

That is a common solution, from what I have seen among users.
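
A minimal sketch of option 1, written against the current PyTables API 
(open_file, create_table, create_vlarray) rather than the 2007 one. The 
Record class, the "data" group, and the field and node names are made up for 
illustration. Since nothing in the file enforces equal lengths, the sketch 
checks the row counts by hand:

```python
import os
import tempfile

import tables


class Record(tables.IsDescription):
    """Fixed-width part of each row (hypothetical field names)."""
    ident = tables.Int32Col()
    score = tables.Float64Col()


def write_rows(path, rows):
    """Store (ident, score, comment) tuples: numbers in a Table, text in a VLArray."""
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "data")
        table = h5.create_table(group, "fixed", Record)
        comments = h5.create_vlarray(group, "comment", tables.VLStringAtom())
        row = table.row
        for ident, score, comment in rows:
            row["ident"] = ident
            row["score"] = score
            row.append()
            # VLStringAtom stores byte strings, so encode the text first.
            comments.append(comment.encode("utf-8"))
        table.flush()
        # The file format does not tie the two nodes together; verify manually.
        assert table.nrows == comments.nrows


def read_row(path, i):
    """Rebuild logical row i from the table and its companion VLArray."""
    with tables.open_file(path, mode="r") as h5:
        rec = h5.root.data.fixed[i]
        text = h5.root.data.comment[i]
        return int(rec["ident"]), float(rec["score"]), text.decode("utf-8")
```

Writing every row through a single helper such as write_rows is one way to 
keep the table and the VLArrays in step; the fixed-width table can still be 
queried and indexed as usual.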

> > 2) Make a VLArray with an ObjectAtom for each row. Then encode each  
> > row in the table as a single python object. Disadvantage: while  
> > possible, now you can't do any kind of indexing and the benefits of  
> > keeping the data in pytables format are less apparent.

Yeah. I don't like this one precisely because of what you say: the benefits of 
pytables are quite reduced.

> > 3) Truncate variable length strings by estimating a maximum length of  
> > each string, thereby forcing the data into a standard table with  
> > fixed length StringCols. Disadvantage: This is a hack as it's not  
> > always obvious what the maximum length will be. Moreover, space would  
> > be wasted as most strings are much less than the maximum length.

This is also a solution that I particularly like and, in order to 
save as much space as possible, compression can be used.
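
As a rough sketch of how compression might be enabled on a table with a 
fixed-length StringCol (this is not the script attached to this message; the 
function name, the field names, the row contents, and the choice of zlib are 
all assumptions, and it uses the current PyTables API):

```python
import os
import tempfile

import tables


def table_file_size(path, string_size, nrows=100000, complevel=0):
    """Write an Int32 + fixed-width string table; return the file size in bytes."""
    description = {
        "number": tables.Int32Col(pos=0),
        "text": tables.StringCol(string_size, pos=1),
    }
    # complevel=0 disables compression; zlib is an assumed choice of compressor.
    filters = tables.Filters(complevel=complevel, complib="zlib")
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "test", description, filters=filters)
        # Short values are padded up to string_size; the padding compresses well.
        table.append([(i, b"row %d" % i) for i in range(nrows)])
        table.flush()
    return os.path.getsize(path)
```

Calling it with string_size set to 16, 160 and 1600, once with complevel=0 
and once with complevel=1, gives a comparison of the same kind as the one 
reported here; the exact sizes will depend on the data and the compressor.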

For example, I'm attaching a small script that reports how much disk space is 
taken by a table with 100000 entries and a couple of fields: an Int32 and a 
StringCol. The StringCol comes in different sizes: 16 (Small), 160 (Medium) 
and 1600 (Big). Here are the results when not using compression:

Size for /tmp/TestSmall.h5: 2.0M
Size for /tmp/TestMedium.h5: 16M
Size for /tmp/TestBig.h5: 154M

and here when using compression:

Size for /tmp/TestSmall.h5: 180K
Size for /tmp/TestMedium.h5: 400K
Size for /tmp/TestBig.h5: 1.7M

So, with compression, for a string size 10 times bigger, the space needed is 
just 2.2x more, and for a string size 100 times bigger, the required space on 
disk is just about 10x more.
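
The ratios can be checked directly from the sizes reported above. This small 
sketch assumes that K and M are powers of 1024, as in "du -h" style output; 
the helper name to_kib is made up:

```python
def to_kib(size):
    """Convert a size string such as '180K' or '1.7M' to kibibytes."""
    units = {"K": 1, "M": 1024}  # assumption: binary units
    return float(size[:-1]) * units[size[-1]]


def ratios(sizes):
    """Express every size relative to the 'Small' entry."""
    base = to_kib(sizes["Small"])
    return {name: to_kib(size) / base for name, size in sizes.items()}


uncompressed = ratios({"Small": "2.0M", "Medium": "16M", "Big": "154M"})
compressed = ratios({"Small": "180K", "Medium": "400K", "Big": "1.7M"})

for name in ("Small", "Medium", "Big"):
    print(f"{name}: {uncompressed[name]:.1f}x plain, {compressed[name]:.1f}x compressed")
```

Without compression the file grows almost in proportion to the string size 
(8x and 77x), while with compression the growth is far smaller.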

Well, this is not a perfect solution, but it can be useful in some scenarios.

> > The basic issue is that VLStringCol does not exist.  I understand the  
> > reasons for this (it breaks the concept of having a fixed-width  
> > record length, unless you use a pointer in the record), but is there  
> > any available workaround? E.g. is there any way to nest multiple  
> > VLArray objects in a table?

Unfortunately not.  Ivan and I were toying with the idea of implementing a 
kind of VLTable (a group actually) which could be populated with different 
combinations of EArrays and VLArrays so that a VLStringCol could be supported 
seamlessly.  However, such an implementation is not trivial at all and will 
not be done anytime soon (unless someone contributes it or is willing to 
sponsor the work).

> > PS: When is Pytables Pro scheduled to be released? The website says  
> > April 2007, but I haven't seen any recent postings about it.

Actually we are taking the final steps for the big event. Basically we are at 
the stage of building binary packages for the most common platforms (Win, 
MacOSX and Linux) in order to make the installation as easy as possible.  In 
any case, and although we had hoped to release in June, I'm afraid 
that the first half of July is more realistic (and seriously, we don't 
think there will be another significant delay beyond this new deadline).

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"

Attachment: prova.py
Description: application/python

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
