Thanks Josh (I have evidently sent a mail from the wrong address...)

On Dec 13, 2011, at 11:14 AM, PyTables Org wrote:

> Forwarding to the list. ~Josh.
> 
> Begin forwarded message:

>> Hi all, I need to store millions of entries, each made of an n-long string, an 
>> n-long array of integers, and an ID for every string/array pair. Are there best 
>> practices for structuring a PyTables file for this? Also, does the structure 
>> influence compression efficiency?


I've been trying a few combinations. The most naive implementation is a table with 
three columns, each holding a string (yes, the array of integers is actually 
another string that I read from a file). With it I get good compression and 
reasonable speed. I now want to push this further, so I need to understand a 
couple of things. First of all, assume that each data entry looks like this:

(ID, S, Q)

where ID is the (unique) entry id, S is a fixed-size string over a limited 
alphabet (5 letters), and Q is a fixed-size string over a limited alphabet (90 
letters), with len(S) == len(Q).
- Does it make any difference whether S (or Q) is stored as a string or as a 
uint8 array? 
- I tried storing the same data in 3 VLArrays; everything was slower and the 
resulting file was bigger. Since the length is fixed, I thought I could use 
CArray, but I don't know the best strategy for assigning a CArray (two, 
actually) to each entry. Can I simply build a table with arrays as column 
elements?
- Does it make any difference whether the ID is stored as a "value" (i.e. an 
item in a table column or array) or as the name of the item? I thought I could 
create a file with two groups (/Q and /S), and in each group I tried adding one 
CArray per entry: /Q/ID1, /Q/ID2... Is that a good choice?
- When does Blosc perform worse than zlib? I mean, other than in my tests :-P 
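
To make the "table with arrays as column elements" idea concrete, here is a minimal sketch, not a recommendation. It assumes the current snake_case PyTables API (open_file, create_table; the releases available in 2011 spelled these openFile and createTable). The class name Entry, the helper write_entries, the node name "entries", the length N = 50 and the compression level are all hypothetical choices made for illustration. S is kept as a fixed-size string and Q as a uint8 array cell only so that both variants appear side by side.

```python
import os
import tempfile

import numpy as np
import tables

N = 50  # hypothetical fixed length; the only real constraint is len(S) == len(Q)


class Entry(tables.IsDescription):
    ID = tables.StringCol(100, pos=0)        # unique entry id, len(ID) < 100
    S = tables.StringCol(N, pos=1)           # S stored as a fixed-size string
    Q = tables.UInt8Col(shape=(N,), pos=2)   # Q stored as a uint8 array cell


def write_entries(path, entries, expected_rows):
    """Write (ID, S, Q) tuples of bytes objects into one compressed table."""
    # Blosc at level 5 is an arbitrary choice for this sketch.
    filters = tables.Filters(complevel=5, complib="blosc")
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "entries", Entry,
                                filters=filters,
                                expectedrows=expected_rows)
        row = table.row
        for entry_id, s, q in entries:
            row["ID"] = entry_id
            row["S"] = s
            # Reinterpret the bytes of Q as uint8 values without copying.
            row["Q"] = np.frombuffer(q, dtype=np.uint8)
            row.append()
        table.flush()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    s = b"ACGTN" * 10            # made-up 5-letter-alphabet string
    q = bytes(range(33, 83))     # made-up 50-byte quality-like string
    write_entries(demo_path, [(b"id1", s, q), (b"id2", s, q)], expected_rows=2)
    with tables.open_file(demo_path) as h5:
        print(h5.root.entries.nrows)
```

Passing expectedrows lets the estimate of the number of items be handed to PyTables when the table is created.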


Note that len(S) << 10000 and len(ID) < 100. I have millions of small data 
entries (~300 bytes each at the moment).
Note also that I always have a good estimate of the number of items (in case I 
have to create a table) or of the number of MB (for VLArrays).
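
For reference, converting between the two estimates is simple arithmetic. The sketch below assumes fixed-size fields and ignores any container overhead. The function names and the example figures (a 100-byte ID field, 100-character S and Q, five million entries) are hypothetical and chosen only to match the ~300 bytes per entry mentioned above.

```python
def raw_entry_bytes(id_len, seq_len):
    """Uncompressed bytes per (ID, S, Q) entry with fixed-size fields."""
    # S and Q have the same length, so the sequence length counts twice.
    return id_len + 2 * seq_len


def raw_size_mb(n_entries, id_len, seq_len):
    """Uncompressed size in MB (10**6 bytes) for n_entries entries."""
    return n_entries * raw_entry_bytes(id_len, seq_len) / 1e6


if __name__ == "__main__":
    print(raw_entry_bytes(100, 100))          # bytes per entry
    print(raw_size_mb(5_000_000, 100, 100))   # total MB before compression
```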

thanks

d

---
Davide Cittaro
daweonl...@gmail.com
http://sites.google.com/site/davidecittaro/

