Thanks Josh (I have evidently sent a mail from the wrong address...)

On Dec 13, 2011, at 11:14 AM, PyTables Org wrote:
> Forwarding to the list. ~Josh.
>
> Begin forwarded message:
>> Hi all, I need to store millions of entries, each made of an n-long string, an n-long
>> array of integers, and an ID for every string/array pair. Are there best
>> practices for structuring a PyTables file for this? Also, does the structure
>> influence compression efficiency?

I'm trying some combinations... The simplest implementation is a table with three columns, each holding a string (yes, the array of integers is actually another string I read from a file). I get good compression and reasonable speed. I now want to push the whole thing further, so I need to understand a couple of things.

First of all, assume that each data entry looks like (ID, S, Q), where ID is the unique entry id, S is a fixed-size string over a small alphabet (5 letters), Q is a fixed-size string over a larger alphabet (90 letters), and len(S) == len(Q).

- Does it make any difference to store S (or Q) as a string or as a uint8 array?
- I tried to store the same data in 3 VLArrays; everything was slower and the resulting file was bigger. Since the length is fixed, I thought I could use CArray, but I don't know the best strategy for assigning a CArray (two, actually) to each entry... Can I simply build a table with arrays as column elements?
- Does it make any difference to store the ID as a "value" (i.e. an item in a table column or array) or as the name of the item? I thought I could create a file with two groups (/Q and /S) and, in each group, add one CArray per entry: /Q/ID1, /Q/ID2... Is that a good choice?
- When does Blosc perform worse than zlib? I mean, except for my tests :-P

Note that len(S) << 10000 and len(ID) < 100. I have millions of small data entries (~300 bytes each, at the moment)... Note also that I always have a good estimate of the number of items (in case I have to create a table) or of the number of MB (for VLArrays).
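To make the layout concrete, here is a minimal sketch of the single-table variant I have in mind, with S kept as a fixed-size string and Q stored as a uint8 array in each cell. Everything specific in it is an assumption for illustration only: the length N = 100, the names Entry, entry_id, s, q, write_entries and read_first, the Blosc level, and the PyTables 3.x function names (open_file, create_table). It is not a recommendation, just the shape of the thing I am asking about.

```python
import numpy as np
import tables

N = 100  # assumed fixed length of S and Q (hypothetical value)


class Entry(tables.IsDescription):
    # One row per (ID, S, Q) entry; all column names are illustrative.
    entry_id = tables.StringCol(100, pos=0)   # len(ID) < 100
    s = tables.StringCol(N, pos=1)            # S as a fixed-size string
    q = tables.UInt8Col(shape=(N,), pos=2)    # Q as a uint8 array per cell


def write_entries(path, entries, expectedrows):
    """Write (ID, S, Q) byte-string tuples into one compressed table."""
    # Compression settings are an arbitrary choice for the sketch.
    filters = tables.Filters(complevel=5, complib="blosc")
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "entries", Entry,
                                filters=filters, expectedrows=expectedrows)
        row = table.row
        for entry_id, s, q in entries:
            row["entry_id"] = entry_id
            row["s"] = s
            # Q is read from file as text; store its byte values as uint8.
            row["q"] = np.frombuffer(q, dtype=np.uint8)
            row.append()
        table.flush()


def read_first(path):
    """Return the first entry as (ID, S, Q), with Q converted back to bytes."""
    with tables.open_file(path, mode="r") as h5:
        rec = h5.root.entries[0]
        return rec["entry_id"], rec["s"], rec["q"].tobytes()
```

The expectedrows argument is where the estimate of the number of items would go. Switching the q column to StringCol(N) gives the all-strings table I am currently using, so the two variants can be compared on the same data.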
thanks
d

---
Davide Cittaro
daweonl...@gmail.com
http://sites.google.com/site/davidecittaro/

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users