Hello PyTables world! I am a Python open source hacker and programmer. I need to store file metadata for several hundred terabytes (billions of files), and I am considering PyTables. Postgres is making the job too hard.
The metadata themselves are for files and directories, and represent a few terabytes for now, growing to the low tens of terabytes in the long run. They are inherently hierarchical, in the sense that directory-level metadata apply down to all child files/dirs unless overridden at a sub-level.

The metadata are characterized by a reasonably high level of redundancy: many files share the same value for a column, and in some cases a couple of million files share the same value for a certain column/attribute. These highly duplicated columns need to be indexed for fast access (think of an IR-style inverted index, at least conceptually), and are the keys used for the look-ups/queries.

The metadata can be either single values or lists of values. Some nodes can have up to a few million values in a variable-length list. The metadata are otherwise mostly numbers of well-defined types (typically 64-bit to 512-bit numbers) with a pseudo-random distribution: the whole range of a numeric type is used.

The metadata are mostly static: they are written once in batches of several 100MB, and very rarely updated once written. The read load requires querying and possibly traversing the whole file-system-like metadata tree about 100 to 1000 times per day. The response time for such queries is not critical as long as it stays under 24 hours. The load can be spread over several (10 to 100) hosts as needed, with data possibly replicated. The querying takes care of de-duplicating retrieved records.

Is PyTables suitable for the job? Any tips? Examples of similar usage? Is the right approach to use the object tree to model the file system tree (aka filenode? http://www.pytables.org/docs/manual/ch06.html ), though the file content is not meant to be stored in PyTables, only metadata? Any tool to help with replication/distribution on several hosts?

I am not looking for complete answers right away of course, but any tips will be warmly welcomed!
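For concreteness, the inheritance and look-up semantics described above could be sketched in plain Python roughly as follows. This is only an illustration under assumptions: the paths, the column names (owner, license), and the helper names are hypothetical, and nothing here uses PyTables itself.

```python
from collections import defaultdict
from posixpath import dirname

# Hypothetical sample data: path -> metadata set explicitly at that level.
# Paths with an empty dict inherit everything from their ancestors.
explicit = {
    "/": {"owner": 1001, "license": 7},
    "/src": {"license": 9},        # overrides the root value for its subtree
    "/src/a.c": {},
    "/src/b.c": {"owner": 2002},   # overrides only the owner
    "/doc/readme": {},
}


def effective(path, explicit):
    """Merge metadata from the root down to `path`; deeper levels win."""
    chain = [path]
    while path != "/":
        path = dirname(path)
        chain.append(path)
    merged = {}
    for p in reversed(chain):      # root first, so children override parents
        merged.update(explicit.get(p, {}))
    return merged


def build_inverted_index(explicit, column):
    """Map each effective value of `column` to the set of paths carrying it."""
    index = defaultdict(set)
    for path in explicit:
        value = effective(path, explicit).get(column)
        if value is not None:
            index[value].add(path)
    return index


license_index = build_inverted_index(explicit, "license")
```

With an index like this, a query on a highly duplicated value becomes a single dictionary look-up (for example license_index[9]), and storing paths in sets gives the de-duplication of retrieved records for free. Whether this maps well onto PyTables tables and indexes is exactly what I am unsure about.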
--
Cordially
Philippe

philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com
nexB - Open by Design (tm) - http://www.nexb.com
http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep
http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com