Hello PyTables world!
I am a Python open source hacker and programmer.

I need to store file metadata for several hundred terabytes (billions 
of files), and I am considering PyTables. Postgres is making the job 
too hard.

The metadata describe files and directories, and amount to a few 
terabytes for now, growing to the low tens of terabytes in the long run.
They are inherently hierarchical, in the sense that directory-level 
metadata apply down to all child files/dirs unless overridden at a sub 
level.
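
To make the inheritance rule concrete, here is a minimal sketch in plain 
Python (not PyTables). The `explicit` store, its paths, and the `owner` 
and `license` attribute names are hypothetical and only illustrate the 
idea: metadata are recorded only where they are set, and a lookup merges 
them from the root down so that deeper levels override.

```python
from pathlib import PurePosixPath

# Hypothetical sparse store: only explicitly set metadata is recorded per path.
explicit = {
    "/": {"owner": 1, "license": 10},
    "/src": {"license": 20},
    "/src/main.c": {"owner": 2},
}


def effective_metadata(path, store):
    """Merge metadata from the root down to `path`; deeper levels override."""
    p = PurePosixPath(path)
    # `parents` lists ancestors nearest-first, so reverse it to walk root-first.
    chain = list(p.parents)[::-1] + [p]
    merged = {}
    for node in chain:
        merged.update(store.get(str(node), {}))
    return merged
```

With this layout, a file that sets nothing itself simply resolves to 
whatever its closest ancestors define.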
The metadata have a reasonably high level of redundancy: several files 
share the same value for a column, and in some cases a couple of 
million files share the same value for a given column/attribute.
These highly duplicated columns need to be indexed for fast access 
(think about an IR-style inverted index at least conceptually), and are 
the keys used for the look-ups/queries.
The metadata themselves can be either single values or lists of 
values. Some nodes can have up to a few million values in a 
variable-length list.
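
Conceptually, the kind of inverted index I have in mind looks like the 
sketch below. It is plain Python with an in-memory dict, so it shows 
only the shape of the look-up, not how it would be stored at scale; the 
`records` layout and the `tag` column name are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(records, column):
    """Map each distinct value of `column` to the sorted node ids holding it."""
    index = defaultdict(list)
    for node_id, attrs in records:
        value = attrs.get(column)
        if value is None:
            continue
        # A column may hold a single value or a list of values.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in set(values):
            index[v].append(node_id)
    return {v: sorted(ids) for v, ids in index.items()}


# Hypothetical records: (node id, attributes) pairs.
records = [
    (1, {"tag": 7}),
    (2, {"tag": [7, 9]}),
    (3, {"tag": 9}),
    (4, {}),
]
index = build_inverted_index(records, "tag")
```

A query for a given value is then a single dictionary look-up returning 
every node that carries it, whether as a single value or inside a list.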

The metadata are otherwise mostly numbers of well-defined types with a 
pseudo-random distribution: the whole range of a numeric type is used 
(typically 64-bit to 512-bit numbers).
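
For the wider values, one option I am assuming (not something I have 
verified against PyTables) is to store them as fixed-width byte strings. 
A small sketch in plain Python, with hypothetical helper names:

```python
def to_fixed_bytes(value, bits=512):
    """Encode a non-negative integer as a big-endian string of bits // 8 bytes."""
    return value.to_bytes(bits // 8, byteorder="big")


def from_fixed_bytes(data):
    """Decode a big-endian byte string back into an integer."""
    return int.from_bytes(data, byteorder="big")
```

Big-endian encoding at a fixed width keeps byte-wise comparison 
consistent with numeric order for unsigned values, which matters if the 
column is to be sorted or indexed.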

The metadata are mostly static: they are written once in batches of 
several hundred MB and are very rarely updated once written.

The read load requires querying, and possibly traversing, the whole 
file-system-like metadata tree about 100 to 1,000 times per day.
The response time for such queries is not critical as long as it stays 
under 24 hours. The load can be spread across several (10 to 100) hosts 
as needed, with data possibly replicated. The querying side takes care 
of de-duplicating any duplicate records retrieved.

Is PyTables suitable for the job?
Any tips? Examples of similar usage?
Is the right approach to use the object tree to model the file system 
tree (aka filenode? http://www.pytables.org/docs/manual/ch06.html ), 
even though the file content is not meant to be stored in PyTables, 
only metadata?

Are there any tools to help with replication/distribution across 
several hosts?
I am not expecting complete answers right away, of course, but any 
tips will be warmly welcomed!


-- 
Cordially
Philippe

philippe ombredanne | 1 650 799 0949 | pombredanne at nexb.com
nexB - Open by Design (tm) - http://www.nexb.com
http://eclipse.org/atf - http://eclipse.org/soc - http://eclipse.org/vep
http://drools.org/ - http://easyeclipse.org - http://phpeclipse.com


_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
