I am building up a schema for storing a bunch of data about proteins, which
on a certain level can be modelled with quite simple tables. The problem is
that the database I am building needs to house lots of it >10TB and growing,
with one table in particular threatening to top 1TB. In the case of the
table and in the case of the overall database, the size can be expected to
grow quickly (and most of it can never be deleted).

In the past, with smaller tables, I have had success partitioning on a
64-bit crc hash that takes a more or less uniform distribution of input data
and pumps out a more-or-less uniform distribution of partitioned data with a
very small probability of collision. The hash itself is implemented as a c
add-on library, returns a BIGINT and serves as a candidate key for what for
our purposes we can call a protein record.

Now back to the big table, which relates two of these records (in a
theoretically symmetric way). Assuming I set the the table up as something

CREATE TABLE big_protein_relation_partition_dimA_dimB{
protein_id_a BIGINTEGER NOT NULL CHECK( bin_num(protein_id_a) = dimA ), ---
key (hash) from some table
protein_id_a BIGINTEGER NOT NULL CHECK( bin_num(protein_id_b) = dimB ), ---
key (hash) from some table

and do a little c bit-twiddling and define some binning mechanism on the

As near I can tell, binning out along the A and B dimensions into 256 bins,
I shouldn't be in any danger of running out of OIDs or anything like that
(despite having to deal with 2^16 tables). Theoretically, at least, I should
be able to do UNIONS along each axis (to avoid causing the analyzer too much
overhead) and use range exclusion to make my queries zip along with proper

Aside from running into a known bug with "too many triggers" when creating
gratuitous indices on these tables, I feel as it may be possible to do what
I want without breaking everything. But then again, am I taking too many
liberties with technology that maybe didn't have use cases like this one in


Jason Nerothin
Programmer/Analyst IV - Database Administration
UCLA-DOE Institute for Genomics & Proteomics
Howard Hughes Medical Institute
611 C.E. Young Drive East   | Tel: (310) 206-3907
105 Boyer Hall, Box 951570  | Fax: (310) 206-3914
Los Angeles, CA 90095. USA | Mail: [EMAIL PROTECTED]

Reply via email to