/Must/ the genome all be in one big file, or can you store them one line per table row?

On 4/12/23 12:19, Joe Carlson wrote:
I’ve certainly thought about using a different representation. A factor of 2x 
would be good, for a while anyway. For nucleotide sequence, we’d need to 
consider a 10 character alphabet (A, C, G, T, N and the lower case forms when 
representing ’soft masked’ sequence*). So it would be 2 bases/byte. (Proteins 
are not nearly so long so a straight storage is simpler.) But these would be 
bigger changes on the client side than storing in chunks so I think this is the 
way to go.

We’re working with plant genomes, which compared to human chromosomes, are 
HUGE. One chromosome of fava bean is over a gig. And pine tree is another 
monster. This, together with the fact that sequence data collection and 
assembly have improved so much in the past couple years has forced us to 
rethink a lot of our data storage assumptions.

* for those curious, especially in plants, much of sequence consists of 
repetitive element that are remnants of ancient viruses, simple repeats and the 
like. For people who want to identify particular functional components in a 
genome, they typically do not want to search against this sequence but restrict 
searching to coding regions. But the repetitive sequence is still important and 
we need to keep it.

On Apr 12, 2023, at 10:04 AM, Mark Dilger<mark.dil...@enterprisedb.com>  wrote:



On Apr 12, 2023, at 7:59 AM, Joe Carlson<jwcarl...@lbl.gov>  wrote:

The use case is genomics. Extracting substrings is common. So going to chunked 
storage makes sense.
Are you storing nucleotide sequences as text strings?  If using the simple 
4-character (A,C,G,T) alphabet, you can store four bases per byte.  If using a 
nucleotide code 16-character alphabet you can still get two bases per byte.  An 
amino acid 20-character alphabet can be stored 8 bases per 5 bytes, and so 
forth.  Such a representation might allow you to store sequences two or four 
times longer than the limit you currently hit, but then you are still at an 
impasse.  Would a factor or 2x or 4x be enough for your needs?

—
Mark Dilger
EnterpriseDB:http://www.enterprisedb.com
The Enterprise PostgreSQL Company






--
Born in Arizona, moved to Babylonia.

Reply via email to