For documents that long I would seriously consider using large objects and
refencing them with their OIDs. Text fields get put in a special location
within the database. It's similar (possibly exactly) to using large
objects. Also, you can potentially compress them to save space on write and
read. 1gb of text is a lot of text.

See https://www.postgresql.org/docs/current/largeobjects.html

Thanks,
Ben

On Wed, Apr 12, 2023, 1:20 PM Joe Carlson <jwcarl...@lbl.gov> wrote:

> I’ve certainly thought about using a different representation. A factor of
> 2x would be good, for a while anyway. For nucleotide sequence, we’d need to
> consider a 10 character alphabet (A, C, G, T, N and the lower case forms
> when representing ’soft masked’ sequence*). So it would be 2 bases/byte.
> (Proteins are not nearly so long so a straight storage is simpler.) But
> these would be bigger changes on the client side than storing in chunks so
> I think this is the way to go.
>
> We’re working with plant genomes, which compared to human chromosomes, are
> HUGE. One chromosome of fava bean is over a gig. And pine tree is another
> monster. This, together with the fact that sequence data collection and
> assembly have improved so much in the past couple years has forced us to
> rethink a lot of our data storage assumptions.
>
> * for those curious, especially in plants, much of sequence consists of
> repetitive element that are remnants of ancient viruses, simple repeats and
> the like. For people who want to identify particular functional components
> in a genome, they typically do not want to search against this sequence but
> restrict searching to coding regions. But the repetitive sequence is still
> important and we need to keep it.
>
> > On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark.dil...@enterprisedb.com>
> wrote:
> >
> >
> >
> >> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarl...@lbl.gov> wrote:
> >>
> >> The use case is genomics. Extracting substrings is common. So going to
> chunked storage makes sense.
> >
> > Are you storing nucleotide sequences as text strings?  If using the
> simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.
> If using a nucleotide code 16-character alphabet you can still get two
> bases per byte.  An amino acid 20-character alphabet can be stored 8 bases
> per 5 bytes, and so forth.  Such a representation might allow you to store
> sequences two or four times longer than the limit you currently hit, but
> then you are still at an impasse.  Would a factor or 2x or 4x be enough for
> your needs?
> >
> > —
> > Mark Dilger
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> >
> >
>
>
>
>

Reply via email to