Ernesto,
I agree, I think pytables would be a good solution for storing his full
(gapped) multiple sequence alignment.
Regarding your problem, hopefully others with more experience can help with
optimization. Personally, I've had trouble in similar situations, where I
had large amounts of variable-length data that had to be added
incrementally or, even worse, updated afterward.
Is Brent's suggestion helpful, though? Since it seems like you lose the
ordering of the sequences within the short reads stored as a VLArray anyway,
could you just store counts of nucleotides per position? Or will you be
using this table to build scaffolds (linked sequences) over the short
reads? In that case you must be relying on the start and stop genomic
positions stored elsewhere.
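To make the counts idea concrete, here is a minimal sketch in plain Python
(the read data and reference length are made up for illustration): instead of
keeping every aligned base per position, keep only how many of each
nucleotide were observed at each reference position.

```python
from collections import Counter

# Hypothetical short reads, each paired with its alignment start
# on the reference sequence.
reads = [("ACGT", 0), ("CGTA", 1), ("GTAC", 2)]

ref_len = 6
# One Counter per reference position: base -> observed count.
pileup = [Counter() for _ in range(ref_len)]

for seq, start in reads:
    for offset, base in enumerate(seq):
        pileup[start + offset][base] += 1

# pileup[2] now records that all three reads put a 'G' at position 2.
```

The storage cost is then bounded by the reference length (four small
integers per position), no matter how many short reads are piled on top.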
Rich
On Dec 13, 2009 8:41am, Ernesto <e.pica...@unical.it> wrote:
Hi Richard,
>This looks, in part, like a way to store a huge multiple sequence
>alignment with a reference sequence (the first character in ( ) being the
>DNA base in a reference DNA molecule), but due to the unequal lengths of
>each VLA, it would seem that gaps are not stored, or are stored elsewhere
>in some way, which would be necessary to reconstruct the alignment.
>Is that right?
Yes, I'm searching for a way to store a big multiple alignment. However,
the alignment has been generated between a genomic sequence (the DNA string
in my example) and a huge number of short reads (from next-generation
sequencing).
>I'm curious because efficient retrieval of such multiple sequence
>alignments is an issue for a colleague. I think he eventually stored each
>base of ~50 full genomes (10^6 bases) as a separate MySQL row with an
>index position. I thought it would never work due to overhead, but it
>seems fast enough for his purposes (selecting arbitrary alignments of
>several kbp for web display).
For the multiple alignment you are describing, I think pytables could be
very useful. In fact, you know a priori the number of genomes and the
length of the alignment, so you can build a table storing, position by
position, all the nucleotides of a column.
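Something along these lines, as a rough sketch (the column layout, file
name, and three-genome toy data are assumptions, not your colleague's
actual schema):

```python
import os
import tempfile
import tables

N_GENOMES = 3  # known a priori in this scenario (~50 in the real case)

class AlignmentColumn(tables.IsDescription):
    # One row per alignment position.
    pos = tables.UInt32Col()
    bases = tables.StringCol(N_GENOMES)  # one base (or gap) per genome

path = os.path.join(tempfile.mkdtemp(), "alignment.h5")
h5 = tables.open_file(path, mode="w")
tbl = h5.create_table("/", "alignment", AlignmentColumn,
                      "per-position bases across all genomes")

row = tbl.row
for pos, column in enumerate([b"ACA", b"C-C", b"GGG"]):
    row["pos"] = pos
    row["bases"] = column
    row.append()
tbl.flush()

# Random access by position is then a plain row lookup:
hit = tbl[1]["bases"]
n_positions = tbl.nrows
h5.close()
```

Because every row has a fixed width, slicing out an arbitrary region of a
few kbp is just a contiguous read, which is where a table like this should
beat per-base database rows.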
In my case I don't know the depth of each column a priori: each column
could contain a variable number of bases. Moreover, I need a quick method
to store this information, since the amount of short-read data can be very
large (8-9 GB or more).
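For the variable-depth case, one possible sketch is a VLArray with one
variable-length entry per reference position (file name and toy depths are
illustrative only):

```python
import os
import tempfile
import tables

path = os.path.join(tempfile.mkdtemp(), "coverage.h5")
h5 = tables.open_file(path, mode="w")

# One variable-length string per reference position; the depth
# (number of stacked bases) differs from column to column.
vla = h5.create_vlarray("/", "columns", tables.VLStringAtom(),
                        "bases observed at each reference position")

vla.append(b"A")    # position 0: depth 1
vla.append(b"CCG")  # position 1: depth 3
vla.append(b"GG")   # position 2: depth 2

depths = [len(vla[i]) for i in range(vla.nrows)]
h5.close()
```

Appends are cheap, but note that a VLArray only supports appending at the
end and row-at-a-time reads, so it fits a write-once pileup better than
data that must be updated in place.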
Ernesto
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users