Ernesto,

I agree; I think PyTables would be a good solution for his full (gapped) multiple sequence alignment.

Regarding your problem, hopefully others with more experience can help with optimization. Personally, I've had trouble in similar situations, where large amounts of variable-length data had to be added incrementally or, even worse, updated afterward.

Is Brent's suggestion helpful, though? Since it seems you lose the ordering of the sequences in the short reads when you store them as VLAs anyway, can you just store counts of nucleotides? Or will you be using this table to try to build scaffolds (linked sequences) over the short reads? In that case you must be relying on the start and stop genomic positions stored elsewhere.
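
To make the counts idea concrete, here is a rough sketch in plain Python. The read list, the reference length, and the names `reads`, `counts` and `count_rows` are made up for illustration; they are not from your data or from Brent's message. Each position ends up with a fixed-width row of four counts (A, C, G, T), which is a shape that a fixed-size table or array can hold without any variable-length storage.

```python
from collections import Counter

# Hypothetical aligned short reads: (0-based genomic start, read sequence).
reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC")]
reference_length = 8  # assumed length of the reference sequence

# One Counter per reference position, tallying the bases that cover it.
counts = [Counter() for _ in range(reference_length)]
for start, seq in reads:
    for offset, base in enumerate(seq):
        counts[start + offset][base] += 1

# Flatten to one fixed-width row per position, in the order A, C, G, T.
count_rows = [[c[base] for base in "ACGT"] for c in counts]
```

The cost is the one mentioned above: once only counts are kept, which read contributed which base is gone, so this only works if the read order is not needed later.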
Rich

On Dec 13, 2009 8:41am, Ernesto <e.pica...@unical.it> wrote:


Hi Richard,



>This looks, in part, a way to store a huge multiple sequence alignment with
>a reference sequence (the first character in ( ) being the DNA base in a
>reference DNA molecule, but due to the inequal lengths in each VLA, it would
>seem that gaps are not stored, or stored elsewhere in some way, which would
>be necessary to reconstruct the alignment.
>Is that right?



Yes, I'm looking for a way to store a big multiple alignment. However, the alignment was generated between a genomic sequence (the DNA string in my example) and a huge number of short reads (from next-generation sequencing).



>I'm curious because efficient retrieval of such multiple sequence alignments
>is an issue for a colleague. I think he eventually stored each base of ~50
>full genomes (10^6 bases) as a separate mysql row with an index position. I
>thought it would never work due to overhead but seems fast enough for his
>purposes (selecting arbitrary alignments of several Kbp for web display).



For the multiple alignment that you are describing, I think PyTables could be very useful. You know a priori the number of genomes and the length of the alignment, so you can build a table that stores, position by position, all the nucleotides of a column.
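
As a minimal sketch of such a table (not a tuned design), the code below stores one row per alignment position, with a fixed-width string holding one character per genome. It uses the PyTables 3.x names (`open_file`, `create_table`); older releases spell these `openFile` and `createTable`. The three toy sequences, the `AlignmentColumn` description, and the file name are assumptions made up for the example.

```python
import os
import tempfile

import tables

# Hypothetical gapped alignment: one equal-length sequence per genome.
aligned = [b"ACG-T", b"AC-GT", b"ACGGT"]
N_GENOMES = len(aligned)


class AlignmentColumn(tables.IsDescription):
    """One alignment column: its position and one base per genome."""
    position = tables.Int64Col(pos=0)
    bases = tables.StringCol(N_GENOMES, pos=1)  # '-' marks a gap


path = os.path.join(tempfile.mkdtemp(), "alignment.h5")

with tables.open_file(path, mode="w") as h5:
    table = h5.create_table(h5.root, "alignment", AlignmentColumn)
    row = table.row
    for i in range(len(aligned[0])):
        row["position"] = i
        # Collect the i-th character of every genome into one fixed-width field.
        row["bases"] = bytes(seq[i] for seq in aligned)
        row.append()
    table.flush()

with tables.open_file(path, mode="r") as h5:
    # Fetch an arbitrary window of columns (positions 2 and 3 here).
    window = h5.root.alignment.read(start=2, stop=4)
    window_bases = window["bases"].tolist()
```

Because every row has the same width, a window of several Kbp is just a contiguous slice of rows.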

In my case I don't know a priori the depth at each position: each column could contain a variable number of bases.

Moreover, I need a fast way to store this information, since the volume of short reads could be very large (> 8-9 GB).
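
For reference, the variable-depth case can be sketched with a PyTables `VLArray`, where each row holds the bases covering one reference position. This is only an illustration under assumptions: the toy `pileup` list and the file name are invented, the API names are those of PyTables 3.x (`open_file`, `create_vlarray`), and nothing here addresses the write speed for very large inputs.

```python
import os
import tempfile

import tables

# Hypothetical pileup: the read bases covering each reference position.
# Depth differs from one position to the next.
pileup = [b"AAAT", b"C", b"CC", b"GGGGGA"]

path = os.path.join(tempfile.mkdtemp(), "pileup.h5")

with tables.open_file(path, mode="w") as h5:
    # Each VLArray row is a variable-length run of unsigned 8-bit values.
    columns = h5.create_vlarray(
        h5.root, "columns", tables.UInt8Atom(),
        title="Read bases per reference position",
    )
    for bases in pileup:
        # list(bytes) yields the ASCII codes of the bases.
        columns.append(list(bases))

with tables.open_file(path, mode="r") as h5:
    columns = h5.root.columns
    depths = [len(row) for row in columns]  # coverage at each position
    third = columns[2].tobytes()            # bases at position 2
```

Rows are appended one reference position at a time, so this layout assumes the bases for a position are already collected before writing; updating a row later with a different length is not something this sketch handles.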



Ernesto





_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
