I'll join in the guessing game. This looks, in part, like a way to store a huge multiple sequence alignment against a reference sequence (the first character in ( ) being the DNA base in a reference DNA molecule). But given the unequal lengths of the VLA rows, it would seem that gaps are not stored, or are stored elsewhere in some way; they would be necessary to reconstruct the alignment. Is that right?
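To make my guess concrete, here is a minimal sketch in plain Python of the structure as I read it: each 1-based reference position keeps its reference base plus a variable-length string of observed bases. The toy sequence, the observation list, and the name build_pileup are all made up for illustration; this is not Ernesto's actual code or table layout. Each ragged row here is what I assume would become one variable-length row in the PyTables file.

```python
from collections import defaultdict

# Hypothetical toy data: a short reference and (position, base) observations
# as they might be read from the second file, in arbitrary order.
reference = "AGC"
observations = [(2, "G"), (3, "C"), (1, "G"), (1, "A"), (2, "C")]


def build_pileup(reference, observations):
    """Map each 1-based position to (reference base, observed bases)."""
    observed = defaultdict(list)
    for pos, base in observations:
        observed[pos].append(base)
    # Rows are ragged: each position collects however many bases were seen.
    return {
        pos: (ref_base, "".join(observed[pos]))
        for pos, ref_base in enumerate(reference, start=1)
    }


pileup = build_pileup(reference, observations)
for pos, (ref_base, bases) in pileup.items():
    print(f"({pos}, {ref_base}) --> {bases}")
```

Note that nothing in a row records which read each observed base came from, and a read that skips a position leaves no placeholder. That is why I suspect the alignment itself cannot be rebuilt from this structure alone.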
I'm curious because efficient retrieval of such multiple sequence alignments is an issue for a colleague. I think he eventually stored each base of ~50 full genomes (10^6 bases) as a separate MySQL row with an index position. I thought it would never work because of the overhead, but it seems fast enough for his purposes (selecting arbitrary alignments of several kbp for web display).

On Sat, Dec 12, 2009 at 11:35 AM, Faisal Moledina <faisal.moled...@gmail.com> wrote:
> I'm going to stab at understanding your problem. Correct me where I'm
> wrong.
>
> On Sat, Dec 12, 2009 at 12:45 PM, Ernesto <e.pica...@unical.it> wrote:
> > As I wrote I start with an input file. It contains a string of
> > variable length (10e7-10e8). This string consists of four different
> > characters (A,C,G,T), the bases of a DNA molecule.
> > The format of the input file is:
> >
> > >scaffold_0
> > AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> > AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> > AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> > ... until 10e8 characters
> >
> > Each character or base can be associated to a specific position. The
> > first A has position 1, the second G 2 and so on.
> >
> > Using pytables I can store all characters base by base in a structure
> > like the following:
> >
> > (1, A)
> > (2, G)
> > ... and so on
>
> Continuing with "…and so on", does this mean that C, A, G, T in the
> above sequence get stored as 3, 4, 5, 6 or as 3, 1, 2, 4? That is, does
> "position" literally mean the position in the DNA sequence string, or
> are you counting how many of each base you have?
>
> > Then I have a second file in which there are other strings and related
> > positions. Reading this file, I have to update the table according to
> > the position.
> > For example I read that at position 2 I have another G, at position
> > 3 a C, at position 1 a G. According to the position I can associate:
> >
> > (1, A) --> G
> > (2, G) --> G
> > (3, C) --> C
> >
> > I can read the same position more than one time, a variable number of times.
> >
> > (1, A) --> GGGGAAAAAAAAAAA
> > (2, G) --> GGGGGGCGGG
> > (3, C) --> CCCCC
> >
>
> Again, I'm confused by the position. Are you trying to match up bases
> together (doesn't look like it) or match up positions in each file?
> And if it's the latter, where does the variable length come from given
> that each file of 1e8 bp should have at least positions 1-3, no?
>
> What is the sequence contained in the second file? It's hard to follow
> how the bases get assigned to these positions without it. If possible,
> can you provide a few sequences that are around 10-15 bp in length and
> work through a full example of what you would like your tables and
> vlarrays to look like in the end? Hopefully that will help us sort it
> out.
>
> Faisal
>
> ------------------------------------------------------------------------------
> Return on Information:
> Google Enterprise Search pays you back
> Get the facts.
> http://p.sf.net/sfu/google-dev2dev
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users