Re: [Pytables-users] using vlarray

Ernesto Sun, 13 Dec 2009 01:12:53 -0800

I'm going to stab at understanding your problem. Correct me where I'm wrong.


On Sat, Dec 12, 2009 at 12:45 PM, Ernesto <[email protected]> wrote:

As I wrote I start with an input file. It contains a string of
variable length (10e7-10e8). This string consists of four different
characters (A,C,G,T), the bases of a DNA molecule.
The format of the input file is:


 >scaffold_0
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
... until 10e8 characters

Each character or base can be associated to a specific position. The
first A has position 1, the second G 2 and so on.

Using pytables I can store all characters base by base in a structure
like the following:

(1, A)
(2, G)
... and so on


Continuing with "…and so on", does this mean that C, A, G, T in the
above sequence get stored as 3, 4, 5, 6 or as 3, 1, 2, 4? That is, does
"position" literally mean the position in the DNA sequence string, or
are you counting how many of each base you have?

Each character has a specific position, that is, the position in the DNA
sequence string.

Continuing the above structure I should have:

(3, C)
(4, A)
(5, G)
(6, T)

until the last character of the string. In a complete human genome we
have more than 3x10e9 bases (characters). I read one chromosome at a
time and thus the size is in the range 10e7-10e8.
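
As a minimal sketch of the (position, base) layout described above: the
snippet below joins the sequence lines that follow a ">" header and numbers
the bases from 1. The function name read_scaffold and the in-memory sample
are assumptions for illustration, not the attached code, and the sketch
assumes one scaffold per input.

```python
def read_scaffold(lines):
    """Return (name, sequence) for a single '>'-headed record."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]        # header line, e.g. ">scaffold_0"
        else:
            chunks.append(line)    # sequence line
    return name, "".join(chunks)


# Hypothetical sample mirroring the input format shown earlier.
sample = [
    " >scaffold_0",
    "AGCAGTGACAGATGACAGATGACAGATGACAGTGAC",
    "AGCAGTGACAGATGACAGATGACAGATGACAGTGAC",
]
name, seq = read_scaffold(sample)

# 1-based positions, as in (1, A), (2, G), ...
rows = list(enumerate(seq, start=1))
print(name, rows[:6])
```

For a whole chromosome the pairs would be streamed rather than collected in
a list; the list is only there to keep the example short.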

Then I have a second file in which there are other strings and related
positions. Reading this file, I have to update the table according to
the position.
For example, I read that at position 2 I have another G, at position
3 a C, at position 1 a G. According to the position I can associate:

(1, A) --> G
(2, G) --> G
(3, C) --> C
I can read the same position more than once, a variable number of times.
(1, A) --> GGGGAAAAAAAAAAA
(2, G) --> GGGGGGCGGG
(3, C) --> CCCCC


Again, I'm confused by the position. Are you trying to match up bases
together (doesn't look like it) or match up positions in each file?
And if it's the latter, where does the variable length come from given
that each file of 1e8 bp should have at least positions 1-3, no?

What is the sequence contained in the second file? It's hard to follow
how the bases get assigned to these positions without it. If possible,
can you provide a few sequences that are around 10-15 bp in length and
work through a full example of what you would like your tables and
vlarrays to look like in the end? Hopefully that will help us sort it
out.

Faisal

Here I have to clarify the structure of the second file. This file
annotates short strings that align with the DNA string above.
Align means that there is a correspondence character by character. For
example:


dna string     AGTGACGATGACGATGACAGTGACAGTGCGTGCAGT
short string        CGATGACGATG

The short string aligns with the DNA string. The alignment starts at
position 6 of the dna string and ends at position 16.
In the second file I have a lot of such short strings that align
against the dna string (the alignment has been generated by a
specialized software).
For each string I know the start and end position of the alignment and
the sequence of the short string.
What currently happens is that different and independent short
sequences align in the same dna region. Example:


dna string     AGTGACGATGACGATGACAGTGACAGTGCGTGCAGT
short string1       CGATGACGATG
short string2        GATGACGATGA
short string3      ACGATGACGAT

What I need is to store for each dna string position the characters
associated from the short strings.
In the example above I should have (for the first ten positions of the
dna string):


1, A, []
2, G, []
3, T, []
4, G, []
5, A, [A]
6, C, [C,C]
7, G, [G,G,G]
8, A, [A,A,A]
9, T, [T,T,T]
10, G, [G,G,G]
until the end of the dna string.

The first number is the position of the character in the dna string,
the second field is the dna character, and the third is the list
of characters that map at that specific position in the dna string.
The size of the list is variable because I don't know a priori how many
nucleotides can be allocated at each position.
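
The per-position lists described above can be sketched in plain Python. The
function name pileup and the (start, end, sequence) tuples are assumptions
for illustration; the start and end values are read off the three-string
example (1-based, inclusive), and no pytables call is involved here.

```python
def pileup(dna, alignments):
    """Collect, for every dna position, the bases of the short strings
    that cover it. Each alignment is (start, end, sequence), 1-based."""
    piles = [[] for _ in dna]
    for start, end, read in alignments:
        if end - start + 1 != len(read):
            raise ValueError("alignment span and sequence length differ")
        for offset, base in enumerate(read):
            piles[start - 1 + offset].append(base)
    return piles


dna = "AGTGACGATGACGATGACAGTGACAGTGCGTGCAGT"
alignments = [
    (6, 16, "CGATGACGATG"),   # short string1
    (7, 17, "GATGACGATGA"),   # short string2
    (5, 15, "ACGATGACGAT"),   # short string3
]
piles = pileup(dna, alignments)

# First ten positions: position, dna base, mapped bases.
for pos in range(1, 11):
    print(pos, dna[pos - 1], piles[pos - 1])
```

The base column is taken directly from the dna string, and the lists stay
empty for positions that no short string covers.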

In the code I attached, I initialized a structure like the one above,
appending an empty array at each position. If I store only the position
and the corresponding dna character, pytables is very fast. When I
append an empty array per position, its speed drops dramatically.
Initializing the structure above gives me the possibility to add a
character every time it is encountered during the parsing of the
second file. The allocation of each character should be fast since I
know the position. In the example above, if I have to add an A at
position 5, I could simply update vlarray[4].
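
One possible alternative to pre-filling an empty row per position, offered
only as a sketch: gather the lists in memory for one chromosome, then
flatten them into a single string of bases plus an offsets list, so that
every variable-length row is produced once, already at its final length.
This is a different layout from the one in the attached code, the names
flatten and row are hypothetical, and whether it fits in memory for 10e8
positions is an assumption that would need checking.

```python
def flatten(piles):
    """Pack per-position lists into (values, offsets).
    The bases for 1-based position p are values[offsets[p-1]:offsets[p]]."""
    values = []
    offsets = [0]
    for pile in piles:
        values.extend(pile)
        offsets.append(len(values))
    return "".join(values), offsets


def row(values, offsets, pos):
    """Return the bases mapped at 1-based position pos."""
    return values[offsets[pos - 1]:offsets[pos]]


# Small hand-made example: positions 1..4 with 0, 1, 2 and 3 mapped bases.
piles = [[], ["A"], ["C", "C"], ["G", "G", "G"]]
values, offsets = flatten(piles)
print(values, offsets)
print(repr(row(values, offsets, 1)), row(values, offsets, 3))
```

The two flat sequences are fixed-size per element, so they could be stored
as ordinary arrays, with the offsets playing the role of the row boundaries.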


Ernesto


_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users
