Dear Francesc,

thank you for your reply. I'll try to better explain my problem using  
real examples of data and code.

As I wrote, I start with an input file. It contains a string of
variable length (10e7-10e8 characters). This string consists of four
different characters (A, C, G, T), the bases of a DNA molecule.
The format of the input file is:

 >scaffold_0
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
... until 10e8 characters

Each character (base) is associated with a specific position: the
first character, A, has position 1, the next one, G, has position 2,
and so on.
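
To make the numbering concrete, here is a minimal sketch in plain
Python (no PyTables involved) of how positions are assigned while
reading the file. The function name iter_bases is hypothetical, and
the sketch assumes that header lines start with ">" and that every
other non-empty line contains only bases.

```python
def iter_bases(lines):
    """Yield (position, base) pairs, 1-based, skipping header lines."""
    position = 1
    for line in lines:
        line = line.strip()
        # Skip empty lines and the ">scaffold_0" style header
        if not line or line.startswith(">"):
            continue
        for base in line:
            yield position, base
            position += 1


# Tiny stand-in for the real input file (assumed content)
example = [">scaffold_0", "AGCAGTGAC", "AGCAGTGAC"]
pairs = list(iter_bases(example))
print(pairs[:2])  # [(1, 'A'), (2, 'G')]
```

The position counter keeps running across line breaks, so the line
length of the input file does not matter.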

Using PyTables I can store all the characters, base by base, in a
structure like the following:

(1, A)
(2, G)
... and so on

Then I have a second file that contains other strings and their
positions. While reading this file, I have to update the table
according to the position.
For example, I read that at position 2 I have another G, at position
3 a C, and at position 1 a G. According to the position I can associate:

(1, A) --> G
(2, G) --> G
(3, C) --> C

The same position can be read more than once, a variable number of
times:

(1, A) --> GGGGAAAAAAAAAAA
(2, G) --> GGGGGGCGGG
(3, C) --> CCCCC

I cannot predict a priori the number of characters to associate with
each position.
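
The structure I am aiming for can be sketched in plain Python as
follows. This is only an illustration of the target layout, not the
PyTables solution: the name collect_observations and the sample reads
are hypothetical, and everything is kept in memory, which would
probably not scale to the real data size.

```python
from collections import defaultdict


def collect_observations(reads):
    """Group observed bases by position.

    `reads` is an iterable of (position, base) pairs in any order;
    the same position may appear any number of times.
    """
    observed = defaultdict(list)
    for position, base in reads:
        observed[position].append(base)
    # Join each list into a string such as "GGGGAAAA"
    return {pos: "".join(bases) for pos, bases in observed.items()}


# Assumed sample of the second file: (position, base) pairs
reads = [(2, "G"), (3, "C"), (1, "G"), (1, "A"), (2, "G")]
result = collect_observations(reads)
print(result[1])  # GA
```

Each position ends up with a string whose length depends only on how
many times that position was read.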

As you suggested, I tried to use a VLArray. In practice, while
generating the table I also build the VLArray, in order to
initialize the structure.
The code I tried is the following:

from tables import *
from numpy import *

class NucSeq(IsDescription):
    id = Int32Col(pos=1)         # 1-based position of the base
    gnuc = StringCol(1, pos=2)   # 1-character string: the base itself

# Open a file in "w"rite mode
fileh = openFile("table1.h5", mode="w")
root = fileh.root
# Create a new group
group = fileh.createGroup(root, "newgroup")
# Create a new table in the newgroup group
tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
                             Filters(1))
nucseq = tableNuc.row
# Variable-length array: one (initially empty) row per position
vlarray = fileh.createVLArray(root, 'vlarray', StringAtom(itemsize=1),
                              "vlarray test")
f = open("seq")
x = 1
for i in f:
    if i[0] != ">":              # skip the header line
        l = i.strip()
        for j in l:
            nucseq['id'] = x
            nucseq['gnuc'] = j
            nucseq.append()
            vlarray.append([])   # empty placeholder row for position x
            x += 1
f.close()
tableNuc.flush()
fileh.close()

If I remove the vlarray, PyTables builds the table in several
seconds. With the vlarray, the time increases and the same job takes
more than 20 hours.
In the code above I preferred to initialize the structure up front,
because then I can quickly add each character by addressing its
position.
If you need it, I can provide the "seq" file (it is 4 MB after
compression).

Thank you very much in advance for any help and suggestion.

Ernesto

PS: sorry for the late answer, but I do not receive the replies
directly. I don't know why.

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
