On Sat, Dec 12, 2009 at 9:45 AM, Ernesto <e.pica...@unical.it> wrote:
> Dear Francesc,
>
> thank you for your reply. I'll try to better explain my problem using
> real examples of data and code.
>
> As I wrote I start with an input file. It contains a string of
> variable length (10e7-10e8). This string consists of four different
> characters (A,C,G,T), the bases of a DNA molecule.
> The format of the input file is:
>
> >scaffold_0
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> ... until 10e8 characters
>
> Each character or base can be associated to a specific position. The
> first A has position 1, the second G 2 and so on.
>
> Using pytables I can store all characters base by base in a structure
> like the following:
>
> (1, A)
> (2, G)
> ... and so on

I apologize if I misunderstand the problem, but if you are just keeping
track of counts, and not order, you can use a "column" for each base,
initialized to zero. In numpy:

>>> import numpy as np
>>> a = np.zeros(1000000, dtype=[('A', int), ('C', int), ('G', int), ('T', int)])
>>> a[0]['A'] += 1
>>> a[1]['G'] += 1
>>> a[1]['A'] += 1

etc. That is easily stored in pytables.
-brentp

>
> Then I have a second file in which there are other strings and related
> positions. Reading this file, I have to update the table according to
> the position.
> For example, I read that at position 2 I have another G, at position
> 3 a C, at position 1 a G. According to the position I can associate:
>
> (1, A) --> G
> (2, G) --> G
> (3, C) --> C
>
> I can read the same position more than once, a variable number of times.
>
> (1, A) --> GGGGAAAAAAAAAAA
> (2, G) --> GGGGGGCGGG
> (3, C) --> CCCCC
>
> I cannot predict a priori the number of characters to associate to each
> position.
>
> As you suggested I tried to use a vlarray. In practice, during the
> generation of the table I also build the vlarray in order to
> initialize the structure.
> The code I tried is the following:
>
> from tables import *
> from numpy import *
>
> class NucSeq(IsDescription):
>     id = Int32Col(pos=1)  # integer
>     gnuc = StringCol(1, pos=2)  # 1-character String
>
> # Open a file in "w"rite mode
> fileh = openFile("table1.h5", mode = "w")
> root = fileh.root
> # Create a new group
> group = fileh.createGroup(root, "newgroup")
> # Create a new table in newgroup group
> tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
>                              Filters(1))
> nucseq = tableNuc.row
> vlarray = fileh.createVLArray(root, 'vlarray', StringAtom(itemsize=1),
>                               "vlarray test")
> f=open("seq")
> x=1
> for i in f:
>     if i[0]!=">":
>         l=i.strip()
>         for j in l:
>             nucseq['id']=x
>             nucseq['gnuc']=j
>             nucseq.append()
>             vlarray.append([])
>             x+=1
> f.close()
> tableNuc.flush()
> fileh.close()
>
> If I remove the vlarray, pytables can build the table in several
> seconds. Adding the vlarray, the time increases and the same job
> completes only after more than 20 hours.
> In the code above I preferred to initialize the structure because then
> I can quickly add each character by calling the specific position.
> If you need it, I could provide the "seq" file (it is 4MB after
> compression).
>
> Thank you very much in advance for any help and suggestion.
>
> Ernesto
>
> PS: sorry for the late answer but I don't receive the replies directly.
> I don't know why.
>
> ------------------------------------------------------------------------------
> Return on Information:
> Google Enterprise Search pays you back
> Get the facts.
> http://p.sf.net/sfu/google-dev2dev
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
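
For what it's worth, here is a minimal sketch of the per-position counting idea, assuming only the counts of each base matter and not the order in which they were read. It uses a plain 2-D integer array (one row per position, one column per base) rather than a structured dtype. The names `BASES`, `BASE_INDEX` and `count_observations`, as well as the tiny sample data, are made up for illustration and do not come from the original code.

```python
import numpy as np

BASES = "ACGT"
# Map each base character to its column index in the count table.
BASE_INDEX = {base: i for i, base in enumerate(BASES)}


def count_observations(n_positions, observations):
    """Return an (n_positions, 4) array of per-position base counts.

    `observations` is an iterable of (position, base) pairs, where
    positions are 1-based as in the description of the input files.
    """
    counts = np.zeros((n_positions, len(BASES)), dtype=np.uint32)
    for position, base in observations:
        # Convert the 1-based position to a 0-based row index.
        counts[position - 1, BASE_INDEX[base]] += 1
    return counts


# Hypothetical data: a 3-base reference and a few reads from the second file.
reference = "AGC"
observations = [(2, "G"), (3, "C"), (1, "G"), (1, "A"), (1, "A")]

counts = count_observations(len(reference), observations)

# Print one summary line per position: position, reference base, counts.
for pos, (ref_base, row) in enumerate(zip(reference, counts), start=1):
    summary = ", ".join(f"{b}={n}" for b, n in zip(BASES, row))
    print(pos, ref_base, summary)
```

The table has a fixed shape no matter how many times a position is seen, which is why it avoids the variable-length rows of the vlarray. The trade-off is that the order of the observed characters is lost, so this only fits if counts are enough.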