On Sat, Dec 12, 2009 at 9:45 AM, Ernesto <e.pica...@unical.it> wrote:
> Dear Francesc,
>
> thank you for your reply. I'll try to better explain my problem using
> real examples of data and code.
>
> As I wrote I start with an input file. It contains a string of
> variable length (10e7-10e8). This string consists of four different
> characters (A,C,G,T), the bases of a DNA molecule.
> The format of the input file is:
>
>  >scaffold_0
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> AGCAGTGACAGATGACAGATGACAGATGACAGTGAC
> ... until 10e8 characters
>
> Each character (base) is associated with a specific position: the
> first base, A, has position 1, the second, G, has position 2, and so on.
>
> Using pytables I can store all characters base by base in a structure
> like the following:
>
> (1, A)
> (2, G)
> ... and so on

I apologize if I misunderstand the problem, but
if you are just keeping track of counts, and not order, you can use a
"column" for each base, initialized to zero. In numpy:

>>> import numpy as np
>>> # one record per position, one integer field per base
>>> a = np.zeros(1000000, dtype=[('A', int), ('C', int), ('G', int), ('T', int)])
>>> a[0]['A'] += 1
>>> a[1]['G'] += 1
>>> a[1]['A'] += 1

and so on.
An array like that is easily stored in pytables.
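
Here is a minimal sketch of how the update step from your second file could look with counts instead of stored strings. It assumes positions are 1-based as in your examples and that only A, C, G and T should be counted. The names BASES, LOOKUP and add_observations are mine, and I use a plain 2-D integer array (one row per position, one column per base) rather than the record array above, because that lets np.add.at handle a whole batch in one call.

```python
import numpy as np

BASES = "ACGT"

# Map each base's ASCII code to its column (A=0, C=1, G=2, T=3).
# Any other character maps to -1 and is skipped.
LOOKUP = np.full(256, -1, dtype=np.int8)
for col, base in enumerate(BASES):
    LOOKUP[ord(base)] = col


def add_observations(counts, positions, bases):
    """Add one count per (1-based position, base) pair to `counts`."""
    rows = np.asarray(positions, dtype=np.int64) - 1  # 1-based -> 0-based
    codes = np.frombuffer(bases.encode("ascii"), dtype=np.uint8)
    cols = LOOKUP[codes]
    keep = cols >= 0  # drop characters that are not A, C, G or T
    # np.add.at accumulates correctly even when the same
    # (row, column) pair appears several times in one batch.
    np.add.at(counts, (rows[keep], cols[keep]), 1)


# Hypothetical toy input: 3 positions, 5 observations.
counts = np.zeros((3, 4), dtype=np.int64)
add_observations(counts, [2, 3, 1, 1, 2], "GCGAC")
```

Row i then holds the A/C/G/T counts for position i + 1. A plain array like this should be storable with fileh.createArray(root, 'counts', counts) in the API style of your script, though I have not timed it at your scale. For 10e8 positions an int64 array would be large, so a smaller integer dtype may be worth considering.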

-brentp

>
> Then I have a second file in which there are other strings and related
> positions. Reading this file, I have to update the table according to
> the position.
> For example, I read that at position 2 I have another G, at position
> 3 a C, and at position 1 a G. According to the position I can associate:
>
> (1, A) --> G
> (2, G) --> G
> (3, C) --> C
>
> I can read the same position more than once, a variable number of times.
>
> (1, A) --> GGGGAAAAAAAAAAA
> (2, G) --> GGGGGGCGGG
> (3, C) --> CCCCC
>
> I cannot predict a priori the number of characters to associate with each
> position.
>
> As you suggested, I tried to use a vlarray. In practice, while
> generating the table I also build the vlarray in order to
> initialize the structure.
> The code I tried is the following:
>
> from tables import *
> from numpy import *
>
> class NucSeq(IsDescription):
>        id = Int32Col(pos=1)        # integer
>        gnuc = StringCol(1, pos=2)   # 1-character String
>
> # Open a file in "w"rite mode
> fileh = openFile("table1.h5", mode = "w")
> root = fileh.root
> # Create a new group
> group = fileh.createGroup(root, "newgroup")
> # Create a new table in newgroup group
> tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
> Filters(1))
> nucseq = tableNuc.row
> vlarray = fileh.createVLArray(root, 'vlarray', StringAtom(itemsize=1),
> "vlarray test")
> f=open("seq")
> x=1
> for i in f:
>        if i[0]!=">":
>                l=i.strip()
>                for j in l:
>                        nucseq['id']=x
>                        nucseq['gnuc']=j
>                        nucseq.append()
>                        vlarray.append([])
>                        x+=1
> f.close()
> tableNuc.flush()
> fileh.close()
>
> If I remove the vlarray, pytables can build the table in several
> seconds. With the vlarray added, the same job takes more than
> 20 hours to complete.
> In the code above I preferred to initialize the structure up front because
> then I can quickly add each character by addressing its position.
> If needed, I can provide the "seq" file (it is 4MB after
> compression).
>
> Thank you very much in advance for any help and suggestion.
>
> Ernesto
>
> PS: sorry for the late answer, but I don't receive the replies directly.
> I don't know why.
>
> ------------------------------------------------------------------------------
> Return on Information:
> Google Enterprise Search pays you back
> Get the facts.
> http://p.sf.net/sfu/google-dev2dev
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
