Hi Chris,
Yes, it solved the memory usage problem. Still, many problems remain to be
solved.
==> This is my script
import os
import pygr.Data
from pygr import seqdb, cnestedlist

# wDir (working directory) and hg18 (genome sequence db) are defined elsewhere
myslices = pygr.Data.Collection(
    filename=os.path.join(wDir, 'gccontents', 'reference_gc.cdb'),
    intKeys=True, mode='c', writeback=False)
mydb = seqdb.AnnotationDB(
    myslices, hg18,
    sliceAttrDict=dict(id=0, gc_id=1, orientation=2, start=3,
                       stop=4, gc_content=5))
msa = cnestedlist.NLMSA(
    os.path.join(wDir, 'gccontents', 'reference_gc'), 'w',
    pairwiseMode=True, bidirectional=False)
==> This is the results
-rw-rw-rw- 1 deepreds NIS 4771545088 2008-09-14 15:47 reference_gc0.build
-rw-rw-rw- 1 deepreds NIS 0 2008-09-13 12:44 reference_gc1.build
-rw-rw-rw- 1 deepreds NIS 30753996800 2008-09-14 15:47 reference_gc.cdb
-rw-rw-rw- 1 deepreds NIS 19864403968 2008-09-14 15:47 reference_gc.idDict
-rw-rw-rw- 1 deepreds NIS 22845374464 2008-09-14 15:47 reference_gc.seqIDdict
==> HDF5, h5py module, simple enough!
import numpy
import h5py

# Chromosome sizes, one "chrid<TAB>size" pair per line
sizedict = {}
for line in open('reference.tab', 'r'):
    chrid, chrsize = line.strip().split('\t')
    sizedict[chrid] = int(chrsize)

hdf = h5py.File('reference_gccontents.hdf5', 'w')
chrList = ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8",
           "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15",
           "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22",
           "chrX", "chrY", "chrM"]
for chrid in chrList:
    chrsize = sizedict[chrid]
    # One signed byte per position, zero where no value was computed
    myarr = numpy.zeros((chrsize,), '=i1')
    for line in open('reference_%s_36.gc' % chrid, 'r'):
        # Columns: (unused), position, GC value
        j1, chrsite, mygc = line.strip().split('\t')
        chrsite, mygc = int(chrsite), int(mygc)
        myarr[chrsite] = mygc
    hdf[chrid] = myarr  # one dataset per chromosome
print(repr(hdf))
hdf.close()
==> HDF5 version
-rw-rw-rw- 1 deepreds NIS 3080446291 2008-09-10 18:16 reference_gccontents.hdf5
-rw-rw-rw- 1 deepreds NIS 3080446291 2008-09-09 17:52 reference.hdf5
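To put the two listings side by side, here is a rough arithmetic sketch in
Python. The byte counts are copied from the listings above; treating the HDF5
file size as "about one byte per position" is an assumption, so the ratios are
only approximate.

```python
# File sizes in bytes, copied from the two directory listings.
pygr_files = {
    "reference_gc0.build": 4771545088,
    "reference_gc.cdb": 30753996800,
    "reference_gc.idDict": 19864403968,
    "reference_gc.seqIDdict": 22845374464,
}
hdf5_size = 3080446291  # reference_gccontents.hdf5

# Total disk footprint of the pygr build so far (it was still growing).
pygr_total = sum(pygr_files.values())

# Size ratios relative to the HDF5 file. If the HDF5 file holds roughly one
# byte per position (an assumption), these are also bytes per position.
ratio_cdb = pygr_files["reference_gc.cdb"] / float(hdf5_size)
ratio_total = pygr_total / float(hdf5_size)

print("pygr total: %.1f GB" % (pygr_total / 1e9))
print("Collection alone vs HDF5: %.1fx" % ratio_cdb)
print("All pygr files vs HDF5: %.1fx" % ratio_total)
```

By this estimate the Collection alone is close to ten times the HDF5 file, and
the four pygr files together are roughly 25 times larger.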
As you can see, I didn't set maxlen and maxint, but the .build file is
already 4.7GB and still growing. The Collection is ~30GB. The data type is
simple: I calculated GC content for every 36bp window on all chromosomes.
Thus, the number of annotations is about 3 billion (more than a signed 32-bit
integer can hold), and the annotations never overlap each other. What I
actually need is a 1-byte (8-bit) integer for every position. I built this
database using HDF5, and the output file is exactly the same size as the
sequences, 3GB. AFAIK pygr saves 24 bits for a single annotation, three times
larger than the HDF5 implementation.
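For reference, the per-position values can be computed roughly like this. It
is only a minimal sketch, not the script I actually ran: the function name,
the choice to store an integer percentage, and anchoring each window at its
start position are all assumptions made for illustration.

```python
def gc_percent_windows(seq, window=36):
    """Return one GC percentage (0-100) per window start position.

    Hypothetical helper: stores floor(100 * GC count / window) in one byte.
    """
    seq = seq.upper()
    n = len(seq) - window + 1
    if n <= 0:
        return bytearray()
    out = bytearray(n)  # one unsigned byte per position
    # GC count of the first window, computed directly.
    gc = sum(1 for base in seq[:window] if base in "GC")
    out[0] = gc * 100 // window
    for i in range(1, n):
        # Slide by one base: drop seq[i - 1], add seq[i + window - 1].
        gc -= seq[i - 1] in "GC"
        gc += seq[i + window - 1] in "GC"
        out[i] = gc * 100 // window
    return out


# Tiny example with a 4bp window instead of 36bp.
example = list(gc_percent_windows("GGCCAATT", window=4))
print(example)  # [100, 75, 50, 25, 0]
```

Each value fits in a single byte, which is why a dense array with one byte
per position is enough for this kind of data.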
For integer keys, what if we assign IDs according to their FSEEK position,
using 64-bit integers? In that case we don't have to build an index table:
every annotation is saved under a 64-bit integer ID that directly gives its
FSEEK position. If someone wants to use integer keys, it means they don't
care about the numbers themselves, because they are meaningless.
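To make the idea concrete, here is a minimal sketch using only the Python
standard library. This is not how pygr stores annotations; the record layout
(start, stop, gc_content) and the function names are hypothetical, chosen
only to show an ID that is the record's own file offset.

```python
import io
import struct

# Hypothetical fixed-size record: start (int64), stop (int64), gc (int8).
# "<" means little-endian with no padding, so each record is 17 bytes.
RECORD = struct.Struct("<qqb")


def append_record(fh, start, stop, gc_content):
    """Append one record and return its byte offset, which serves as its ID."""
    fh.seek(0, 2)  # 2 = seek relative to the end of the file
    offset = fh.tell()
    fh.write(RECORD.pack(start, stop, gc_content))
    return offset


def read_record(fh, offset):
    """Fetch a record directly by ID (its offset); no index table is needed."""
    fh.seek(offset)
    return RECORD.unpack(fh.read(RECORD.size))


# An in-memory buffer stands in for a file opened in binary mode.
fh = io.BytesIO()
id0 = append_record(fh, 0, 36, 41)
id1 = append_record(fh, 1, 37, 44)
print(id0, id1)               # 0 17
print(read_record(fh, id1))   # (1, 37, 44)
```

The trade-off is that the IDs are not consecutive and change if the file is
rewritten, which is acceptable only when the key values carry no meaning.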
There are a lot of cases where I need this kind of annotation database. I
hope pygr will provide a solution.
Thanks,
Namshin Kim
On Sun, Sep 14, 2008 at 2:49 AM, Christopher Lee <[EMAIL PROTECTED]> wrote:
>
> Hi Namshin,
> does the latest source code from the git repository solve your memory
> usage problem?
>
> -- Chris
>