OK. I now understand the problem. The bsddb module btree index is
screwing us over: when you simply ask for an iterator, it apparently
loads the entire index into memory. Anyway, just doing the following
causes the 30 MB increase in memory usage I mentioned above:
>>> s2 = classutil.open_shelve('R1.seqlen','r')
>>> it = iter(s2)
>>> seqID = it.next()
The memory increase happens when you ask the iterator for the first
item, and the memory isn't released until the iterator is garbage
collected.
The reason this problem was NOT present in earlier versions of Pygr
is that we used to have a function read_fasta_one_line() that just
read the first sequence line of the FASTA file. BlastDB.set_seqtype()
used that function to read a line of sequence, and then to infer whether
the sequence is protein or nucleotide.
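For anyone who doesn't remember the old approach, here is a rough
sketch of the idea. This is NOT the original Pygr code: the function
names read_first_sequence_line() and guess_seqtype(), the ACGTUN
alphabet and the 0.9 threshold are all assumptions made up for
illustration. The point is only that it touches a single line of the
FASTA file and never goes near the index.

```python
def read_first_sequence_line(lines):
    """Return the first non-blank, non-header line from an iterable of FASTA lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            return line
    return None  # no sequence data found


def guess_seqtype(seq, threshold=0.9):
    """Guess 'nucleotide' if most letters are in ACGTUN, else 'protein'.

    The alphabet and threshold are illustrative assumptions.
    """
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("no sequence letters to inspect")
    nuc = sum(1 for c in letters if c in "ACGTUN")
    # Compare without division so the result is the same on Python 2 and 3.
    if nuc >= threshold * len(letters):
        return "nucleotide"
    return "protein"


# Any iterable of lines works here, e.g. an open file object.
fasta_lines = [">seq1 example\n", "ACGTACGTNNACGT\n", "ACGT\n"]
line = read_first_sequence_line(fasta_lines)
seqtype = guess_seqtype(line)
```

The obvious drawback is the one that got the original function removed:
it only understands FASTA.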
When we made seqdb more modular (created SequenceDB class), I got rid
of read_fasta_one_line() as being too limited (only works on FASTA
format), and switched to just getting the first sequence by getting an
iterator on the sequence database. Now we discover that bsddb
iterators act more like keys() (i.e. they read the entire index into
memory) than like a true iterator... They are NOT scalable!!!!
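To make the distinction concrete, here is a toy illustration. ToyIndex
is a hypothetical stand-in, not bsddb and not Pygr code; it just counts
how many index entries get read before the first item comes back. The
"eager" version behaves the way bsddb iterators appear to behave, and
the "lazy" version is what we actually want from an iterator.

```python
class ToyIndex(object):
    """Hypothetical stand-in for an on-disk index; counts entries read."""

    def __init__(self, keys):
        self._keys = list(keys)
        self.reads = 0

    def _read_entry(self, i):
        self.reads += 1  # pretend this is one read from disk
        return self._keys[i]

    def eager_iter(self):
        # keys()-like: every entry is read before the first one is returned
        loaded = [self._read_entry(i) for i in range(len(self._keys))]
        return iter(loaded)

    def lazy_iter(self):
        # true iterator: one entry is read per next() call
        for i in range(len(self._keys)):
            yield self._read_entry(i)


index = ToyIndex("seq%d" % i for i in range(100000))

first_eager = next(index.eager_iter())
eager_reads = index.reads   # every entry was touched

index.reads = 0
first_lazy = next(index.lazy_iter())
lazy_reads = index.reads    # only the first entry was touched
```

Both calls return the same first key, but the eager version pays for
the whole index just to get it, which is exactly the cost we are
seeing in set_seqtype().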
I think the writing is on the wall: we need to get rid of all
dependencies on bsddb. It's just not scalable. Since bsddb is
deprecated as of Python 2.6 and has been removed from the Python
Standard Library in Python 3, I guess we need to do this NOW rather
than later.
Titus, what do you think?