OK.  I now understand the problem.  The bsddb module btree index is  
screwing us over: when you simply ask for an iterator, it apparently  
loads the entire index into memory.  Anyway, just doing the following  
causes the 30 MB increase in memory usage I mentioned above:

 >>> s2 = classutil.open_shelve('R1.seqlen','r')
 >>> it = iter(s2)
 >>> seqID = it.next()

The memory increase happens when you ask the iterator for the first  
item, and the memory isn't released until the iterator is garbage  
collected.

The reason this problem was NOT present in earlier versions of Pygr,  
is that we used to have a function read_fasta_one_line() that just  
read the first sequence line of the FASTA file.  BlastDB.set_seqtype()  
used that function to read a line of sequence, and then to infer whether  
the sequence is protein or nucleotide.
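
For reference, the old approach amounted to something like the sketch below. This is a reconstruction from memory of the idea, not the original Pygr code: the body of read_fasta_one_line(), the helper guess_seqtype(), and the 0.9 threshold are all assumptions made for illustration.

```python
import io

def read_fasta_one_line(ifile):
    """Return the first sequence line of a FASTA stream, or None if absent."""
    seen_header = False
    for line in ifile:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith('>'):
            if seen_header:
                return None           # first record had no sequence lines
            seen_header = True
        elif seen_header:
            return line               # first line of sequence after the header
    return None

def guess_seqtype(seq, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the fraction of ACGTUN letters.

    The threshold is an illustrative assumption, not a Pygr constant.
    """
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError('no sequence letters to inspect')
    n_nuc = sum(1 for c in letters if c in 'ACGTUN')
    if n_nuc >= threshold * len(letters):
        return 'nucleotide'
    return 'protein'

# Usage: only the first record of the file is ever touched.
fasta = io.StringIO(u">seq1 example\nACGTACGTNN\nACGT\n>seq2\nMKV\n")
first_line = read_fasta_one_line(fasta)
seqtype = guess_seqtype(first_line)
```

The point is that this reads a couple of lines from the file and never touches the index at all, which is why the memory problem never showed up.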

When we made seqdb more modular (created SequenceDB class), I got rid  
of read_fasta_one_line() as being too limited (only works on FASTA  
format), and switched to just getting the first sequence by getting an  
iterator on the sequence database.  Now we discover that bsddb  
iterators act more like keys() (i.e. they read the entire index into  
memory) than like iterators.  They are NOT scalable!
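
To make the distinction concrete, here is a toy illustration of the difference between a keys()-like iterator and a truly lazy one. ToyIndex is a hypothetical stand-in that just counts reads; it is not bsddb and says nothing about how bsddb is implemented internally.

```python
class ToyIndex(object):
    """Hypothetical stand-in for an on-disk index; counts keys read."""

    def __init__(self, n):
        self.n = n
        self.reads = 0

    def _read_key(self, i):
        self.reads += 1               # pretend this is one index read
        return 'seq%d' % i

    def eager_iter(self):
        # keys()-like: materialize every key before handing out the first
        keys = [self._read_key(i) for i in range(self.n)]
        return iter(keys)

    def lazy_iter(self):
        # generator: reads one key per item actually requested
        for i in range(self.n):
            yield self._read_key(i)

def first_key(iterator):
    """Return the first item of an iterator, as set_seqtype() would need."""
    for key in iterator:
        return key
    raise KeyError('empty index')

eager = ToyIndex(1000)
first_key(eager.eager_iter())         # eager.reads is now 1000

lazy = ToyIndex(1000)
first_key(lazy.lazy_iter())           # lazy.reads is now 1
```

Asking for one item costs the whole index in the first case and a single read in the second. What we are seeing from bsddb looks like the first case.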

I think the writing is on the wall: we need to get rid of all  
dependencies on bsddb.  It's just not scalable.  Since bsddb is  
deprecated as of Python 2.6 and removed from the Python Standard  
Library in 3.0, I guess we need to do this NOW rather than later.

Titus, what do you think?
