[pygr] Re: First pass seqdb review FINISHED

Christopher Lee Wed, 18 Mar 2009 10:51:42 -0700

Thanks for all your detailed analysis, Titus!  Comments below...

On Mar 14, 2009, at 6:14 PM, C. Titus Brown wrote:


> I didn't find any more bugs -- darn it ;).

Just because you're not paranoid doesn't mean they aren't out to ...  
whatever.
>
>
> These changes increased the code coverage by about 20% to 85%, which  
> is
> getting pretty respectable!

I'm sure Oscar Wilde said something witty and cutting about  
respectability, but it has slipped my mind.

>
> The class names in this file still confuse me.  Do we have any rules
> on how to build names for classes in inheritance hierarcy?  e.g.

I believe Steve McConnell's books (e.g. Code Complete) cover these  
naming convention issues.  We could look to see what he says.

>  FileDBSequence and SequenceFileDB -- which one is the sequence db,
>  which one is the sequence??

I guess I'm just following English, e.g. "sequence database" is a  
database; vs. "database sequence" is a sequence.  If someone has a  
better convention, please describe it.

>
>
> The invert behavior is confusing (__invert__ points to
> classutil.standard_invert which then redirects to self._inverse,
> creating it from self._inverseClass if necessary).  Can we use
> descriptors or something "simpler" here?

Please elaborate further; I don't understand what you're suggesting.   
The Python standard defines __invert__ as the invert operator method  
name; I follow that standard.  How can a descriptor help us here?

>
>
> Have we decided anything about logging, as Istvan established in his
> branch?  I'd like to get rid of explicit writes to sys.stderr, in
> particular.

Seems like a worthy goal.  Has anyone entered an issue for this in the  
tracker?  I'm not aware that anyone has explicitly proposed this for  
0.8.

>
>
> I'd like to establish some documentation or tests on exactly what
> db.seqInfoDict should do.  So far I know that it should contain info
> objects with a length attribute, for NLMSAs to use; and an id and
> sequence, presumably.  What else?  Ultimately it'd be good to
> establish a simple executable test for "proper" seqInfoDict behavior.

Hmm... I certainly don't envision *requiring* a "sequence" attribute.   
I made seqInfoDict a standard mechanism for getting information about  
a specific sequence ID without having to do the full set of operations  
for instantiating a sequence object (which, depending on the backend  
implementation, might involve actually retrieving the sequence, and  
thus could be slow).  The only seqInfoDict value attribute currently  
required in Pygr is the length.  However you and Alex Nolley pointed  
out the value of being able to associate arbitrary attributes with  
sequence objects (initially, the FASTA description line), so I  
formulated seqInfoDict as a general-purpose mechanism for such  
attributes.

I suggest that the list of attributes required on all seqInfoDict  
values be minimal -- only what Pygr core code simply can't live  
without.  (e.g. in the case of NLMSA, it must know the length in order  
to pack the coordinate system).  Of course, users' code can look for  
other seqInfoDict attributes if they need to.  We should just define  
consistent names, so that different backends will use the same  
attribute name for a given kind of data (e.g. free-form text  
description like the FASTA description line could be called  
"description").  But unless there is a compelling reason why Pygr's  
core code needs the "description", I don't think we should add a test  
that asserts the absence of that attribute is a bug.

So the executable test you're proposing only needs to look for the  
"length" attribute.

>
>
> Specific issues
> ---------------
>
> I don't like some aspects of the 'ifile' handling in SequenceFileDB.
> In particular, I dislike the part where it re-opens the file to
> re-read it and then closes 'ifile'; this may not be expected behavior
> ('ifile' itself is only used to grab the filename!) and I don't like
> closing user-passed-in file handles.  Plus, pulling the filename out
> of a file handle disallows StringIO and other "file-like" handles such
> as sockets...  Is the ifile mechanism actually used anywhere?

This was used by some coordinator module functionality that is made  
obsolete by the much more powerful capabilities of pygr.Data.  We can  
certainly get rid of this ifile mechanism.

>
>
> The 'set_seqtype' function on SequenceDB depends on things being a  
> FASTA file
> instead of just iterating.  Why??  Can we remove that?

The reason this exists is historical.  At one point Namshin reported  
that opening a sequence database had become very slow, and it was due  
to the set_seqtype() method invoking the iterator instead of our  
original read_fasta_one_line(), which I had removed during our seqdb  
refactoring.  The seqdb iterator invoked the Python shelve __iter__(),  
which turned out to load the entire list of keys into memory before  
returning the iterator; very slow for solexa data!  To get a fix to  
Namshin quickly, I first restored the use of read_fasta_one_line,  
which solved the problem.  I later was able to solve the shelve  
problem; because we had previously found btree to be more scalable  
than the default hash-based shelve, we had created our own subclass of  
shelve.  So I just added a proper __iter__() method to our subclass.

Conclusion: we could again get rid of read_fasta_one_line().

>
>
> Also, 'set_seqtype' returns the sequence type as well as setting it on
> the db object.  This behavior isn't used anywhere in the code base.   
> Can
> we trash it?

Sounds fine to me.

>
>
> In fact, it seems like 'set_seqtype' is used only within SequenceDB
> and its children.  Why not rename it to be private, i.e. prepend it
> with '_'?

Fine by me.  Actually, this seems like a topic ripe for code review.   
Python is not that zealous about the whole public vs. private  
distinction, and I've accordingly been lax about this.  We could go  
through the code and designate all purely internal methods, then  
prefix them with _ .

>
>
> Should SequenceDB._cacheMax be configurable?  Right now it's  
> hardcoded.

Sure, why not.

>
>
> I don't know what idFilter does in _store_seqlen_dict, and it seems
> like it would be easier to just implement that in your own
> seqfmt-based reader, anyway.  Can we trash it?  It's not used anywhere
> in the code base.

idFilter addressed the nasty reality that one major database source  
(NCBI) abuses their ID field (treating it as a blob into which they  
can cram as many junk fields as they want... and then blastall returns  
an ID that does not match the original in the file).  idFilter allows  
you to do any transformations on the input IDs that you want.   
However, as a result of the seqdb refactoring, this may no longer be  
needed in the sequence file db class itself.  All the code for the  
NCBI blast ID munging got moved to blast.py and is handled there.  We  
could get rid of idFilter in _store_seqlen_dict.

>
>
> The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be
> created directly by a user, right?  Should we make them private by
> prepending a '_'?

Throughout pygr we allow users to pass an itemClass to a database  
constructor, so the item classes are public.  The _init_subclass  
pattern tended to modularize all the backend details in the item  
class, making the database class very generic (works with any  
itemClass).  Based on your advice, I've reversed that policy, so that  
the backend details are on the database class.  But still, I can  
easily imagine scenarios in which a user would want to subclass  
FileDBSequence, so I don't see these as private.

>
> The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and
> unused in the rest of the code base.  I'd appreciate some guidance on
> what they do, and perhaps some tests to get started.  (@CTB issue)

I don't think anyone has used these classes in at least a couple of  
years.  We could delete them.

>
>
> The SQLSequence stuff at the bottom of seqdb.py could probably go in a
> different file, like sqlgraph.py.  They don't depend on anything in
> seqdb.py.

That seems like a reasonable idea.

I'll answer the PrefixUnionDict questions in a separate email.

Yours with thanks,

Chris


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-dev@googlegroups.com
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

[pygr] Re: First pass seqdb review FINISHED

Reply via email to