[pygr] First pass seqdb review FINISHED

C. Titus Brown Sat, 14 Mar 2009 18:14:40 -0700

Hi all,

I'm done with my first pass through seqdb.py.  The results are available
on github; as before, I changed only seqdb.py and seqdb_test.py.


---

You can see seqdb.py here,

   
http://github.com/ctb/pygr/blob/6f73203f6da3d85e24a958911765c07a896ce1de/pygr/seqdb.py

and the new test diff here,

   
http://github.com/ctb/pygr/commit/f4fc1d59025859d33b79c4b160697c645215059e#diff-3

---

I didn't find any more bugs -- darn it ;).

These changes increased the code coverage by about 20% to 85%, which is
getting pretty respectable!

Review comments follow.  Comments solicited; where I suggest backwards
incompatible changes, I don't see any likely breakages within pygr
itself, but I don't know what would happen within existing software
that uses pygr, so please let me know if there's a good reason to
keep things the way they are.

(If you want to just browse the source file, look for '@CTB' comments.)

After dealing with the issues below, and maybe figuring out how to test
the caching, I'll start writing up more documentation.

General comments
----------------

The class names in this file still confuse me.  Do we have any rules
on how to build names for classes in inheritance hierarcy?  e.g.

  FileDBSequence and SequenceFileDB -- which one is the sequence db,
  which one is the sequence??

The invert behavior is confusing (__invert__ points to
classutil.standard_invert which then redirects to self._inverse,
creating it from self._inverseClass if necessary).  Can we use
descriptors or something "simpler" here?

Have we decided anything about logging, as Istvan established in his
branch?  I'd like to get rid of explicit writes to sys.stderr, in
particular.

I'd like to establish some documentation or tests on exactly what
db.seqInfoDict should do.  So far I know that it should contain info
objects with a length attribute, for NLMSAs to use; and an id and
sequence, presumably.  What else?  Ultimately it'd be good to
establish a simple executable test for "proper" seqInfoDict behavior.

Specific issues
---------------

I don't like some aspects of the 'ifile' handling in SequenceFileDB.
In particular, I dislike the part where it re-opens the file to
re-read it and then closes 'ifile'; this may not be expected behavior
('ifile' itself is only used to grab the filename!) and I don't like
closing user-passed-in file handles.  Plus, pulling the filename out
of a file handle disallows StringIO and other "file-like" handles such
as sockets...  Is the ifile mechanism actually used anywhere?

The 'set_seqtype' function on SequenceDB depends on things being a FASTA file
instead of just iterating.  Why??  Can we remove that?

Also, 'set_seqtype' returns the sequence type as well as setting it on
the db object.  This behavior isn't used anywhere in the code base.  Can
we trash it?

In fact, it seems like 'set_seqtype' is used only within SequenceDB
and its children.  Why not rename it to be private, i.e. prepend it
with '_'?

Should SequenceDB._cacheMax be configurable?  Right now it's hardcoded.

I don't know what idFilter does in _store_seqlen_dict, and it seems
like it would be easier to just implement that in your own
seqfmt-based reader, anyway.  Can we trash it?  It's not used anywhere
in the code base.

The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be
created directly by a user, right?  Should we make them private by
prepending a '_'?

SeqPrefixUnionDict is oddly named.

PrefixUnionDict could contain __iadd__ behavior (instead of
SeqPrefixUnionDict, which inherits from it solely to add __iadd__
behavior).  Why not do things this way?

There's an 'if' statement in SeqPrefixUnionDict.__iadd__ that can
never be reached, as far as I can tell.  Could someone (Chris) please
verify that for me?

In two places (SeqPrefixUnionDict.__iadd__, find @CTB untested; and
_PrefixUnionDictInverse.__getitem__, find @CTB abstraction) explicit
knowledge of annotation objects is assumed in classes that otherwise
don't know anything special about them.  Any way to deal with this
better or should I just suck up and deal & write some tests?

A few PrefixUnionDict issues:

 - PrefixUnionDict.__contains__ can take either a sequence ID or a
   sequence object.  This is inconsistent with __getitem__ (which only
   takes seq IDs).  Ugly.

 - PrefixUnionDict.__init__ takes an argument 'trypath' that is really a
   list of paths to try.  Could we rename this to 'trypaths'?  Or should
   I just suck up and deal & write a nice doc?

 - PrefixUnionDict.writeHeaderFile function (etc.) isn't used anywhere
   in the code.  How useful is it?

 - PrefixUnionDict.newMemberDict(...) and the associated
   _PrefixUnionDictMember class don't have a clear purpose to me.
   Could someone give me a concrete use case?  If not, maybe they should
   be removed or something; they're confusing!

 - _PrefixUnionDictMember.default handling could probably be written
   in a more standard way using dict.setdefault, no?  How much code depends
   on this specific behavior?

The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and
unused in the rest of the code base.  I'd appreciate some guidance on
what they do, and perhaps some tests to get started.  (@CTB issue)

The SQLSequence stuff at the bottom of seqdb.py could probably go in a
different file, like sqlgraph.py.  They don't depend on anything in
seqdb.py.

cheers,
--titus
-- 
C. Titus Brown, c...@msu.edu

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-dev@googlegroups.com
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

[pygr] First pass seqdb review FINISHED

Reply via email to