Hi all, I'm done with my first pass through seqdb.py. The results are available on github; as before, I changed only seqdb.py and seqdb_test.py.
--- You can see seqdb.py here, http://github.com/ctb/pygr/blob/6f73203f6da3d85e24a958911765c07a896ce1de/pygr/seqdb.py and the new test diff here, http://github.com/ctb/pygr/commit/f4fc1d59025859d33b79c4b160697c645215059e#diff-3 --- I didn't find any more bugs -- darn it ;). These changes increased the code coverage by about 20% to 85%, which is getting pretty respectable! Review comments follow. Comments solicited; where I suggest backwards incompatible changes, I don't see any likely breakages within pygr itself, but I don't know what would happen within existing software that uses pygr, so please let me know if there's a good reason to keep things the way they are. (If you want to just browse the source file, look for '@CTB' comments.) After dealing with the issues below, and maybe figuring out how to test the caching, I'll start writing up more documentation. General comments ---------------- The class names in this file still confuse me. Do we have any rules on how to build names for classes in inheritance hierarcy? e.g. FileDBSequence and SequenceFileDB -- which one is the sequence db, which one is the sequence?? The invert behavior is confusing (__invert__ points to classutil.standard_invert which then redirects to self._inverse, creating it from self._inverseClass if necessary). Can we use descriptors or something "simpler" here? Have we decided anything about logging, as Istvan established in his branch? I'd like to get rid of explicit writes to sys.stderr, in particular. I'd like to establish some documentation or tests on exactly what db.seqInfoDict should do. So far I know that it should contain info objects with a length attribute, for NLMSAs to use; and an id and sequence, presumably. What else? Ultimately it'd be good to establish a simple executable test for "proper" seqInfoDict behavior. Specific issues --------------- I don't like some aspects of the 'ifile' handling in SequenceFileDB. In particular, I dislike the part where it re-opens the file to re-read it and then closes 'ifile'; this may not be expected behavior ('ifile' itself is only used to grab the filename!) and I don't like closing user-passed-in file handles. Plus, pulling the filename out of a file handle disallows StringIO and other "file-like" handles such as sockets... Is the ifile mechanism actually used anywhere? The 'set_seqtype' function on SequenceDB depends on things being a FASTA file instead of just iterating. Why?? Can we remove that? Also, 'set_seqtype' returns the sequence type as well as setting it on the db object. This behavior isn't used anywhere in the code base. Can we trash it? In fact, it seems like 'set_seqtype' is used only within SequenceDB and its children. Why not rename it to be private, i.e. prepend it with '_'? Should SequenceDB._cacheMax be configurable? Right now it's hardcoded. I don't know what idFilter does in _store_seqlen_dict, and it seems like it would be easier to just implement that in your own seqfmt-based reader, anyway. Can we trash it? It's not used anywhere in the code base. The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be created directly by a user, right? Should we make them private by prepending a '_'? SeqPrefixUnionDict is oddly named. PrefixUnionDict could contain __iadd__ behavior (instead of SeqPrefixUnionDict, which inherits from it solely to add __iadd__ behavior). Why not do things this way? There's an 'if' statement in SeqPrefixUnionDict.__iadd__ that can never be reached, as far as I can tell. Could someone (Chris) please verify that for me? In two places (SeqPrefixUnionDict.__iadd__, find @CTB untested; and _PrefixUnionDictInverse.__getitem__, find @CTB abstraction) explicit knowledge of annotation objects is assumed in classes that otherwise don't know anything special about them. Any way to deal with this better or should I just suck up and deal & write some tests? A few PrefixUnionDict issues: - PrefixUnionDict.__contains__ can take either a sequence ID or a sequence object. This is inconsistent with __getitem__ (which only takes seq IDs). Ugly. - PrefixUnionDict.__init__ takes an argument 'trypath' that is really a list of paths to try. Could we rename this to 'trypaths'? Or should I just suck up and deal & write a nice doc? - PrefixUnionDict.writeHeaderFile function (etc.) isn't used anywhere in the code. How useful is it? - PrefixUnionDict.newMemberDict(...) and the associated _PrefixUnionDictMember class don't have a clear purpose to me. Could someone give me a concrete use case? If not, maybe they should be removed or something; they're confusing! - _PrefixUnionDictMember.default handling could probably be written in a more standard way using dict.setdefault, no? How much code depends on this specific behavior? The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and unused in the rest of the code base. I'd appreciate some guidance on what they do, and perhaps some tests to get started. (@CTB issue) The SQLSequence stuff at the bottom of seqdb.py could probably go in a different file, like sqlgraph.py. They don't depend on anything in seqdb.py. cheers, --titus -- C. Titus Brown, c...@msu.edu --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to pygr-dev@googlegroups.com To unsubscribe from this group, send email to pygr-dev+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en -~----------~----~----~----~------~----~------~--~---