Thanks for all your detailed analysis, Titus! Comments below... On Mar 14, 2009, at 6:14 PM, C. Titus Brown wrote:
> I didn't find any more bugs -- darn it ;). Just because you're not paranoid doesn't mean they aren't out to ... whatever. > > > These changes increased the code coverage by about 20% to 85%, which > is > getting pretty respectable! I'm sure Oscar Wilde said something witty and cutting about respectability, but it has slipped my mind. > > The class names in this file still confuse me. Do we have any rules > on how to build names for classes in inheritance hierarcy? e.g. I believe Steve McConnell's books (e.g. Code Complete) cover these naming convention issues. We could look to see what he says. > FileDBSequence and SequenceFileDB -- which one is the sequence db, > which one is the sequence?? I guess I'm just following English, e.g. "sequence database" is a database; vs. "database sequence" is a sequence. If someone has a better convention, please describe it. > > > The invert behavior is confusing (__invert__ points to > classutil.standard_invert which then redirects to self._inverse, > creating it from self._inverseClass if necessary). Can we use > descriptors or something "simpler" here? Please elaborate further; I don't understand what you're suggesting. The Python standard defines __invert__ as the invert operator method name; I follow that standard. How can a descriptor help us here? > > > Have we decided anything about logging, as Istvan established in his > branch? I'd like to get rid of explicit writes to sys.stderr, in > particular. Seems like a worthy goal. Has anyone entered an issue for this in the tracker? I'm not aware that anyone has explicitly proposed this for 0.8. > > > I'd like to establish some documentation or tests on exactly what > db.seqInfoDict should do. So far I know that it should contain info > objects with a length attribute, for NLMSAs to use; and an id and > sequence, presumably. What else? Ultimately it'd be good to > establish a simple executable test for "proper" seqInfoDict behavior. Hmm... I certainly don't envision *requiring* a "sequence" attribute. I made seqInfoDict a standard mechanism for getting information about a specific sequence ID without having to do the full set of operations for instantiating a sequence object (which, depending on the backend implementation, might involve actually retrieving the sequence, and thus could be slow). The only seqInfoDict value attribute currently required in Pygr is the length. However you and Alex Nolley pointed out the value of being able to associate arbitrary attributes with sequence objects (initially, the FASTA description line), so I formulated seqInfoDict as a general-purpose mechanism for such attributes. I suggest that the list of attributes required on all seqInfoDict values be minimal -- only what Pygr core code simply can't live without. (e.g. in the case of NLMSA, it must know the length in order to pack the coordinate system). Of course, users' code can look for other seqInfoDict attributes if they need to. We should just define consistent names, so that different backends will use the same attribute name for a given kind of data (e.g. free-form text description like the FASTA description line could be called "description"). But unless there is a compelling reason why Pygr's core code needs the "description", I don't think we should add a test that asserts the absence of that attribute is a bug. So the executable test you're proposing only needs to look for the "length" attribute. > > > Specific issues > --------------- > > I don't like some aspects of the 'ifile' handling in SequenceFileDB. > In particular, I dislike the part where it re-opens the file to > re-read it and then closes 'ifile'; this may not be expected behavior > ('ifile' itself is only used to grab the filename!) and I don't like > closing user-passed-in file handles. Plus, pulling the filename out > of a file handle disallows StringIO and other "file-like" handles such > as sockets... Is the ifile mechanism actually used anywhere? This was used by some coordinator module functionality that is made obsolete by the much more powerful capabilities of pygr.Data. We can certainly get rid of this ifile mechanism. > > > The 'set_seqtype' function on SequenceDB depends on things being a > FASTA file > instead of just iterating. Why?? Can we remove that? The reason this exists is historical. At one point Namshin reported that opening a sequence database had become very slow, and it was due to the set_seqtype() method invoking the iterator instead of our original read_fasta_one_line(), which I had removed during our seqdb refactoring. The seqdb iterator invoked the Python shelve __iter__(), which turned out to load the entire list of keys into memory before returning the iterator; very slow for solexa data! To get a fix to Namshin quickly, I first restored the use of read_fasta_one_line, which solved the problem. I later was able to solve the shelve problem; because we had previously found btree to be more scalable than the default hash-based shelve, we had created our own subclass of shelve. So I just added a proper __iter__() method to our subclass. Conclusion: we could again get rid of read_fasta_one_line(). > > > Also, 'set_seqtype' returns the sequence type as well as setting it on > the db object. This behavior isn't used anywhere in the code base. > Can > we trash it? Sounds fine to me. > > > In fact, it seems like 'set_seqtype' is used only within SequenceDB > and its children. Why not rename it to be private, i.e. prepend it > with '_'? Fine by me. Actually, this seems like a topic ripe for code review. Python is not that zealous about the whole public vs. private distinction, and I've accordingly been lax about this. We could go through the code and designate all purely internal methods, then prefix them with _ . > > > Should SequenceDB._cacheMax be configurable? Right now it's > hardcoded. Sure, why not. > > > I don't know what idFilter does in _store_seqlen_dict, and it seems > like it would be easier to just implement that in your own > seqfmt-based reader, anyway. Can we trash it? It's not used anywhere > in the code base. idFilter addressed the nasty reality that one major database source (NCBI) abuses their ID field (treating it as a blob into which they can cram as many junk fields as they want... and then blastall returns an ID that does not match the original in the file). idFilter allows you to do any transformations on the input IDs that you want. However, as a result of the seqdb refactoring, this may no longer be needed in the sequence file db class itself. All the code for the NCBI blast ID munging got moved to blast.py and is handled there. We could get rid of idFilter in _store_seqlen_dict. > > > The 'Sequence' classes in seqdb (e.g. FileDBSequence) should never be > created directly by a user, right? Should we make them private by > prepending a '_'? Throughout pygr we allow users to pass an itemClass to a database constructor, so the item classes are public. The _init_subclass pattern tended to modularize all the backend details in the item class, making the database class very generic (works with any itemClass). Based on your advice, I've reversed that policy, so that the backend details are on the database class. But still, I can easily imagine scenarios in which a user would want to subclass FileDBSequence, so I don't see these as private. > > The SliceDB, VirtualSeq, and VirtualSeqDB classes are untested and > unused in the rest of the code base. I'd appreciate some guidance on > what they do, and perhaps some tests to get started. (@CTB issue) I don't think anyone has used these classes in at least a couple of years. We could delete them. > > > The SQLSequence stuff at the bottom of seqdb.py could probably go in a > different file, like sqlgraph.py. They don't depend on anything in > seqdb.py. That seems like a reasonable idea. I'll answer the PrefixUnionDict questions in a separate email. Yours with thanks, Chris --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to pygr-dev@googlegroups.com To unsubscribe from this group, send email to pygr-dev+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en -~----------~----~----~----~------~----~------~--~---