On Thu, May 28, 2009 at 12:55:08PM -0700, Christopher Lee wrote: -> On May 27, 2009, at 9:24 PM, C. Titus Brown wrote: -> -> > First, I dislike this interface: -> > -> >>>> al = blastmap(None, queryDB=db) -> > -> > in order to distinguish the database query from the sequence query. -> -> Yup, the queryDB argument was just tacked on without any real -> thought. Making the first argument (seq) optional makes total sense. -> -> > -> > How about making it -> > -> >>>> al = blastmap[some_sequence] -> -> blastmap, like many other things in Pygr (NLMSA etc.), is a graph -> interface. i.e. it obeys the following interface: -> -> g[src][dest] = edge -> -> which simply reflects the fact it's a many-to-many mapping. Changing -> g[src] to return an alignment object (which is also a many-to-many -> mapping) would turn it into the following interface -> -> g[src][src][dest] = edge -> -> which makes it redundant in requiring you to requery with the same src -> node.
Oh, whups, of course you're right. -> >>>> al = blastmap(seq=some_sequence) -> -> Sure! -> -> >>>> -> > -> > OR -> > -> >>>> al = blastmap(queryDB=db) -> -> Sure! OK -- shall I make the appropriate changes? -> > Second, the handling of self-matches when BLASTing within a single -> > database is odd. -> > -> > a) The warning message that you get when doing -> > -> > blastmap = blast.BlastMapping(db) -> > blastmap(None, queryDB=db) -> > -> > doesn't show up when you pick a sequence out of 'db': -> > -> > seq = db['some_sequence'] -> > blastmap(seq) -> > -> > ...even though it would be consistent to have it there. -> -> I just tested this, and I get consistent behavior: -> -> >>> sp = seqdb.SequenceFileDB('data/sp_all_hbb') -> >>> s = sp['HBB1_MOUSE'] -> >>> b = blast.BlastMapping(sp) -> >>> b[s] [ ... ] -> If you queried with a sequence that was *not* part of the blast -> database, or you queried with a string, of course you won't get this -> warning. Is it possible that's what happened? Hm, I didn't recheck the behavior since Marek's blast_test branch was merged. Maybe it got updated then, or perhaps I was mistaken ;) -> > b) I think it's a bad idea to throw away self-matches anyway. If -> > it's a -> > choice between pygr internal consistency ("we don't put self-matches -> > into the NLMSA") and pygr consistency with BLAST ("BLAST returns the -> > match of seq with itself; why am I not seeing that?") I think we -> > should -> > go with the latter. [ ... ] -> I think the cognitive dissonance you are encountering in this blast -> case is that blast is actually a pairwise alignment, not an MSA. -> NLMSA has a pairwiseMode, and we do use that for BLAST; I guess it's -> conceivable that we could allow reporting of self-matches when -> pairwiseMode=True. OK, I see. Hmm. I guess the options are, - keep it as it is - allow reporting of self-matches for pairwiseMode=True - add a different keyword argument for reporting of self-matches I guess I would prefer #3, or #1, over #2 -- #3 is more explicit, #1 is document-able, while #2 changes existing behavior. -> > c) Note the duplicates in the doctest output, below: -> > -> >>>> al = blastmap(None, queryDB=db) -> >>>> for seq in db.values(): -> > ... for (src, dest, edge) in al[seq].edges(): -> > ... print repr(src), repr(dest) -> > gapped[0:40] ungapped[0:40] -> > gapped[0:40] ungapped[0:40] ## DUP [ ... munch ... ] -> Yes, as the Pygr docs discuss, the proper way of storing BLAST results -> in an NLMSA is to use bidirectional=False. I just looked at blast.py, -> and it appears we are not using that. This surprises me, since I took -> the trouble to implement that mode and document how that is the proper -> way to handle blast. I think we could use the bidirectional=False -> option to eliminate these duplicates. OK. Shall I patch? -> > Third, I think the BlastxMapping interface is not yet so useful. -> > Right -> > now it returns a list of NLMSASlices, which are hard to use without a -> > reasonably complete understanding of NLMSAs. (At least, it was tricky -> > for me to figure out.) I understand *why* this is, but I came up with -> > an alternative. -> > -> > ... -> > -> > If, however, I construct two NLMSAs, one an MSA containing the -> > translated frames mapped to the proteins, and the other an annotation -> > map containing dna mapped to translations, I can do the queries much -> > more pygr-ishly. For example, to query by protein and retrieve the -> > aligned DNA sequence, I could do: -> > -> > for src, dest, edge in trans2prot[rat_prot].edges(): -> > print repr(dna2trans[dest].keys()[0]) -> > -> > or I could query by DNA sequence, and retrieve protein matches: -> > -> > for translated_frame in dna2trans[dna_seq]: -> > edges = trans2prot[translated_frame].edges() -> > for (translated_ival, match_ival, edge) in edges: -> > print (translated_ival, match_ival) -> > -> > Again, fully functioning code to do all of this (including build the -> > NLMSAs) is attached. -> > -> > Does this seem like a reasonable interface to go with? If so, we -> > could -> > change the code to produce six translated frames per DNA sequence (-> -> > annotation objects), map the protein matches onto *those* sequences, -> > and then return both sets of mappings. -> -> This is an interesting idea. I considered something like the 6-frame -> idea, but rejected it because I thought an "ORF annotation" should -> correspond directly to how people actually annotate ORFs in bacterial -> genomes, i.e. each gene would have a separate ORF annotation. Under -> the 6-frame model, a bacterial genome would have only 6 ORF -> annotations, and each gene would just be a slice of one of those. We -> would also have to decide how to translate stop codons (each ORF -> annotation would presumably contain thousands of these) -- maybe just -> as '*'? -> -> Adding the second layer mapping is a very interesting idea -- it -> captures the idea that blastx isn't strictly a nucleotide sequence A -- -> > protein B mapping, but really nt A --> translation A' --> protein B -> mapping. Two layers, as you pointed out. The question is how to make -> the interface clean: logical and simple. OK, sounds like you like the basic idea, great! The only problem I clearly foresee is how to allow the passing-in of prebuilt NLMSAs (e.g. for people who want to save the results to disk)... any thoughts? Do you want to work on a prototype, or shall I? I should be able to rough something out over the next few days. On the other hand I do enjoy critiquing your code ;) ;). cheers, --titus -- C. Titus Brown, c...@msu.edu --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to pygr-dev@googlegroups.com To unsubscribe from this group, send email to pygr-dev+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en -~----------~----~----~----~------~----~------~--~---