Chris et al., ok, some more BLAST comments. (Sorry for the e-mail deluge -- when working with a new section of the pygr code, I kind of go into a fugue state where I just note down new ideas for later expansion.)
-- First, I dislike this interface: >>> al = blastmap(None, queryDB=db) in order to distinguish the database query from the sequence query. How about making it >>> al = blastmap[some_sequence] OR >>> al = blastmap(seq=some_sequence) OR >>> al = blastmap(queryDB=db) with the first two returning the same thing? I find the conflation of __getitem__ and __call__ behavior a bit odd but not unpleasant ;). On the other hand I dislike *having* to specify a parameter as 'None' in order to get it to pay attention to a different parameter. -- Second, the handling of self-matches when BLASTing within a single database is odd. a) The warning message that you get when doing blastmap = blast.BlastMapping(db) blastmap(None, queryDB=db) doesn't show up when you pick a sequence out of 'db': seq = db['some_sequence'] blastmap(seq) ...even though it would be consistent to have it there. b) I think it's a bad idea to throw away self-matches anyway. If it's a choice between pygr internal consistency ("we don't put self-matches into the NLMSA") and pygr consistency with BLAST ("BLAST returns the match of seq with itself; why am I not seeing that?") I think we should go with the latter. c) Note the duplicates in the doctest output, below: >>> al = blastmap(None, queryDB=db) >>> for seq in db.values(): ... for (src, dest, edge) in al[seq].edges(): ... print repr(src), repr(dest) gapped[0:40] ungapped[0:40] gapped[0:40] ungapped[0:40] ## DUP gapped[44:74] ungapped[40:70] gapped[44:74] ungapped[40:70] ## DUP ungapped[0:40] gapped[0:40] ungapped[0:40] gapped[0:40] ## DUP ungapped[40:70] gapped[44:74] ungapped[40:70] gapped[44:74] ## DUP Those are there because adding 'ungapped' -> 'gapped' into an NLMSA automatically adds 'gapped' -> 'ungapped', too (i.e. a->b adds b->a). Because BLAST isn't a symmetric algorithm -- a matching b doesn't necessarily mean b matches a -- this is sort of legitimate; the edge info may in fact be different between the "DUP" rows above. It's still confusing, though. However, I feel like this is inconsistent with the removal of the self-matches: either we should include everything that BLAST returns, or remove some of the obvious crap from what BLAST returns. If I apply my (b) criterion, I'd go with leaving it all in there and just writing good docs. Thoughts? -- Third, I think the BlastxMapping interface is not yet so useful. Right now it returns a list of NLMSASlices, which are hard to use without a reasonably complete understanding of NLMSAs. (At least, it was tricky for me to figure out.) I understand *why* this is, but I came up with an alternative. (See the attached 'blastfoo.py' for fully functioning code.) OK, let's set up a blastx: dna_db = seqdb.SequenceFileDB('data/hbb1_mouse.fa') dna_seq = dna_db['gi|171854975|dbj|AB364477.1|'] prot_db = seqdb.SequenceFileDB('data/sp_all_hbb') rat_prot = prot_db['HBB2_RAT'] map = blast.BlastxMapping(prot_db) Now, retrieve the matches for the given DNA sequence: dna_matches = map[dna_seq] As things are currently, dna_matches is a list of NLMSASlice objects. If I want to retrieve matches to a particular protein (rat_prot), I need to iteratively search for slices containing that protein: for slice in dna_matches: if rat_prot in slice: edges = slice.edges() print edges If, however, I construct two NLMSAs, one an MSA containing the translated frames mapped to the proteins, and the other an annotation map containing dna mapped to translations, I can do the queries much more pygr-ishly. For example, to query by protein and retrieve the aligned DNA sequence, I could do: for src, dest, edge in trans2prot[rat_prot].edges(): print repr(dna2trans[dest].keys()[0]) or I could query by DNA sequence, and retrieve protein matches: for translated_frame in dna2trans[dna_seq]: edges = trans2prot[translated_frame].edges() for (translated_ival, match_ival, edge) in edges: print (translated_ival, match_ival) Again, fully functioning code to do all of this (including build the NLMSAs) is attached. Does this seem like a reasonable interface to go with? If so, we could change the code to produce six translated frames per DNA sequence (-> annotation objects), map the protein matches onto *those* sequences, and then return both sets of mappings. Specifically, then, you would have something like this: map = blast.BlastxMapping(prot_db) (dna2trans, trans2prot) = map[dna_seq] I think the idea is still a bit half-baked so some critical responses would help me bake it further ;). But it does strike me as being pygr-ish in the sense of chaining graph queries. And the current interface for translated BLASTs is not good, IMO. -- All right, that's it for the night... cheers, --titus --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to pygr-dev@googlegroups.com To unsubscribe from this group, send email to pygr-dev+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en -~----------~----~----~----~------~----~------~--~---
from pygr import blast, seqdb, cnestedlist dna_db = seqdb.SequenceFileDB('data/hbb1_mouse.fa') dna_seq = dna_db['gi|171854975|dbj|AB364477.1|'] prot_db = seqdb.SequenceFileDB('data/sp_all_hbb') rat_prot = prot_db['HBB2_RAT'] map = blast.BlastxMapping(prot_db) dna_matches = map[dna_seq] ### # # build an NLMSA containing all of the annotations mapped onto the # proteins. This is particularly useful because we can now query # this NLMSA by the protein sequence. # trans2prot = cnestedlist.NLMSA('foo', mode='memory') for slice in dna_matches: edges = slice.edges() for (src, dest, edge) in edges: trans2prot += src trans2prot[src] += dest trans2prot.build() ### # # also build an NLMSA containing the translations (src -> seq). We can # query this by DNA sequence. # dna2trans = cnestedlist.NLMSA('foo2', mode='memory') for slice in dna_matches: annot = slice.seq seq = annot.sequence dna2trans += seq dna2trans[seq] += annot dna2trans.build() ### # without any NLMSAs, you have to iterate over all of the results # and check each one. for slice in dna_matches: if rat_prot in slice: edges = slice.edges() print edges # with both NLMSAs, you can query by protein and retrieve the aligned # DNA sequence(s) for src, dest, edge in trans2prot[rat_prot].edges(): print repr(dna2trans[dest].keys()[0]) # ...or you can query by DNA sequence and then retrieve translations and # translations-to-protein matches: for translated_frame in dna2trans[dna_seq]: edges = trans2prot[translated_frame].edges() for (translated_ival, match_ival, edge) in edges: print (translated_ival, match_ival)