[pygr] Re: BLAST commentary

C. Titus Brown Thu, 28 May 2009 13:27:49 -0700

On Thu, May 28, 2009 at 12:55:08PM -0700, Christopher Lee wrote:
-> On May 27, 2009, at 9:24 PM, C. Titus Brown wrote:
-> 
-> > First, I dislike this interface:
-> >
-> >>>> al = blastmap(None, queryDB=db)
-> >
-> > in order to distinguish the database query from the sequence query.
-> 
-> Yup, the queryDB argument was just tacked on without any real  
-> thought.  Making the first argument (seq) optional makes total sense.
-> 
-> >
-> > How about making it
-> >
-> >>>> al = blastmap[some_sequence]
-> 
-> blastmap, like many other things in Pygr (NLMSA etc.), is a graph  
-> interface.  i.e. it obeys the following interface:
-> 
-> g[src][dest] = edge
-> 
-> which simply reflects the fact it's a many-to-many mapping.  Changing  
-> g[src] to return an alignment object (which is also a many-to-many  
-> mapping) would turn it into the following interface
-> 
-> g[src][src][dest] = edge
-> 
-> which makes it redundant in requiring you to requery with the same src  
-> node.


Oh, whups, of course you're right.

-> >>>> al = blastmap(seq=some_sequence)
-> 
-> Sure!
-> 
-> >>>>
-> >
-> > OR
-> >
-> >>>> al = blastmap(queryDB=db)
-> 
-> Sure!

OK -- shall I make the appropriate changes?

-> > Second, the handling of self-matches when BLASTing within a single
-> > database is odd.
-> >
-> > a) The warning message that you get when doing
-> >
-> >   blastmap = blast.BlastMapping(db)
-> >   blastmap(None, queryDB=db)
-> >
-> > doesn't show up when you pick a sequence out of 'db':
-> >
-> >   seq = db['some_sequence']
-> >   blastmap(seq)
-> >
-> > ...even though it would be consistent to have it there.
-> 
-> I just tested this, and I get consistent behavior:
-> 
->  >>> sp = seqdb.SequenceFileDB('data/sp_all_hbb')
->  >>> s = sp['HBB1_MOUSE']
->  >>> b = blast.BlastMapping(sp)
->  >>> b[s]

[ ... ]

-> If you queried with a sequence that was *not* part of the blast  
-> database, or you queried with a string, of course you won't get this  
-> warning.  Is it possible that's what happened?

Hm, I didn't recheck the behavior since Marek's blast_test branch was
merged.  Maybe it got updated then, or perhaps I was mistaken ;)

-> > b) I think it's a bad idea to throw away self-matches anyway.  If  
-> > it's a
-> > choice between pygr internal consistency ("we don't put self-matches
-> > into the NLMSA") and pygr consistency with BLAST ("BLAST returns the
-> > match of seq with itself; why am I not seeing that?") I think we  
-> > should
-> > go with the latter.

[ ... ]

-> I think the cognitive dissonance you are encountering in this blast  
-> case is that blast is actually a pairwise alignment, not an MSA.   
-> NLMSA has a pairwiseMode, and we do use that for BLAST; I guess it's  
-> conceivable that we could allow reporting of self-matches when  
-> pairwiseMode=True.

OK, I see.  Hmm.  I guess the options are,

 - keep it as it is
 - allow reporting of self-matches for pairwiseMode=True
 - add a different keyword argument for reporting of self-matches

I guess I would prefer #3, or #1, over #2 -- #3 is more explicit, #1 is
document-able, while #2 changes existing behavior.

-> > c) Note the duplicates in the doctest output, below:
-> >
-> >>>> al = blastmap(None, queryDB=db)
-> >>>> for seq in db.values():
-> > ...    for (src, dest, edge) in al[seq].edges():
-> > ...       print repr(src), repr(dest)
-> > gapped[0:40] ungapped[0:40]
-> > gapped[0:40] ungapped[0:40]                          ## DUP

[ ... munch ... ]

-> Yes, as the Pygr docs discuss, the proper way of storing BLAST results  
-> in an NLMSA is to use bidirectional=False.  I just looked at blast.py,  
-> and it appears we are not using that.  This surprises me, since I took  
-> the trouble to implement that mode and document how that is the proper  
-> way to handle blast.  I think we could use the bidirectional=False  
-> option to eliminate these duplicates.

OK.  Shall I patch?

-> > Third, I think the BlastxMapping interface is not yet so useful.   
-> > Right
-> > now it returns a list of NLMSASlices, which are hard to use without a
-> > reasonably complete understanding of NLMSAs.  (At least, it was tricky
-> > for me to figure out.)  I understand *why* this is, but I came up with
-> > an alternative.
-> >
-> > ...
-> >
-> > If, however, I construct two NLMSAs, one an MSA containing the
-> > translated frames mapped to the proteins, and the other an annotation
-> > map containing dna mapped to translations, I can do the queries much
-> > more pygr-ishly.  For example, to query by protein and retrieve the
-> > aligned DNA sequence, I could do:
-> >
-> >   for src, dest, edge in trans2prot[rat_prot].edges():
-> >       print repr(dna2trans[dest].keys()[0])
-> >
-> > or I could query by DNA sequence, and retrieve protein matches:
-> >
-> >   for translated_frame in dna2trans[dna_seq]:
-> >       edges = trans2prot[translated_frame].edges()
-> >       for (translated_ival, match_ival, edge) in edges:
-> >           print (translated_ival, match_ival)
-> >
-> > Again, fully functioning code to do all of this (including build the
-> > NLMSAs) is attached.
-> >
-> > Does this seem like a reasonable interface to go with?  If so, we  
-> > could
-> > change the code to produce six translated frames per DNA sequence (->
-> > annotation objects), map the protein matches onto *those* sequences,
-> > and then return both sets of mappings.
-> 
-> This is an interesting idea.  I considered something like the 6-frame  
-> idea, but rejected it because I thought an "ORF annotation" should  
-> correspond directly to how people actually annotate ORFs in bacterial  
-> genomes, i.e. each gene would have a separate ORF annotation.  Under  
-> the 6-frame model, a bacterial genome would have only 6 ORF  
-> annotations, and each gene would just be a slice of one of those.  We  
-> would also have to decide how to translate stop codons (each ORF  
-> annotation would presumably contain thousands of these) -- maybe just  
-> as '*'?
-> 
-> Adding the second layer mapping is a very interesting idea -- it  
-> captures the idea that blastx isn't strictly a nucleotide sequence A -- 
->  > protein B mapping, but really nt A --> translation A' --> protein B  
-> mapping.  Two layers, as you pointed out.  The question is how to make  
-> the interface clean: logical and simple.

OK, sounds like you like the basic idea, great!  The only problem I
clearly foresee is how to allow the passing-in of prebuilt NLMSAs (e.g.
for people who want to save the results to disk)... any thoughts?

Do you want to work on a prototype, or shall I?  I should be able to
rough something out over the next few days.  On the other hand I do
enjoy critiquing your code ;) ;).

cheers,
--titus
-- 
C. Titus Brown, c...@msu.edu

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-dev@googlegroups.com
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

[pygr] Re: BLAST commentary

Reply via email to