[pygr] Re: BLAST commentary

Christopher Lee Thu, 28 May 2009 12:55:31 -0700

This is great, Titus!  Comments below:

On May 27, 2009, at 9:24 PM, C. Titus Brown wrote:


> First, I dislike this interface:
>
>>>> al = blastmap(None, queryDB=db)
>
> in order to distinguish the database query from the sequence query.

Yup, the queryDB argument was just tacked on without any real  
thought.  Making the first argument (seq) optional makes total sense.

>
> How about making it
>
>>>> al = blastmap[some_sequence]

blastmap, like many other things in Pygr (NLMSA etc.), is a graph  
interface.  i.e. it obeys the following interface:

g[src][dest] = edge

which simply reflects the fact it's a many-to-many mapping.  Changing  
g[src] to return an alignment object (which is also a many-to-many  
mapping) would turn it into the following interface

g[src][src][dest] = edge

which makes it redundant in requiring you to requery with the same src  
node.

>>>>
> OR
>
>>>> al = blastmap(seq=some_sequence)

Sure!

>>>>
>
> OR
>
>>>> al = blastmap(queryDB=db)

Sure!

>>>>
> Second, the handling of self-matches when BLASTing within a single
> database is odd.
>
> a) The warning message that you get when doing
>
>   blastmap = blast.BlastMapping(db)
>   blastmap(None, queryDB=db)
>
> doesn't show up when you pick a sequence out of 'db':
>
>   seq = db['some_sequence']
>   blastmap(seq)
>
> ...even though it would be consistent to have it there.

I just tested this, and I get consistent behavior:

 >>> sp = seqdb.SequenceFileDB('data/sp_all_hbb')
 >>> s = sp['HBB1_MOUSE']
 >>> b = blast.BlastMapping(sp)
 >>> b[s]

WARNING: your query sequence is part of this database.  Pygr alignments
normally do not report self-matches, i.e. the alignment of a sequence  
interval
to itself, so only homologies to OTHER sequences in the database
(or other intervals of the query sequence, if they are homologous)  
will be
reported (or an empty query result if no such homologies are found).
To report ALL homologies, including the self-match, simply create a new
sequence object and use that as your query, e.g.
query = sequence.Sequence(str(seq),"myquery")
results = db.blast(query)

To turn off this message, use the verbose=False option
INFO blast.run_formatdb: Building index: formatdb -i data/sp_all_hbb - 
n data/sp_all_hbb -o T
<pygr.cnestedlist.NLMSASlice object at 0x12ec3f0>
 >>> b(s)

WARNING: your query sequence is part of this database.  Pygr alignments
normally do not report self-matches, i.e. the alignment of a sequence  
interval
to itself, so only homologies to OTHER sequences in the database
(or other intervals of the query sequence, if they are homologous)  
will be
reported (or an empty query result if no such homologies are found).
To report ALL homologies, including the self-match, simply create a new
sequence object and use that as your query, e.g.
query = sequence.Sequence(str(seq),"myquery")
results = db.blast(query)

To turn off this message, use the verbose=False option
<pygr.cnestedlist.NLMSA object at 0x1375648>

If you queried with a sequence that was *not* part of the blast  
database, or you queried with a string, of course you won't get this  
warning.  Is it possible that's what happened?

>
>
> b) I think it's a bad idea to throw away self-matches anyway.  If  
> it's a
> choice between pygr internal consistency ("we don't put self-matches
> into the NLMSA") and pygr consistency with BLAST ("BLAST returns the
> match of seq with itself; why am I not seeing that?") I think we  
> should
> go with the latter.

First I should explain some background about NLMSA: it is a true  
multiple sequence alignment (MSA) storage, which means alignments are  
stored not in pairwise fashion (A-->B, B-->A) but in a "star topology"  
with all sequences aligned against an "alignment coordinate  
system" (for technical reasons called LPO, which stands for  
"linearized partial order", see my POA paper if you want to know  
why).  Every sequence stores a mapping to the LPO; the LPO stores a  
mapping to every sequence.  Therefore the alignment storage  
requirements scale as O(N), where N is the number of sequences aligned  
to each other in a given region, whereas pairwise storage would  
require O(N^2) space.  That is one of the advantages of a true MSA.

Thus, each alignment query is resolved in two steps: A-->L; L-->{set  
of sequences}.  Note that the {set of sequences} of course includes  
A.  Thus the self-alignment A-->A is *always* present, but it makes  
little sense to report this trivial result (a sequence is always  
"aligned" to itself in an MSA).

I think the cognitive dissonance you are encountering in this blast  
case is that blast is actually a pairwise alignment, not an MSA.   
NLMSA has a pairwiseMode, and we do use that for BLAST; I guess it's  
conceivable that we could allow reporting of self-matches when  
pairwiseMode=True.



>
>
> c) Note the duplicates in the doctest output, below:
>
>>>> al = blastmap(None, queryDB=db)
>>>> for seq in db.values():
> ...    for (src, dest, edge) in al[seq].edges():
> ...       print repr(src), repr(dest)
> gapped[0:40] ungapped[0:40]
> gapped[0:40] ungapped[0:40]                          ## DUP
> gapped[44:74] ungapped[40:70]
> gapped[44:74] ungapped[40:70]                          ## DUP
> ungapped[0:40] gapped[0:40]
> ungapped[0:40] gapped[0:40]                          ## DUP
> ungapped[40:70] gapped[44:74]
> ungapped[40:70] gapped[44:74]                          ## DUP
>
> Those are there because adding 'ungapped' -> 'gapped' into an NLMSA
> automatically adds 'gapped' -> 'ungapped', too (i.e. a->b adds b->a).
> Because BLAST isn't a symmetric algorithm -- a matching b doesn't
> necessarily mean b matches a -- this is sort of legitimate; the edge
> info may in fact be different between the "DUP" rows above.  It's  
> still
> confusing, though.

Yes, as the Pygr docs discuss, the proper way of storing BLAST results  
in an NLMSA is to use bidirectional=False.  I just looked at blast.py,  
and it appears we are not using that.  This surprises me, since I took  
the trouble to implement that mode and document how that is the proper  
way to handle blast.  I think we could use the bidirectional=False  
option to eliminate these duplicates.


>
>
> Third, I think the BlastxMapping interface is not yet so useful.   
> Right
> now it returns a list of NLMSASlices, which are hard to use without a
> reasonably complete understanding of NLMSAs.  (At least, it was tricky
> for me to figure out.)  I understand *why* this is, but I came up with
> an alternative.
>
> ...
>
> If, however, I construct two NLMSAs, one an MSA containing the
> translated frames mapped to the proteins, and the other an annotation
> map containing dna mapped to translations, I can do the queries much
> more pygr-ishly.  For example, to query by protein and retrieve the
> aligned DNA sequence, I could do:
>
>   for src, dest, edge in trans2prot[rat_prot].edges():
>       print repr(dna2trans[dest].keys()[0])
>
> or I could query by DNA sequence, and retrieve protein matches:
>
>   for translated_frame in dna2trans[dna_seq]:
>       edges = trans2prot[translated_frame].edges()
>       for (translated_ival, match_ival, edge) in edges:
>           print (translated_ival, match_ival)
>
> Again, fully functioning code to do all of this (including build the
> NLMSAs) is attached.
>
> Does this seem like a reasonable interface to go with?  If so, we  
> could
> change the code to produce six translated frames per DNA sequence (->
> annotation objects), map the protein matches onto *those* sequences,
> and then return both sets of mappings.

This is an interesting idea.  I considered something like the 6-frame  
idea, but rejected it because I thought an "ORF annotation" should  
correspond directly to how people actually annotate ORFs in bacterial  
genomes, i.e. each gene would have a separate ORF annotation.  Under  
the 6-frame model, a bacterial genome would have only 6 ORF  
annotations, and each gene would just be a slice of one of those.  We  
would also have to decide how to translate stop codons (each ORF  
annotation would presumably contain thousands of these) -- maybe just  
as '*'?

Adding the second layer mapping is a very interesting idea -- it  
captures the idea that blastx isn't strictly a nucleotide sequence A -- 
 > protein B mapping, but really nt A --> translation A' --> protein B  
mapping.  Two layers, as you pointed out.  The question is how to make  
the interface clean: logical and simple.

-- Chris

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-dev@googlegroups.com
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

[pygr] Re: BLAST commentary

Reply via email to