Hi,
no one has asked about this previously, but in light of Peter Clarke's
use of pygr.parse_blast for tblastn I thought I would post details
about NLMSA's current lack of support for tblastn / blastx.  I
added this to the tracker as issue #44.

-- Chris

-----------------
Right now pygr NLMSA is restricted to 1:1 alignment relations, which
works fine for blastn and blastp, but not tblastn (protein query vs.
nucleotide database translated to protein sequence) or blastx
(nucleotide query translated to protein sequence vs. protein database).


tblastn and blastx are problematic for several reasons:
- the returned alignment is not of the actual query sequence and  
database sequences, but instead of a *translation* (possibly after  
reverse-complementing!) of one side or the other.  Thus the alignment  
results are NOT in the coordinate system of the query and the database  
seqs; instead they involve a new coordinate system (a translation)  
created on the fly.

- this involves a 3:1 alignment relation between nucleotide and
protein sequence.  This is problematic in all sorts of ways, the most
fundamental of which is how to robustly represent the reading frame
"phase" for any given part of the alignment (i.e. the ability to
represent alignment to a "partial codon", which can easily occur when
aligning protein against exons, since a single codon may be split
across an exon-exon junction).

- I think tblastn/blastx imply the need for a separate coordinate
system for this nucleotide vs. protein alignment problem.  For example,
what if the query is a nucleotide sequence and finds a
reverse-complement homology to a protein sequence?  I.e. when the
query is reverse-complemented, it has a translated homology to the
protein sequence.  The result of any alignment query must always be
returned in the same orientation as the user-supplied query, which
means that the homologous protein interval must be returned in
"negative orientation" -- which of course does not exist for a true
protein sequence.
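
To make the coordinate problem above concrete, here is a minimal sketch
of mapping an interval reported in a translated frame back onto the
forward strand of the nucleotide sequence.  This is not pygr code: the
function name and its arguments are made up for illustration, and it
assumes 0-based half-open intervals and the convention that frames
+1..+3 read the forward strand while -1..-3 read the reverse complement
starting from its first base.

```python
def translated_to_nucleotide(aa_start, aa_stop, frame, nt_length):
    """Map a half-open amino-acid interval in a translated frame back to
    forward-strand nucleotide coordinates (hypothetical helper).

    Returns (nt_start, nt_stop, orientation), where orientation is +1 for
    forward frames and -1 for frames read off the reverse complement.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1..+3 or -1..-3")
    if not 0 <= aa_start < aa_stop:
        raise ValueError("expected 0 <= aa_start < aa_stop")

    # Position of the interval on the strand that was actually translated.
    offset = abs(frame) - 1
    start = offset + 3 * aa_start
    stop = offset + 3 * aa_stop
    if stop > nt_length:
        raise ValueError("interval runs past the end of the sequence")

    if frame > 0:
        return start, stop, 1
    # Flip reverse-complement coordinates back onto the forward strand.
    return nt_length - stop, nt_length - start, -1
```

Note that the result only ever lands on whole codons of one frame; it
has no way to express a partial codon, which is exactly the "phase"
problem described above.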

POSSIBLE SOLUTIONS:

I think this would be easy to resolve by using an annotation to  
represent the open reading frame on the protein sequence. The key idea  
is that an annotation is an independent coordinate system, but can be  
converted to the corresponding sequence interval by requesting its  
sequence attribute.  So we could have tblastn return 1:1 alignments of  
nucleotide sequence to an ORF annotation (whose coordinate system  
would be expressed in bp, not aa).  The user would request its  
sequence attribute to obtain the corresponding protein sequence  
interval.  This would work well in both directions (i.e. tblastn, and  
blastx).

The ORF annotation idea solves the "intermediate coordinate system"  
problem nicely: it is a nucleotide coordinate system (which can  
correctly represent either orientation).  But it is bound to the  
protein sequence that it represents, and you can always convert a  
slice of an ORF annotation to the corresponding slice of protein  
sequence by simply accessing its "sequence" attribute.  We could even  
map such ORF annotations directly onto genomic sequence.
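
As a rough sketch of the ORF annotation idea (again, not existing pygr
code; the class ORFSlice, its "phase" property and the helper
orf_annotation are hypothetical names, and the real thing would be built
on pygr's annotation classes), something along these lines:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ORFSlice:
    """Hypothetical slice of an ORF annotation.

    Coordinates are in bp (0-based, half-open) along the coding strand,
    so either orientation can be represented, but the slice stays bound
    to the protein sequence it encodes.
    """
    protein: str
    start: int
    stop: int
    orientation: int = 1  # +1 or -1

    def __neg__(self):
        # Same bp interval, opposite orientation.
        return ORFSlice(self.protein, self.start, self.stop, -self.orientation)

    @property
    def phase(self):
        # Bases missing from the first codon and from the last codon.
        return self.start % 3, (-self.stop) % 3

    @property
    def sequence(self):
        # Protein interval covering every codon the slice touches,
        # including partial codons at either end.
        first = self.start // 3
        last = (self.stop + 2) // 3  # ceiling division
        return self.protein[first:last]


def orf_annotation(protein):
    """Full-length ORF annotation for a protein (3 bp per residue)."""
    return ORFSlice(protein, 0, 3 * len(protein))
```

For example, with orf = orf_annotation("MKTAYIAK"), the slice
ORFSlice(orf.protein, 4, 11) starts and ends in the middle of a codon;
its sequence attribute gives the protein interval "KTA", and negating
the slice flips its orientation without changing that protein interval.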
