Hi, no one has asked about this previously, but in light of Peter Clarke's use of pygr.parse_blast for tblastn I thought I would post details about NLMSA's current lack of support for tblastn / blastx. I added this to the tracker as issue #44.
-- Chris

-----------------

Right now pygr NLMSA is restricted to 1:1 alignment relations, which works fine for blastn and blastp, but not for tblastn (protein query vs. nucleotide database translated to protein sequence) or blastx (translated nucleotide query vs. protein database). tblastn and blastx are problematic for several reasons:

- the returned alignment is not of the actual query and database sequences, but instead of a *translation* (possibly after reverse-complementing!) of one side or the other. Thus the alignment results are NOT in the coordinate system of the query and the database seqs; instead they involve a new coordinate system (a translation) created on the fly.

- this involves a 3:1 alignment relation between nucleotide and protein sequence. This is problematic in all sorts of ways, the most fundamental of which is how to robustly represent the reading frame "phase" for any given part of the alignment (i.e. the ability to represent alignment to a "partial codon", which can easily occur when aligning protein against exons, since a single codon may be split across an exon-exon junction).

- I think tblastn/blastx imply the need for a separate coordinate system for this nucleotide vs. protein alignment problem. For example, what if the query is a nucleotide sequence and finds a reverse-complement homology to a protein sequence? I.e. when the query is reverse-complemented, it has a translated homology to the protein sequence. The result of any alignment query must always be returned in the same orientation as the user-supplied query, which means that the homologous protein interval must be returned in "negative orientation" -- which of course does not exist for a true protein sequence.

POSSIBLE SOLUTIONS:

I think this would be easy to resolve by using an annotation to represent the open reading frame on the protein sequence.
The key idea is that an annotation is an independent coordinate system, but can be converted to the corresponding sequence interval by requesting its sequence attribute. So we could have tblastn return 1:1 alignments of nucleotide sequence to an ORF annotation (whose coordinate system would be expressed in bp, not aa). The user would request its sequence attribute to obtain the corresponding protein sequence interval. This would work well in both directions (i.e. tblastn and blastx).

The ORF annotation idea solves the "intermediate coordinate system" problem nicely: it is a nucleotide coordinate system (which can correctly represent either orientation), but it is bound to the protein sequence that it represents, and you can always convert a slice of an ORF annotation to the corresponding slice of protein sequence by simply accessing its "sequence" attribute. We could even map such ORF annotations directly onto genomic sequence.
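
To make the ORF coordinate system idea concrete, here is a minimal sketch in plain Python. It is not pygr code and does not use the pygr annotation classes: the class OrfInterval and its forward, phase and sequence members are hypothetical names invented for illustration. It also assumes that reverse orientation is stored as negated coordinates, i.e. the reverse of [start, stop) is [-stop, -start). The point is only to show that a bp-based interval bound to a protein can carry orientation and partial-codon phase, and can still be converted to a protein slice on request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfInterval:
    """Hypothetical slice of an ORF coordinate system, measured in bp.

    Forward coordinates run from 0 to 3 * len(protein). Reverse
    orientation is stored as negated coordinates (an assumption of
    this sketch): the reverse of [start, stop) is [-stop, -start).
    """
    protein: str
    start: int
    stop: int

    def __post_init__(self):
        n = 3 * len(self.protein)
        if not (self.start < self.stop and -n <= self.start and self.stop <= n):
            raise ValueError("interval is empty or outside the ORF")
        if self.start < 0 < self.stop:
            raise ValueError("interval may not span the origin")

    @property
    def orientation(self):
        """+1 for forward, -1 for reverse orientation."""
        return -1 if self.start < 0 else 1

    def forward(self):
        """Return (start, stop) in forward bp coordinates."""
        if self.start < 0:
            return -self.stop, -self.start
        return self.start, self.stop

    def __neg__(self):
        """Same bases, opposite orientation."""
        return OrfInterval(self.protein, -self.stop, -self.start)

    @property
    def phase(self):
        """(bases missing from the first codon, bases missing from the last)."""
        start, stop = self.forward()
        return start % 3, (-stop) % 3

    @property
    def sequence(self):
        """Protein residues touched by this interval, partial codons included."""
        start, stop = self.forward()
        return self.protein[start // 3:(stop + 2) // 3]


protein = "MKTAYIAK"                # 8 aa, so the ORF spans 24 bp
hit = OrfInterval(protein, 4, 11)   # starts and ends inside a codon
rev = -hit                          # same bases, "negative orientation"

assert hit.sequence == "KTA"        # residues 1..3 are touched
assert hit.phase == (1, 1)          # one base missing at each end
assert (rev.start, rev.stop) == (-11, -4)
assert rev.sequence == hit.sequence
```

In this sketch the negative-orientation interval is perfectly well defined, because orientation lives in the bp coordinate system rather than on the protein itself; the protein is only ever reached through the sequence attribute.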
