Hi Brian,

Thanks a bunch. This is a great explanation that I'm sure will help many
who parse MAF files. I should also point out to others that the total size
of the scaffold (23293914 in this case) is included in the MAF file itself,
so you don't need to look it up elsewhere to perform the computation.
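
For anyone scripting this, here is a minimal Python sketch of reading that
size straight from an alignment line. It assumes the standard MAF "s" line
layout (src, start, size, strand, srcSize, text). The names SLine and
parse_s_line and the sample line are made up for illustration; the sample is
not taken from the insect files.

```python
from typing import NamedTuple


class SLine(NamedTuple):
    src: str
    start: int      # 0-based, relative to the strand given in `strand`
    size: int       # number of non-gap bases in this block
    strand: str     # "+" or "-"
    src_size: int   # total length of the source sequence (e.g. the scaffold)
    text: str       # aligned bases, including "-" gap characters


def parse_s_line(line: str) -> SLine:
    """Split one MAF 's' line into typed fields."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError(f"not a MAF 's' line: {line!r}")
    _, src, start, size, strand, src_size, text = fields
    return SLine(src, int(start), int(size), strand, int(src_size), text)


# Made-up sample line, for illustration only.
rec = parse_s_line("s demo.scaffold_1 10 5 - 100 ACG-TA")
print(rec.src_size)  # 100
```

The sixth field is the one to feed into the subtraction Brian describes.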


Thanks again,
Jaaved

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Brian Raney
Sent: Friday, January 28, 2011 5:40 PM
To: Jaaved Mohammed
Cc: [email protected]
Subject: Re: [Genome] Insect MAF file coordinate-sequence disagreement

Hey Jaaved,

Thanks for the very detailed problem report.  The behavior you are
seeing is due to how MAF coordinates are defined.  If an alignment is
on the negative (-) strand, the coordinates reported are relative to
the reverse-complemented sequence, that is, they count bases from the
*end* of the sequence.  To convert the coordinates in the MAF file to
the coordinates that blat gives you, or to extract sequence from a
FASTA or 2bit file, you need to subtract the end coordinate of the
range from the total size of the scaffold.  That gives you the start
on the positive strand; subtracting the start of the range gives you
the end.
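
As a rough Python sketch of that rule: the helper name minus_to_plus is
hypothetical, and it assumes the range is 0-based and half-open, so adjust
the inputs if your tool prints something else.

```python
def minus_to_plus(start: int, end: int, src_size: int) -> tuple[int, int]:
    """Map a 0-based, half-open minus-strand range onto the plus strand."""
    if not 0 <= start <= end <= src_size:
        raise ValueError("range must satisfy 0 <= start <= end <= src_size")
    # The end of the minus-strand range becomes the plus-strand start.
    return src_size - end, src_size - start


plus_start, plus_end = minus_to_plus(1466195, 1466344, 23293914)
print(plus_start)  # 21827570
```

The mapping is its own inverse, so calling it on a plus-strand range gives
back the minus-strand range.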

So, if your MAF coordinate is scaffold_13337 1466195 1466344, then to
calculate the start coordinate of the region on the positive strand,
you need to take the size of scaffold_13337, which is 23293914, and
subtract the end coordinate of your range from it:

23293914 - 1466344 = 21827570

This is the 0-based coordinate where you can find the sequence from
the MAF block in the FASTA file.  It is the same position as the
21827571 start that blat gave you, written 0-based instead of 1-based.
Remember that the browser reports coordinates in a 1-based system, but
the MAF and PSL files are in 0-based coordinates.
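
A small Python sketch of that last conversion, assuming the 0-based range is
half-open and the 1-based range is fully closed. The name
to_browser_position is illustrative, not a UCSC tool, and the end value
21827720 is taken from the blat result quoted below on the assumption that
it equals the half-open end.

```python
def to_browser_position(chrom: str, start0: int, end0: int) -> str:
    """Format a 0-based half-open range as a 1-based inclusive position."""
    if not 0 <= start0 < end0:
        raise ValueError("expected 0 <= start0 < end0")
    # Only the start shifts by one; the half-open end equals the closed end.
    return f"{chrom}:{start0 + 1}-{end0}"


print(to_browser_position("scaffold_13337", 21827570, 21827720))
# scaffold_13337:21827571-21827720
```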

We realize that these coordinate systems can be confusing, but using
them internally lets the same code handle alignments on either strand.

I hope this answers your questions.  If you have any follow-up
questions, please respond to this list.

Brian

On Thu, Jan 27, 2011 at 2:15 PM, Jaaved Mohammed <[email protected]> wrote:
> Hello,
>
> I am seeing numerous discrepancies between the sequences and coordinates in
> the insect 15-way multiple-alignment files
> (http://hgdownload.cse.ucsc.edu/goldenPath/dm3/multiz15way/), primarily for
> D. ananassae and D. virilis. I will try to explain this as concisely as
> possible; let me know if anything needs clarifying:
>
> For example:
> 1. I extract the MAF block of a popular melanogaster miRNA (dme-mir-289,
> chr3L:13613907-13614035) using the maf_parse program from Adam Siepel's
> lab:
> maf_parse -o MAF --start 13613907 --end 13614035 chr3L.maf
> See the attached dme-mir-289.maf output file. Notice that the droAna3
> sequence in this file is
> "CAGCTCGGGTTTTAGGTTGAGTTTACAGTAAAATAAATATTTAAGTGGAGCCTGCGACTctgctactgccactgc
> cactgccactgccactgccGCTCGGGGAGTCACTTGAGCGTTTGTTGGCACGTAAAAGACATCATAATTAGCATT"
> and the coordinate is scaffold_13337:1466195-1466344 -.
>
> 2. When I extract the sequence of this coordinate from the droAna3.fa file
> (http://hgdownload.cse.ucsc.edu/goldenPath/droAna3/bigZips/) and reverse
> complement it, I get a completely different sequence than that reported
> (see the file droAna3_289_maf.fa).
>
> 3. When I blat the sequence reported in the MAF file against droAna3.fa, I
> get a best scoring coordinate that is different from the MAF coordinate.
> That is, scaffold_13337:21827571-21827720 -. No blat reported coordinate
> agrees with the MAF coordinate (see droAna3_blatOutput.txt for all blat
> reported coordinates).  When I extract the sequence of the blat reported
> coordinate and reverse complement it (see droAna3_289_blat.fa), this
> sequence agrees with the MAF reported sequence.
>
> I can furnish additional examples upon request. I really appreciate any
> time and effort spent helping me explain this bizarre observation.
>
> Thanks,
> Jaaved
>
> --
> Jaaved Mohammed,
> Ph.D. Student of Computational Biology
> Tri-Institutional Training Program in Computational Biology and Medicine
> (Cornell University - Ithaca, Weill Cornell Medical College, and Memorial
> Sloan-Kettering Cancer Center)
>
>
>
> _______________________________________________
> Genome maillist  -  [email protected]
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
>
>