Hello Jianpeng,

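The numbers 87 and 93 are the actual maximum lengths of the aligned regions on either side of the junction. A read that spans the junction splits its 100 bases between the two sides, so neither side can reach the full read length. If you want to examine your paired-end data statistically, the "NGS: Picard (beta)" tool group has several tool options.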
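
To illustrate how maximum overhangs come out shorter than the read length, here is a small Python sketch. The overhang values are invented to reproduce 87 and 93 and are not taken from your data, and it assumes each read aligns over its full 100 bases across this one junction only:

```python
READ_LENGTH = 100

# Hypothetical numbers of bases aligned on the left side of the junction,
# one value per read that spans it (made up for illustration).
left_overhangs = [87, 50, 7]

# Whatever is not on the left side of the junction lies on the right side.
right_overhangs = [READ_LENGTH - left for left in left_overhangs]

# Each BED block is as long as the largest overhang seen on that side.
block_sizes = (max(left_overhangs), max(right_overhangs))
print(block_sizes)  # (87, 93)
```

The two block sizes usually come from different reads, which is why they do not have to sum to 100.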


However, examining the track at the gene/transcript level for a few well-characterized gene boundaries is really the best way to understand how the file describes the data. A browser with your tracks loaded (Trackster or UCSC), the text data files, and the Cufflinks manual/FAQ will likely address most of your questions, or at least provide a good orientation. The visual portion of this helps a great deal.
http://cufflinks.cbcb.umd.edu/faq.html

To address the visualization at UCSC, I can point you to their User Guide: http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html and contact mailing list: http://genome.ucsc.edu/contacts.html

Good luck with your project. Please remember to keep questions to the Galaxy team on our mailing lists so that our entire team and community can contribute/benefit.

Thanks!

Jen
Galaxy team

-------- Original Message --------
Subject: RE: [galaxy-dev] Tophat output
Date: Wed, 18 Apr 2012 15:23:43 +0000
From: Xu, Jianpeng <jianpeng...@emory.edu>
To: Jennifer Jackson <j...@bx.psu.edu>

Hi Jennifer,

In the history, I have the splice junction file, and I clicked it to show the "display at UCSC main" link. Then I clicked "display at UCSC main", which opened the UCSC Genome Browser. Since this is my first time visualizing splice junctions, can you give me more instructions on how to visualize them with the UCSC Genome Browser?
Thanks,

Jianpeng

On 4/18/12 7:58 AM, Xu, Jianpeng wrote:
Thanks a lot, Jennifer. It is very useful and helpful. I got the result using
paired-end reads. The read length for both ends is 100 bp.

chr20 199821 204701 JUNC00000001 17  - 199821 204701 255,0,0 2  87,93  0,4787

Since the read length is 100 bp, why are 87 and 93 less than 100?

Below is a single-end read result:

chr11 60277777 60278396 JUNC00000001 1 + 60277777 60278396 255,0,0 2 22,28 0,591

Can you explain a little bit more?

Thanks,

Jianpeng
________________________________________
From: Jennifer Jackson [j...@bx.psu.edu]
Sent: Wednesday, April 18, 2012 2:56 AM
To: Xu, Jianpeng
Cc: galaxy-...@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Tophat output

Hello Jianpeng,

The output files from TopHat are described on the TopHat tool form:

--- quote ---
Outputs

Tophat produces two output files:

      junctions -- A UCSC BED track of junctions reported by TopHat. Each
junction consists of two connected BED blocks, where each block is as
long as the maximal overhang of any read spanning the junction. The
score is the number of alignments spanning the junction.
      accepted_hits -- A list of read alignments in BAM format.

Two other possible outputs, depending on the options you choose, are
insertions and deletions, both of which are in BED format.
-------------

BED format is described in the Galaxy wiki, which includes links to the
UCSC BED format description (they authored the format).
http://wiki.g2.bx.psu.edu/Learn/Datatypes#Bed

Two important rules to remember about BED format:

rule #1: coordinate data is already reported with respect to the (+) strand

rule #2: "start" is defined as the smallest coordinate, "end" is defined
as the largest coordinate, due to the rule #1.

BED files use 0-based, half-open coordinates. The "start" position is
0-based in data files, but in browsers the data will display as 1-based.
This means you'll need to add "1" to any "start" coordinate in a .bed
file to locate it in a display application. The two will not and should
not match. The "end" coordinate is not part of the interval (this is the
half-open part), so its value equals the 1-based position of the last
base, and it will match between data files and display applications.
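
A minimal Python sketch of that conversion, using the junction from this thread (the function name bed_to_browser is made up for illustration):

```python
def bed_to_browser(chrom, start, end):
    """Convert a 0-based, half-open BED interval into the 1-based,
    inclusive position string (chrom:first-last) a browser displays."""
    # Only the start shifts by one; the end value is unchanged.
    return f"{chrom}:{start + 1}-{end}"


print(bed_to_browser("chr20", 199821, 204701))  # chr20:199822-204701
```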

-------------
Using the first data row as an example and this information, we can tell
that:

chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787

   *  column 5 is 'score', or 'number of alignments spanning the
junction'.  In this case, "17" alignments.

   * column 11 is the blockSizes, or 'read maximal overhang' of the
junctions (max alignment length). The first is 87 bases, the second 93
bases.

   * column 12 is the blockStarts, or 'overhang start' of the junctions
(alignment start), given as offsets from the "start" in column 2. The
first is 0, the second 4787 bases. The first is always 0. The second is
the first block size plus the 'intron' length, so here the 'intron' is
(4787 - 87) = 4700 bases.

Some calculations can be done with these numbers with respect to the
overall position of the junction already defined in columns 1,2,3
(chrom, start, end): chr20:199821-204701 (-). Together they give the
location of the predicted splice sites, the flanking aligned regions,
and the (presumed) 'intron'. This example looks tricky because the
alignment is on the (-) strand, but rule #1 still applies: block starts
are always offsets from the "start" in column 2, counted left to right
along the (+) strand, so no backwards calculation is needed. The strand
only tells you which block comes first in the transcript (for a (-)
strand junction, the right-hand block). When you visualize the data the
concept will make more sense and it is definitely worth learning about.

Brief explanation: The first block start is 0, which literally means
that the first block begins at the very start of the junction (0-based),
at chr20 position 199,821. This alignment continues for 87 bases and
stops at (199821 + 87) = 199908. This is where the first splice would be
present. The second block start is 4787, so the second block begins at
(199821 + 4787) = 204608. This is where the second splice would be
present. This alignment continues for 93 bases, which places the end at
(204608 + 93) = 204701, the same as the reported global junction end
position. The region between the two blocks, 199908 to 204608, is the
(presumed) 'intron' of (204608 - 199908) = 4700 bases. And, it all adds
up.
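
The arithmetic for this row can be sketched in Python. The sketch follows the UCSC BED definition, in which each block start is an offset from the start coordinate in column 2 whatever the strand; the function name block_intervals is hypothetical:

```python
def block_intervals(chrom_start, block_sizes, block_starts):
    """Return one (start, end) genomic interval per BED block,
    in 0-based, half-open coordinates."""
    return [(chrom_start + offset, chrom_start + offset + size)
            for size, offset in zip(block_sizes, block_starts)]


blocks = block_intervals(199821, [87, 93], [0, 4787])
print(blocks)  # [(199821, 199908), (204608, 204701)]

# The gap between the two blocks is the presumed intron.
intron_start, intron_end = blocks[0][1], blocks[1][0]
print(intron_start, intron_end, intron_end - intron_start)  # 199908 204608 4700
```

The end of the last block equals the end coordinate in column 3, which is a quick sanity check for any row.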

Trackster would be a good place to start for "Visualization" (use the
top menu bar link). The dataset can also be saved as a regular .bed file
and loaded as a custom track into the UCSC Genome Browser (if the direct
link is not fully configured yet).

Hopefully this helps,

Jen
Galaxy team

On 4/17/12 7:20 AM, Xu, Jianpeng wrote:
Hi,

I have installed a local Galaxy. I used TopHat to do the RNA-seq
alignment and got an output file: splice junctions in BED format.

I cannot understand it clearly. What do the numbers 17, 14 ... in
column 5 mean? What does the 87,93 mean? What does the 0,4787 mean?
Can you explain a little bit to me? Which tool can be used to view this
file?

Thanks,

track name=junctions description="TopHat junctions"
chr20 199821 204701 JUNC00000001 17 - 199821 204701 255,0,0 2 87,93 0,4787
chr20 204631 205520 JUNC00000002 14 - 204631 205520 255,0,0 2 96,87 0,802
chr20 205428 205775 JUNC00000003 9 - 205428 205775 255,0,0 2 92,91 0,256
chr20 205699 205958 JUNC00000004 15 - 205699 205958 255,0,0 2 87,92 0,167
chr20 205929 207067 JUNC00000005 31 - 205929 207067 255,0,0 2 95,97 0,1041
chr20 206977 207909 JUNC00000006 19 - 206977 207909 255,0,0 2 93,97 0,835
chr20 207884 212679 JUNC00000007 15 - 207884 212679 255,0,0 2 87,76 0,4719
chr20 207910 218238 JUNC00000008 1 - 207910 218238 255,0,0 2 61,39 0,10289
chr20 212628 218293 JUNC00000009 28 - 212628 218293 255,0,0 2 94,94 0,5571




___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

    http://lists.bx.psu.edu/

--
Jennifer Jackson
http://galaxyproject.org


___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/
