David,
 
Just curious are you using  latest version of TopHat ( 1.2.0 released on 
01/18/11). I was in contact with TopHat team about similar issue of paired end 
reads. They suggested that this concern has been addressed in new version. 
I have to confirm the claim with my run.
 
Vasu Punj 

--- On Wed, 2/23/11, Jeremy Goecks <jeremy.goe...@emory.edu> wrote:


From: Jeremy Goecks <jeremy.goe...@emory.edu>
Subject: Re: [galaxy-user] RNA seq analysis
To: "David Matthews" <d.a.matth...@bristol.ac.uk>
Cc: galaxy-u...@bx.psu.edu
Date: Wednesday, February 23, 2011, 10:05 PM


Hi David, 


This is a really interesting workflow. My comments:


(1) I encourage you to start a discussion about this idea on seqanswers.com; 
you'll reach more people and may have a better discussion there. Ideally, 
you'll get a Tophat developer to chime in on what I perceive to be the main 
issue, which is:





This may seem similar to setting tophat to ignore non-unique reads. However, it 
is not. This approach gives you 10-15% more reads. I think it is because if 
tophat finds (for example) that the forward read maps to one site but the 
reverse read maps to two sites it throws away the whole read.

Remember that Tophat uses Bowtie to map reads, so it would make sense to look 
carefully at the Bowtie documentation to see how it handles paired-end reads. I 
can't find anything that directly addresses your issue. The other thing to 
consider is how Tophat maps reads -- it breaks them up in order to find splice 
junctions -- and so I'm not sure that Tophat/Bowtie is really mapping paired 
reads; it may be doing some hybrid single/paired-end mapping. Also, at one 
time, you could specify Bowtie parameters when running Tophat, but I don't see 
that option anymore.


(2) It would be interesting to know whether you get qualitatively different 
results via Cufflinks (or another transcriptome analysis software package) 
using your method vs. just using Tophat w/ and w/o ignoring non-unique reads. A 
skeptical view of your workflow would note that (a) multi-mapping reads may be 
legitimate and should not be filtered out and (b) Cufflinks/compare/diff 
assembly and quantitation may smooth out stray reads enough so that your method 
isn't necessary.


Thanks for the interesting post,
J.



On Feb 23, 2011, at 9:41 AM, David Matthews wrote:


Hi Jeremy, 


I thought I'd write to get a discussion of a workflow for people doing RNA seq 
that I have found very useful and addresses some issues in mapping mRNA derived 
RNA-seq paired end data to the genome using tophat. Here is the approach I use 
(I have a human mRNA sample deep sequenced with a 56bp paired end read on an 
illumina generating 29 million reads):


1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for each 
sequence read
2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is 
mapped in a proper pair"
3. Use "group" to group the filtered sam file on c1 (which is the 
"bio-sequencer" read number) and set an operation to count on c1 as well. This 
provides a list of the reads and how many times they map to the human genome, 
because you have filtered the set for reads that have a mate pair there will be 
an even number for each read. For most of the reads the number will be 2 
(indicating the forward read maps once and the reverse read maps once and in a 
proper pair) but for reads that map ambiguously the number will be multiples of 
2. If you count these up I find that 18 million reads map once, 1.3 million map 
twice, 400,000 reads map 3 times and so on until you get down to 1 read mapping 
30 times, 1 read mapping 31 times and so on...
4. Filter the reads to remove any reads that map more than 2 times.
5. Use "compare two datasets" to compare your new list of reads that map only 
twice to pull out all the reads in your sam file that only map twice (i.e. the 
mate pairs).
6. You'll need to sort the sam file before you can use it with other 
applications like IGV.


What you end up with is a sam file where all the reads map to one site only and 
all the reads map as a proper pair. This may seem similar to setting tophat to 
ignore non-unique reads. However, it is not. This approach gives you 10-15% 
more reads. I think it is because if tophat finds (for example) that the 
forward read maps to one site but the reverse read maps to two sites it throws 
away the whole read. By filtering the sam file to restrict it to only those 
mappings that make sense you increase the number of unique reads by getting rid 
of irrational mappings.


Has anyone else found this? Does this make sense to anyone else? Am I making a 
huge mistake somewhere?


A nice aspect of this (or at least I think so!) is that by filtering in this 
manner you can also create a sam file of non-unique mappings which you can 
monitor. This can be useful if one or more genes has a problem of generating a 
lot of non-unique maps which may give problems accurately estimating its 
expression. Also, you also get a list of how many multi hits you have in your 
data so you know the scale of the problem.






Best Wishes,
David.


__________________________________
Dr David A. Matthews


Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.


Tel. +44 117 3312058
Fax. +44 117 3312091


d.a.matth...@bristol.ac.uk







-----Inline Attachment Follows-----


_______________________________________________
The Galaxy User list should be used for the discussion
of Galaxy analysis and other features on the public
server at usegalaxy.org. For discussion of local Galaxy
instances and the Galaxy source code, please use the
Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other
Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


      
_______________________________________________
The Galaxy User list should be used for the discussion
of Galaxy analysis and other features on the public
server at usegalaxy.org. For discussion of local Galaxy
instances and the Galaxy source code, please use the
Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other
Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/

Reply via email to