Further to my last email, I've published a workflow (Bristol workflow ....)
which does what I described below - hope this helps in understanding what I'm
on about (!).
On 23 Feb 2011, at 14:41, David Matthews wrote:
> Hi Jeremy,
> I thought I'd write to get a discussion of a workflow for people doing RNA
> seq that I have found very useful and addresses some issues in mapping mRNA
> derived RNA-seq paired end data to the genome using tophat. Here is the
> approach I use (I have a human mRNA sample deep sequenced with a 56bp paired
> end read on an illumina generating 29 million reads):
> 1. Align to hg19 (in my case) using tophat and allowing up to 40 hits for
> each sequence read
> 2. In samtools filter for "read is unmapped", "mate is mapped" and "mate is
> mapped in a proper pair"
> 3. Use "group" to group the filtered sam file on c1 (which is the
> "bio-sequencer" read number) and set an operation to count on c1 as well.
> This provides a list of the reads and how many times they map to the human
> genome, because you have filtered the set for reads that have a mate pair
> there will be an even number for each read. For most of the reads the number
> will be 2 (indicating the forward read maps once and the reverse read maps
> once and in a proper pair) but for reads that map ambiguously the number will
> be multiples of 2. If you count these up I find that 18 million reads map
> once, 1.3 million map twice, 400,000 reads map 3 times and so on until you
> get down to 1 read mapping 30 times, 1 read mapping 31 times and so on...
> 4. Filter the reads to remove any reads that map more than 2 times.
> 5. Use "compare two datasets" to compare your new list of reads that map only
> twice to pull out all the reads in your sam file that only map twice (i.e.
> the mate pairs).
> 6. You'll need to sort the sam file before you can use it with other
> applications like IGV.
> What you end up with is a sam file where all the reads map to one site only
> and all the reads map as a proper pair. This may seem similar to setting
> tophat to ignore non-unique reads. However, it is not. This approach gives
> you 10-15% more reads. I think it is because if tophat finds (for example)
> that the forward read maps to one site but the reverse read maps to two sites
> it throws away the whole read. By filtering the sam file to restrict it to
> only those mappings that make sense you increase the number of unique reads
> by getting rid of irrational mappings.
> Has anyone else found this? Does this make sense to anyone else? Am I making
> a huge mistake somewhere?
> A nice aspect of this (or at least I think so!) is that by filtering in this
> manner you can also create a sam file of non-unique mappings which you can
> monitor. This can be useful if one or more genes has a problem of generating
> a lot of non-unique maps which may give problems accurately estimating its
> expression. Also, you also get a list of how many multi hits you have in your
> data so you know the scale of the problem.
> Best Wishes,
> Dr David A. Matthews
> Senior Lecturer in Virology
> Room E49
> Department of Cellular and Molecular Medicine,
> School of Medical Sciences
> University Walk,
> University of Bristol
> BS8 1TD
> Tel. +44 117 3312058
> Fax. +44 117 3312091
> The Galaxy User list should be used for the discussion
> of Galaxy analysis and other features on the public
> server at usegalaxy.org. For discussion of local Galaxy
> instances and the Galaxy source code, please use the
> Galaxy Development list:
> To manage your subscriptions to this and other
> Galaxy lists, please use the interface at:
The Galaxy User list should be used for the discussion
of Galaxy analysis and other features on the public
server at usegalaxy.org. For discussion of local Galaxy
instances and the Galaxy source code, please use the
Galaxy Development list:
To manage your subscriptions to this and other
Galaxy lists, please use the interface at: