Hi Dario,
Thanks for providing the file.
The file has 55780092 records in it and yes it only contains primary
alignments so this is the best situation for the pairing algorithm.
Here are the timings I get for the readBamGappedAlignmentPairs()
function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit
Ubuntu server with 384 Gb of RAM:
## With BioC release (Rsamtools 1.12.3)
## ------------------------------------
## Timings:
## - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
## - scanBam: 6 min (A1)
## - other: 4 min 30 s (A2)
## - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
## - findMateAlignment: 1 min 30 s (B1)
## - other: 4 min (B2)
## Total time: 16 min (A+B)
## Memory usage: 13 Gb
## With BioC devel (Rsamtools 1.13.19)
## -----------------------------------
## Timings:
## - readGAlignmentsFromBam: 11 min (A=A1+A2)
## - scanBam: 6 min (A1)
## - other: 5 min (A2)
## - makeGAlignmentPairs: 4 min (B=B1+B2)
## - findMateAlignment: 1 min (B1)
## - other: 3 min (B2)
## Total time: 15 min (A+B)
## Memory usage: 13 Gb
Indeed, not much difference between release and devel :-/
I didn't have the opportunity to do this kind of detailed timing of
the readGAlignmentPairsFromBam() function on such a big file before,
but I find it pretty interesting (I only benchmarked findMateAlignment()
and it was on simulated data).
In particular I'm quite surprised to discover that the pairing algo
(implemented in findMateAlignment) is actually not the bottleneck
(and it being optimized in BioC devel explains the small difference
between total time in release and total time in devel).
Just a note on the timings you sent earlier (I'm copying them here
again):
> system.time(RNAreads <- readBamGappedAlignmentPairs(file))
user system elapsed
1852.16 59.00 1939.11
> system.time(RNAreads <- readBamGappedAlignments(file))
user system elapsed
192.80 8.12 214.97
Subtracting the 2 times above to infer the time spent for the pairing
is a shortcut that overlooks the following: when
readBamGappedAlignmentPairs() calls readBamGappedAlignments()
internally, it calls it with 'use.names=TRUE', and also with a 'param'
arg that specifies the extra BAM fields, and also with a specific flag
filter. All those things are required for the pairing algo. So a more
sensible comparison is to compare:
RNAreads <- readBamGappedAlignmentPairs(file)
versus:
flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE,
hasUnmappedMate=FALSE)
what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos")
param0 <- ScanBamParam(flag=flag0, what=what0)
RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0)
This internal call to readBamGappedAlignments() takes more than 70%
of the total time of readBamGappedAlignmentPairs() (11 min / 15 min
with BioC devel on my Linux server). This is a little bit surprising
to me and that's something I'd like to investigate.
Cheers,
H.
On 07/30/2013 05:00 PM, Dario Strbenac wrote:
Hi.
The files are
http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam
http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai
While you have them, could you also investigate the granges coercion function ?
It also took half an hour.
________________________________________
From: Hervé Pagès [hpa...@fhcrc.org]
Sent: Tuesday, 30 July 2013 3:29 AM
To: Dario Strbenac
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
On 07/29/2013 12:00 AM, Dario Strbenac wrote:
Because I only allowed unique mappings, the ratio is 2. After installing the
development package versions, I only got a 10% improvement.
Mmmh that's disappointing. Can you put the file somewhere so I can have
a look? Thanks.
H.
user system elapsed
1681.36 65.79 1752.10
________________________________________
From: Pages, Herve [hpa...@fhcrc.org]
Sent: Monday, 29 July 2013 3:29 PM
To: Dario Strbenac
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
Hi Dario,
The function was optimized in Bioc-devel. Depending on the number
of records in your BAM file (which is more relevant than its size),
it should now be between 2x and 5x faster. Please give it a try and
let us know.
Also keep in mind that the pairing algo will slow down if
the "average nb of records per unique QNAME" is 3 or more.
This was discussed here last month:
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html
Cheers,
H.
----- Original Message -----
From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au>
To: bioc-devel@r-project.org
Sent: Sunday, July 28, 2013 9:00:30 PM
Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
Hello,
Could readBamGappedAlignmentPairs be optimised ? It's surprising that it takes
an extra 29 minutes to calculate the pairing information, for the example
below. The BAM file is 4 GB in size.
system.time(RNAreads <- readBamGappedAlignmentPairs(file))
user system elapsed
1852.16 59.00 1939.11
system.time(RNAreads <- readBamGappedAlignments(file))
user system elapsed
192.80 8.12 214.97
--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel