Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs

Michael Lawrence Fri, 02 Aug 2013 14:12:19 -0700

On Fri, Aug 2, 2013 at 10:53 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:


> On 08/02/2013 07:33 AM, Michael Lawrence wrote:
>
>> This has come up several times I think, and I'm totally naive, but what
>> about
>> making R smart enough to do that itself? One nice thing about the
>> vectorization
>> is that R knows how many CHARSXPs it needs to allocate up-front, or at
>> least an
>> upper-bound when they're all unique. Is the issue that the R heap needs
>> to be
>> initialized in a special way and its behavior can't change at run-time?
>> It's a
>> tricky thing to get right, I'm sure.
>>
>
> Currently, R allocates space for a STRSXP containing n = 55780092/2
> elements, but for all R knows these could all be identical and no CHARSXP's
> are allocated; I don't think there's a way to pre-allocate n CHARSXPs. As
> the elements are filled, R sometimes needs to do garbage collection, and it
> has the increasingly onerous task of checking each CHARSXP to see whether
> it is in use (I think -- it seems like an in-use STRSXP might somehow be
> used to short-circuit this?). Maybe it would be helpful to be able to
> pre-allocate space for n unique CHARSXPs.
>
> It seems like R_GC_MEM_GROW could be made accessible via C calls, but
> presumably this is not a good thing for user code to be entrusted with.
>
>

I wasn't suggesting to pre-allocate, because that could end up being way
larger than actually necessary, and as you say, there's no direct way to do
it. Since the memory for the string itself is contiguous with the SEXP
header, there's no way to allocate, say, just the headers, even under the
assumption that all strings are unique. The length of every string would
need to be calculated. I was just thinking that maybe the heap growth could
be accelerated (and the GC made less aggressive) when there was a good
chance a lot of memory is going to be needed. Like that weird "overdrive"
button that I've seen in some cars, presumably for quick acceleration in
certain situations. Even though this is a once-per-session optimization, it
would still help a lot, especially when distributing work among many R
sessions.


Martin
>
>
>>
>> On Fri, Aug 2, 2013 at 4:24 AM, Martin Morgan <mtmor...@fhcrc.org
>> <mailto:mtmor...@fhcrc.org>> wrote:
>>
>>     On 07/31/2013 04:05 PM, Hervé Pagès wrote:
>>
>>         Hi Dario,
>>
>>         Thanks for providing the file.
>>
>>         The file has 55780092 records in it and yes it only contains
>> primary
>>         alignments so this is the best situation for the pairing
>> algorithm.
>>
>>         Here are the timings I get for the readBamGappedAlignmentPairs()
>>         function (renamed readGAlignmentPairsFromBam() in devel) on a
>> 64-bit
>>         Ubuntu server with 384 Gb of RAM:
>>
>>         ## With BioC release (Rsamtools 1.12.3)
>>         ## ------------------------------**__------
>>
>>         ## Timings:
>>         ##   - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
>>         ##     - scanBam:                6 min      (A1)
>>
>>
>>     I haven't looked at Dario's data directly, but a surprising
>> bottleneck is
>>     the creation of large character vectors
>>
>>     $ R --vanilla
>>     [...]
>>      >  i = seq_len(55780092/2)
>>      > system.time(as.character(i))
>>         user  system elapsed
>>       74.880   0.408  75.451
>>
>>     This can be alleviated by telling R that that you're likely to need
>> lots of
>>     memory up front; I alias R so that
>>
>>     $ R --min-vsize=2048M --min-nsize=20M --vanilla
>>     [...]
>>      > i = seq_len(55780092/2)
>>      > system.time(as.character(i))
>>         user  system elapsed
>>       25.269   0.472  25.796
>>
>>     or
>>
>>     $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla
>>     [...]
>>      > i = seq_len(55780092/2)
>>      > system.time(as.character(i))
>>         user  system elapsed
>>       12.196   0.464  12.687
>>
>>     These are documented on ?Memory and with
>>
>>      > R.version.string
>>     [1] "R version 3.0.1 Patched (2013-06-05 r62876)"
>>
>>
>>     Martin
>>
>>
>>         ##     - other:                  4 min 30 s (A2)
>>         ##   - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
>>         ##     - findMateAlignment:      1 min 30 s (B1)
>>         ##     - other:                  4 min      (B2)
>>         ##   Total time:                16 min      (A+B)
>>         ## Memory usage: 13 Gb
>>
>>         ## With BioC devel (Rsamtools 1.13.19)
>>         ## ------------------------------**__-----
>>
>>         ## Timings:
>>         ##   - readGAlignmentsFromBam: 11 min      (A=A1+A2)
>>         ##     - scanBam:               6 min      (A1)
>>         ##     - other:                 5 min      (A2)
>>         ##   - makeGAlignmentPairs:     4 min      (B=B1+B2)
>>         ##     - findMateAlignment:     1 min      (B1)
>>         ##     - other:                 3 min      (B2)
>>         ##   Total time:               15 min      (A+B)
>>         ## Memory usage: 13 Gb
>>
>>         Indeed, not much difference between release and devel :-/
>>
>>         I didn't have the opportunity to do this kind of detailed timing
>> of
>>         the readGAlignmentPairsFromBam() function on such a big file
>> before,
>>         but I find it pretty interesting (I only benchmarked
>> findMateAlignment()
>>         and it was on simulated data).
>>
>>         In particular I'm quite surprised to discover that the pairing
>> algo
>>         (implemented in findMateAlignment) is actually not the bottleneck
>>         (and it being optimized in BioC devel explains the small
>> difference
>>         between total time in release and total time in devel).
>>
>>         Just a note on the timings you sent earlier (I'm copying them here
>>         again):
>>
>>             > system.time(RNAreads <- readBamGappedAlignmentPairs(__**
>> file))
>>
>>                   user  system elapsed
>>                1852.16   59.00 1939.11
>>             > system.time(RNAreads <- readBamGappedAlignments(file))
>>                   user  system elapsed
>>                 192.80    8.12  214.97
>>
>>         Subtracting the 2 times above to infer the time spent for the
>> pairing
>>         is a shortcut that overlooks the following: when
>>         readBamGappedAlignmentPairs() calls readBamGappedAlignments()
>>         internally, it calls it with 'use.names=TRUE', and also with a
>> 'param'
>>         arg that specifies the extra BAM fields, and also with a specific
>> flag
>>         filter. All those things are required for the pairing algo. So a
>> more
>>         sensible comparison is to compare:
>>
>>             RNAreads <- readBamGappedAlignmentPairs(__**file)
>>
>>
>>         versus:
>>
>>             flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE,
>>         hasUnmappedMate=FALSE)
>>             what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm",
>> "mpos")
>>             param0 <- ScanBamParam(flag=flag0, what=what0)
>>             RNAreads <- readBamGappedAlignments(file, use.names=TRUE,
>> param=param0)
>>
>>         This internal call to readBamGappedAlignments() takes more than
>> 70%
>>         of the total time of readBamGappedAlignmentPairs() (11 min / 15
>> min
>>         with BioC devel on my Linux server). This is a little bit
>> surprising
>>         to me and that's something I'd like to investigate.
>>
>>         Cheers,
>>         H.
>>
>>
>>         On 07/30/2013 05:00 PM, Dario Strbenac wrote:
>>
>>             Hi.
>>
>>             The files are
>>
>>             
>> http://www.maths.usyd.edu.au/_**_u/dario/RNAAlignedSorted.bam<http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam>
>>             
>> <http://www.maths.usyd.edu.au/**u/dario/RNAAlignedSorted.bam<http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam>
>> >
>>             http://www.maths.usyd.edu.au/_**
>> _u/dario/RNAAlignedSorted.bam.**__bai<http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam.__bai>
>>
>>             <http://www.maths.usyd.edu.au/**u/dario/RNAAlignedSorted.bam.
>> **bai <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai>>
>>
>>             While you have them, could you also investigate the granges
>> coercion
>>             function
>>             ? It also took half an hour.
>>
>>             ______________________________**____________
>>             From: Hervé Pagès [hpa...@fhcrc.org <mailto:hpa...@fhcrc.org
>> >]
>>
>>             Sent: Tuesday, 30 July 2013 3:29 AM
>>             To: Dario Strbenac
>>             Cc: bioc-devel@r-project.org <mailto:bioc-devel@r-project.**
>> org <bioc-devel@r-project.org>>
>>
>>             Subject: Re: [Bioc-devel] Running Time of
>> readBamGappedAlignmentPairs
>>
>>             On 07/29/2013 12:00 AM, Dario Strbenac wrote:
>>
>>                 Because I only allowed unique mappings, the ratio is 2.
>> After
>>                 installing the
>>                 development package versions, I only got a 10%
>> improvement.
>>
>>
>>             Mmmh that's disappointing. Can you put the file somewhere so
>> I can have
>>             a look? Thanks.
>>
>>             H.
>>
>>
>>                       user  system elapsed
>>                 1681.36 65.79 1752.10 <tel:65.79%201752.10>
>>
>>                 ______________________________**____________
>>                 From: Pages, Herve [hpa...@fhcrc.org <mailto:
>> hpa...@fhcrc.org>]
>>
>>                 Sent: Monday, 29 July 2013 3:29 PM
>>                 To: Dario Strbenac
>>                 Cc: bioc-devel@r-project.org <mailto:
>> bioc-devel@r-project.**org <bioc-devel@r-project.org>>
>>
>>                 Subject: Re: [Bioc-devel] Running Time of
>>                 readBamGappedAlignmentPairs
>>
>>                 Hi Dario,
>>
>>                 The function was optimized in Bioc-devel. Depending on
>> the number
>>                 of records in your BAM file (which is more relevant than
>> its size),
>>                 it should now be between 2x and 5x faster. Please give it
>> a try and
>>                 let us know.
>>
>>                 Also keep in mind that the pairing algo will slow down if
>>                 the "average nb of records per unique QNAME" is 3 or more.
>>                 This was discussed here last month:
>>                 https://stat.ethz.ch/__**pipermail/bioconductor/2013-__**
>> June/053390.html<https://stat.ethz.ch/__pipermail/bioconductor/2013-__June/053390.html>
>>
>>                 <https://stat.ethz.ch/**pipermail/bioconductor/2013-**
>> June/053390.html<https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html>
>> >
>>
>>                 Cheers,
>>                 H.
>>
>>
>>                 ----- Original Message -----
>>                 From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au
>>                 
>> <mailto:dstr7320@uni.sydney.**edu.au<dstr7...@uni.sydney.edu.au>
>> >>
>>                 To: bioc-devel@r-project.org <mailto:
>> bioc-devel@r-project.**org <bioc-devel@r-project.org>>
>>                 Sent: Sunday, July 28, 2013 9:00:30 PM
>>                 Subject: [Bioc-devel] Running Time of
>> readBamGappedAlignmentPairs
>>
>>                 Hello,
>>
>>                 Could readBamGappedAlignmentPairs be optimised ? It's
>> surprising
>>                 that it
>>                 takes an extra 29 minutes to calculate the pairing
>> information,
>>                 for the
>>                 example below. The BAM file is 4 GB in size.
>>
>>                     system.time(RNAreads <- readBamGappedAlignmentPairs(__
>> **file))
>>
>>
>>                       user  system elapsed
>>                 1852.16   59.00 1939.11
>>
>>                     system.time(RNAreads <- readBamGappedAlignments(file))
>>
>>                       user  system elapsed
>>                     192.80    8.12  214.97
>>
>>                 ------------------------------**__--------
>>
>>                 Dario Strbenac
>>                 PhD Student
>>                 University of Sydney
>>                 Camperdown NSW 2050
>>                 Australia
>>
>>                 ______________________________**___________________
>>                 Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.**
>> org <Bioc-devel@r-project.org>>
>>                 mailing list
>>                 
>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>                 
>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>>
>>
>>                 ______________________________**___________________
>>                 Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.**
>> org <Bioc-devel@r-project.org>>
>>                 mailing list
>>                 
>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>
>>                 
>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>>
>>
>>             --
>>             Hervé Pagès
>>
>>             Program in Computational Biology
>>             Division of Public Health Sciences
>>             Fred Hutchinson Cancer Research Center
>>             1100 Fairview Ave. N, M1-B514
>>             P.O. Box 19024
>>             Seattle, WA 98109-1024
>>
>>             E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
>>             Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>             Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>
>>
>>
>>
>>     --
>>     Computational Biology / Fred Hutchinson Cancer Research Center
>>     1100 Fairview Ave. N.
>>     PO Box 19024 Seattle, WA 98109
>>
>>     Location: Arnold Building M1 B861
>>     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>
>>
>>     ______________________________**___________________
>>     Bioc-devel@r-project.org 
>> <mailto:Bioc-devel@r-project.**org<Bioc-devel@r-project.org>>
>> mailing list
>>     
>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>     
>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs

Reply via email to