This has come up several times I think, and I'm totally naive, but what
about making R smart enough to do that itself? One nice thing about the
vectorization is that R knows how many CHARSXPs it needs to allocate
up-front, or at least an upper-bound when they're all unique. Is the issue
that the R heap needs to be initialized in a special way and its behavior
can't change at run-time? It's a tricky thing to get right, I'm sure.


On Fri, Aug 2, 2013 at 4:24 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 07/31/2013 04:05 PM, Hervé Pagès wrote:
>
>> Hi Dario,
>>
>> Thanks for providing the file.
>>
>> The file has 55780092 records in it and yes it only contains primary
>> alignments so this is the best situation for the pairing algorithm.
>>
>> Here are the timings I get for the readBamGappedAlignmentPairs()
>> function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit
>> Ubuntu server with 384 Gb of RAM:
>>
>> ## With BioC release (Rsamtools 1.12.3)
>> ## ------------------------------**------
>> ## Timings:
>> ##   - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
>> ##     - scanBam:                6 min      (A1)
>>
>
> I haven't looked at Dario's data directly, but a surprising bottleneck is
> the creation of large character vectors
>
> $ R --vanilla
> [...]
> >  i = seq_len(55780092/2)
> > system.time(as.character(i))
>    user  system elapsed
>  74.880   0.408  75.451
>
> This can be alleviated by telling R that that you're likely to need lots
> of memory up front; I alias R so that
>
> $ R --min-vsize=2048M --min-nsize=20M --vanilla
> [...]
> > i = seq_len(55780092/2)
> > system.time(as.character(i))
>    user  system elapsed
>  25.269   0.472  25.796
>
> or
>
> $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla
> [...]
> > i = seq_len(55780092/2)
> > system.time(as.character(i))
>    user  system elapsed
>  12.196   0.464  12.687
>
> These are documented on ?Memory and with
>
> > R.version.string
> [1] "R version 3.0.1 Patched (2013-06-05 r62876)"
>
>
> Martin
>
>
>  ##     - other:                  4 min 30 s (A2)
>> ##   - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
>> ##     - findMateAlignment:      1 min 30 s (B1)
>> ##     - other:                  4 min      (B2)
>> ##   Total time:                16 min      (A+B)
>> ## Memory usage: 13 Gb
>>
>> ## With BioC devel (Rsamtools 1.13.19)
>> ## ------------------------------**-----
>> ## Timings:
>> ##   - readGAlignmentsFromBam: 11 min      (A=A1+A2)
>> ##     - scanBam:               6 min      (A1)
>> ##     - other:                 5 min      (A2)
>> ##   - makeGAlignmentPairs:     4 min      (B=B1+B2)
>> ##     - findMateAlignment:     1 min      (B1)
>> ##     - other:                 3 min      (B2)
>> ##   Total time:               15 min      (A+B)
>> ## Memory usage: 13 Gb
>>
>> Indeed, not much difference between release and devel :-/
>>
>> I didn't have the opportunity to do this kind of detailed timing of
>> the readGAlignmentPairsFromBam() function on such a big file before,
>> but I find it pretty interesting (I only benchmarked findMateAlignment()
>> and it was on simulated data).
>>
>> In particular I'm quite surprised to discover that the pairing algo
>> (implemented in findMateAlignment) is actually not the bottleneck
>> (and it being optimized in BioC devel explains the small difference
>> between total time in release and total time in devel).
>>
>> Just a note on the timings you sent earlier (I'm copying them here
>> again):
>>
>>    > system.time(RNAreads <- readBamGappedAlignmentPairs(**file))
>>          user  system elapsed
>>       1852.16   59.00 1939.11
>>    > system.time(RNAreads <- readBamGappedAlignments(file))
>>          user  system elapsed
>>        192.80    8.12  214.97
>>
>> Subtracting the 2 times above to infer the time spent for the pairing
>> is a shortcut that overlooks the following: when
>> readBamGappedAlignmentPairs() calls readBamGappedAlignments()
>> internally, it calls it with 'use.names=TRUE', and also with a 'param'
>> arg that specifies the extra BAM fields, and also with a specific flag
>> filter. All those things are required for the pairing algo. So a more
>> sensible comparison is to compare:
>>
>>    RNAreads <- readBamGappedAlignmentPairs(**file)
>>
>> versus:
>>
>>    flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE,
>> hasUnmappedMate=FALSE)
>>    what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos")
>>    param0 <- ScanBamParam(flag=flag0, what=what0)
>>    RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0)
>>
>> This internal call to readBamGappedAlignments() takes more than 70%
>> of the total time of readBamGappedAlignmentPairs() (11 min / 15 min
>> with BioC devel on my Linux server). This is a little bit surprising
>> to me and that's something I'd like to investigate.
>>
>> Cheers,
>> H.
>>
>>
>> On 07/30/2013 05:00 PM, Dario Strbenac wrote:
>>
>>> Hi.
>>>
>>> The files are
>>>
>>> http://www.maths.usyd.edu.au/**u/dario/RNAAlignedSorted.bam<http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam>
>>> http://www.maths.usyd.edu.au/**u/dario/RNAAlignedSorted.bam.**bai<http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai>
>>>
>>> While you have them, could you also investigate the granges coercion
>>> function
>>> ? It also took half an hour.
>>>
>>> ______________________________**__________
>>> From: Hervé Pagès [hpa...@fhcrc.org]
>>> Sent: Tuesday, 30 July 2013 3:29 AM
>>> To: Dario Strbenac
>>> Cc: bioc-devel@r-project.org
>>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>>
>>> On 07/29/2013 12:00 AM, Dario Strbenac wrote:
>>>
>>>> Because I only allowed unique mappings, the ratio is 2. After
>>>> installing the
>>>> development package versions, I only got a 10% improvement.
>>>>
>>>
>>> Mmmh that's disappointing. Can you put the file somewhere so I can have
>>> a look? Thanks.
>>>
>>> H.
>>>
>>>
>>>>      user  system elapsed
>>>> 1681.36   65.79 1752.10
>>>>
>>>> ______________________________**__________
>>>> From: Pages, Herve [hpa...@fhcrc.org]
>>>> Sent: Monday, 29 July 2013 3:29 PM
>>>> To: Dario Strbenac
>>>> Cc: bioc-devel@r-project.org
>>>> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>>>
>>>> Hi Dario,
>>>>
>>>> The function was optimized in Bioc-devel. Depending on the number
>>>> of records in your BAM file (which is more relevant than its size),
>>>> it should now be between 2x and 5x faster. Please give it a try and
>>>> let us know.
>>>>
>>>> Also keep in mind that the pairing algo will slow down if
>>>> the "average nb of records per unique QNAME" is 3 or more.
>>>> This was discussed here last month:
>>>>     https://stat.ethz.ch/**pipermail/bioconductor/2013-**
>>>> June/053390.html<https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html>
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au>
>>>> To: bioc-devel@r-project.org
>>>> Sent: Sunday, July 28, 2013 9:00:30 PM
>>>> Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs
>>>>
>>>> Hello,
>>>>
>>>> Could readBamGappedAlignmentPairs be optimised ? It's surprising that it
>>>> takes an extra 29 minutes to calculate the pairing information, for the
>>>> example below. The BAM file is 4 GB in size.
>>>>
>>>>  system.time(RNAreads <- readBamGappedAlignmentPairs(**file))
>>>>>
>>>>      user  system elapsed
>>>> 1852.16   59.00 1939.11
>>>>
>>>>> system.time(RNAreads <- readBamGappedAlignments(file))
>>>>>
>>>>      user  system elapsed
>>>>    192.80    8.12  214.97
>>>>
>>>> ------------------------------**--------
>>>> Dario Strbenac
>>>> PhD Student
>>>> University of Sydney
>>>> Camperdown NSW 2050
>>>> Australia
>>>>
>>>> ______________________________**_________________
>>>> Bioc-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>
>>>>
>>>> ______________________________**_________________
>>>> Bioc-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>
>>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpa...@fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>>
>>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
> ______________________________**_________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to