On 08/02/2013 07:33 AM, Michael Lawrence wrote:
This has come up several times I think, and I'm totally naive, but what about
making R smart enough to do that itself? One nice thing about the vectorization
is that R knows how many CHARSXPs it needs to allocate up-front, or at least an
upper-bound when they're all unique. Is the issue that the R heap needs to be
initialized in a special way and its behavior can't change at run-time? It's a
tricky thing to get right, I'm sure.

Currently, R allocates space for a STRSXP containing n = 55780092/2 elements, but for all R knows these could all be identical and no CHARSXP's are allocated; I don't think there's a way to pre-allocate n CHARSXPs. As the elements are filled, R sometimes needs to do garbage collection, and it has the increasingly onerous task of checking each CHARSXP to see whether it is in use (I think -- it seems like an in-use STRSXP might somehow be used to short-circuit this?). Maybe it would be helpful to be able to pre-allocate space for n unique CHARSXPs.

It seems like R_GC_MEM_GROW could be made accessible via C calls, but presumably this is not a good thing for user code to be entrusted with.

Martin



On Fri, Aug 2, 2013 at 4:24 AM, Martin Morgan <mtmor...@fhcrc.org
<mailto:mtmor...@fhcrc.org>> wrote:

    On 07/31/2013 04:05 PM, Hervé Pagès wrote:

        Hi Dario,

        Thanks for providing the file.

        The file has 55780092 records in it and yes it only contains primary
        alignments so this is the best situation for the pairing algorithm.

        Here are the timings I get for the readBamGappedAlignmentPairs()
        function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit
        Ubuntu server with 384 Gb of RAM:

        ## With BioC release (Rsamtools 1.12.3)
        ## ------------------------------__------
        ## Timings:
        ##   - readBamGappedAlignments: 10 min 30 s (A=A1+A2)
        ##     - scanBam:                6 min      (A1)


    I haven't looked at Dario's data directly, but a surprising bottleneck is
    the creation of large character vectors

    $ R --vanilla
    [...]
     >  i = seq_len(55780092/2)
     > system.time(as.character(i))
        user  system elapsed
      74.880   0.408  75.451

    This can be alleviated by telling R that that you're likely to need lots of
    memory up front; I alias R so that

    $ R --min-vsize=2048M --min-nsize=20M --vanilla
    [...]
     > i = seq_len(55780092/2)
     > system.time(as.character(i))
        user  system elapsed
      25.269   0.472  25.796

    or

    $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla
    [...]
     > i = seq_len(55780092/2)
     > system.time(as.character(i))
        user  system elapsed
      12.196   0.464  12.687

    These are documented on ?Memory and with

     > R.version.string
    [1] "R version 3.0.1 Patched (2013-06-05 r62876)"


    Martin


        ##     - other:                  4 min 30 s (A2)
        ##   - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2)
        ##     - findMateAlignment:      1 min 30 s (B1)
        ##     - other:                  4 min      (B2)
        ##   Total time:                16 min      (A+B)
        ## Memory usage: 13 Gb

        ## With BioC devel (Rsamtools 1.13.19)
        ## ------------------------------__-----
        ## Timings:
        ##   - readGAlignmentsFromBam: 11 min      (A=A1+A2)
        ##     - scanBam:               6 min      (A1)
        ##     - other:                 5 min      (A2)
        ##   - makeGAlignmentPairs:     4 min      (B=B1+B2)
        ##     - findMateAlignment:     1 min      (B1)
        ##     - other:                 3 min      (B2)
        ##   Total time:               15 min      (A+B)
        ## Memory usage: 13 Gb

        Indeed, not much difference between release and devel :-/

        I didn't have the opportunity to do this kind of detailed timing of
        the readGAlignmentPairsFromBam() function on such a big file before,
        but I find it pretty interesting (I only benchmarked findMateAlignment()
        and it was on simulated data).

        In particular I'm quite surprised to discover that the pairing algo
        (implemented in findMateAlignment) is actually not the bottleneck
        (and it being optimized in BioC devel explains the small difference
        between total time in release and total time in devel).

        Just a note on the timings you sent earlier (I'm copying them here
        again):

            > system.time(RNAreads <- readBamGappedAlignmentPairs(__file))
                  user  system elapsed
               1852.16   59.00 1939.11
            > system.time(RNAreads <- readBamGappedAlignments(file))
                  user  system elapsed
                192.80    8.12  214.97

        Subtracting the 2 times above to infer the time spent for the pairing
        is a shortcut that overlooks the following: when
        readBamGappedAlignmentPairs() calls readBamGappedAlignments()
        internally, it calls it with 'use.names=TRUE', and also with a 'param'
        arg that specifies the extra BAM fields, and also with a specific flag
        filter. All those things are required for the pairing algo. So a more
        sensible comparison is to compare:

            RNAreads <- readBamGappedAlignmentPairs(__file)

        versus:

            flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE,
        hasUnmappedMate=FALSE)
            what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", 
"mpos")
            param0 <- ScanBamParam(flag=flag0, what=what0)
            RNAreads <- readBamGappedAlignments(file, use.names=TRUE, 
param=param0)

        This internal call to readBamGappedAlignments() takes more than 70%
        of the total time of readBamGappedAlignmentPairs() (11 min / 15 min
        with BioC devel on my Linux server). This is a little bit surprising
        to me and that's something I'd like to investigate.

        Cheers,
        H.


        On 07/30/2013 05:00 PM, Dario Strbenac wrote:

            Hi.

            The files are

            http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam
            <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam>
            http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam.__bai
            <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai>

            While you have them, could you also investigate the granges coercion
            function
            ? It also took half an hour.

            __________________________________________
            From: Hervé Pagès [hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>]
            Sent: Tuesday, 30 July 2013 3:29 AM
            To: Dario Strbenac
            Cc: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org>
            Subject: Re: [Bioc-devel] Running Time of 
readBamGappedAlignmentPairs

            On 07/29/2013 12:00 AM, Dario Strbenac wrote:

                Because I only allowed unique mappings, the ratio is 2. After
                installing the
                development package versions, I only got a 10% improvement.


            Mmmh that's disappointing. Can you put the file somewhere so I can 
have
            a look? Thanks.

            H.


                      user  system elapsed
                1681.36 65.79 1752.10 <tel:65.79%201752.10>

                __________________________________________
                From: Pages, Herve [hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>]
                Sent: Monday, 29 July 2013 3:29 PM
                To: Dario Strbenac
                Cc: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org>
                Subject: Re: [Bioc-devel] Running Time of
                readBamGappedAlignmentPairs

                Hi Dario,

                The function was optimized in Bioc-devel. Depending on the 
number
                of records in your BAM file (which is more relevant than its 
size),
                it should now be between 2x and 5x faster. Please give it a try 
and
                let us know.

                Also keep in mind that the pairing algo will slow down if
                the "average nb of records per unique QNAME" is 3 or more.
                This was discussed here last month:
                
https://stat.ethz.ch/__pipermail/bioconductor/2013-__June/053390.html
                
<https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html>

                Cheers,
                H.


                ----- Original Message -----
                From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au
                <mailto:dstr7...@uni.sydney.edu.au>>
                To: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org>
                Sent: Sunday, July 28, 2013 9:00:30 PM
                Subject: [Bioc-devel] Running Time of 
readBamGappedAlignmentPairs

                Hello,

                Could readBamGappedAlignmentPairs be optimised ? It's surprising
                that it
                takes an extra 29 minutes to calculate the pairing information,
                for the
                example below. The BAM file is 4 GB in size.

                    system.time(RNAreads <- readBamGappedAlignmentPairs(__file))

                      user  system elapsed
                1852.16   59.00 1939.11

                    system.time(RNAreads <- readBamGappedAlignments(file))

                      user  system elapsed
                    192.80    8.12  214.97

                ------------------------------__--------
                Dario Strbenac
                PhD Student
                University of Sydney
                Camperdown NSW 2050
                Australia

                _________________________________________________
                Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


                _________________________________________________
                Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


            --
            Hervé Pagès

            Program in Computational Biology
            Division of Public Health Sciences
            Fred Hutchinson Cancer Research Center
            1100 Fairview Ave. N, M1-B514
            P.O. Box 19024
            Seattle, WA 98109-1024

            E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
            Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
            Fax: (206) 667-1319 <tel:%28206%29%20667-1319>





    --
    Computational Biology / Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N.
    PO Box 19024 Seattle, WA 98109

    Location: Arnold Building M1 B861
    Phone: (206) 667-2793 <tel:%28206%29%20667-2793>


    _________________________________________________
    Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to