On 08/02/2013 07:33 AM, Michael Lawrence wrote:
This has come up several times I think, and I'm totally naive, but what about making R smart enough to do that itself? One nice thing about the vectorization is that R knows how many CHARSXPs it needs to allocate up-front, or at least an upper-bound when they're all unique. Is the issue that the R heap needs to be initialized in a special way and its behavior can't change at run-time? It's a tricky thing to get right, I'm sure.
Currently, R allocates space for a STRSXP containing n = 55780092/2 elements, but for all R knows these could all be identical and no CHARSXP's are allocated; I don't think there's a way to pre-allocate n CHARSXPs. As the elements are filled, R sometimes needs to do garbage collection, and it has the increasingly onerous task of checking each CHARSXP to see whether it is in use (I think -- it seems like an in-use STRSXP might somehow be used to short-circuit this?). Maybe it would be helpful to be able to pre-allocate space for n unique CHARSXPs.
It seems like R_GC_MEM_GROW could be made accessible via C calls, but presumably this is not a good thing for user code to be entrusted with.
Martin
On Fri, Aug 2, 2013 at 4:24 AM, Martin Morgan <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote: On 07/31/2013 04:05 PM, Hervé Pagès wrote: Hi Dario, Thanks for providing the file. The file has 55780092 records in it and yes it only contains primary alignments so this is the best situation for the pairing algorithm. Here are the timings I get for the readBamGappedAlignmentPairs() function (renamed readGAlignmentPairsFromBam() in devel) on a 64-bit Ubuntu server with 384 Gb of RAM: ## With BioC release (Rsamtools 1.12.3) ## ------------------------------__------ ## Timings: ## - readBamGappedAlignments: 10 min 30 s (A=A1+A2) ## - scanBam: 6 min (A1) I haven't looked at Dario's data directly, but a surprising bottleneck is the creation of large character vectors $ R --vanilla [...] > i = seq_len(55780092/2) > system.time(as.character(i)) user system elapsed 74.880 0.408 75.451 This can be alleviated by telling R that that you're likely to need lots of memory up front; I alias R so that $ R --min-vsize=2048M --min-nsize=20M --vanilla [...] > i = seq_len(55780092/2) > system.time(as.character(i)) user system elapsed 25.269 0.472 25.796 or $ R_GC_MEM_GROW=3 R --min-vsize=2048M --min-nsize=20M --vanilla [...] > i = seq_len(55780092/2) > system.time(as.character(i)) user system elapsed 12.196 0.464 12.687 These are documented on ?Memory and with > R.version.string [1] "R version 3.0.1 Patched (2013-06-05 r62876)" Martin ## - other: 4 min 30 s (A2) ## - makeGappedAlignmentPairs: 5 min 30 s (B=B1+B2) ## - findMateAlignment: 1 min 30 s (B1) ## - other: 4 min (B2) ## Total time: 16 min (A+B) ## Memory usage: 13 Gb ## With BioC devel (Rsamtools 1.13.19) ## ------------------------------__----- ## Timings: ## - readGAlignmentsFromBam: 11 min (A=A1+A2) ## - scanBam: 6 min (A1) ## - other: 5 min (A2) ## - makeGAlignmentPairs: 4 min (B=B1+B2) ## - findMateAlignment: 1 min (B1) ## - other: 3 min (B2) ## Total time: 15 min (A+B) ## Memory usage: 13 Gb Indeed, not much difference between release and devel :-/ I didn't have the opportunity to do this kind of detailed timing of the readGAlignmentPairsFromBam() function on such a big file before, but I find it pretty interesting (I only benchmarked findMateAlignment() and it was on simulated data). In particular I'm quite surprised to discover that the pairing algo (implemented in findMateAlignment) is actually not the bottleneck (and it being optimized in BioC devel explains the small difference between total time in release and total time in devel). Just a note on the timings you sent earlier (I'm copying them here again): > system.time(RNAreads <- readBamGappedAlignmentPairs(__file)) user system elapsed 1852.16 59.00 1939.11 > system.time(RNAreads <- readBamGappedAlignments(file)) user system elapsed 192.80 8.12 214.97 Subtracting the 2 times above to infer the time spent for the pairing is a shortcut that overlooks the following: when readBamGappedAlignmentPairs() calls readBamGappedAlignments() internally, it calls it with 'use.names=TRUE', and also with a 'param' arg that specifies the extra BAM fields, and also with a specific flag filter. All those things are required for the pairing algo. So a more sensible comparison is to compare: RNAreads <- readBamGappedAlignmentPairs(__file) versus: flag0 <- scanBamFlag(isPaired=TRUE, isUnmappedQuery=FALSE, hasUnmappedMate=FALSE) what0 <- c("rname", "strand", "pos", "cigar", "flag", "mrnm", "mpos") param0 <- ScanBamParam(flag=flag0, what=what0) RNAreads <- readBamGappedAlignments(file, use.names=TRUE, param=param0) This internal call to readBamGappedAlignments() takes more than 70% of the total time of readBamGappedAlignmentPairs() (11 min / 15 min with BioC devel on my Linux server). This is a little bit surprising to me and that's something I'd like to investigate. Cheers, H. On 07/30/2013 05:00 PM, Dario Strbenac wrote: Hi. The files are http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam> http://www.maths.usyd.edu.au/__u/dario/RNAAlignedSorted.bam.__bai <http://www.maths.usyd.edu.au/u/dario/RNAAlignedSorted.bam.bai> While you have them, could you also investigate the granges coercion function ? It also took half an hour. __________________________________________ From: Hervé Pagès [hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>] Sent: Tuesday, 30 July 2013 3:29 AM To: Dario Strbenac Cc: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs On 07/29/2013 12:00 AM, Dario Strbenac wrote: Because I only allowed unique mappings, the ratio is 2. After installing the development package versions, I only got a 10% improvement. Mmmh that's disappointing. Can you put the file somewhere so I can have a look? Thanks. H. user system elapsed 1681.36 65.79 1752.10 <tel:65.79%201752.10> __________________________________________ From: Pages, Herve [hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>] Sent: Monday, 29 July 2013 3:29 PM To: Dario Strbenac Cc: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org> Subject: Re: [Bioc-devel] Running Time of readBamGappedAlignmentPairs Hi Dario, The function was optimized in Bioc-devel. Depending on the number of records in your BAM file (which is more relevant than its size), it should now be between 2x and 5x faster. Please give it a try and let us know. Also keep in mind that the pairing algo will slow down if the "average nb of records per unique QNAME" is 3 or more. This was discussed here last month: https://stat.ethz.ch/__pipermail/bioconductor/2013-__June/053390.html <https://stat.ethz.ch/pipermail/bioconductor/2013-June/053390.html> Cheers, H. ----- Original Message ----- From: "Dario Strbenac" <dstr7...@uni.sydney.edu.au <mailto:dstr7...@uni.sydney.edu.au>> To: bioc-devel@r-project.org <mailto:bioc-devel@r-project.org> Sent: Sunday, July 28, 2013 9:00:30 PM Subject: [Bioc-devel] Running Time of readBamGappedAlignmentPairs Hello, Could readBamGappedAlignmentPairs be optimised ? It's surprising that it takes an extra 29 minutes to calculate the pairing information, for the example below. The BAM file is 4 GB in size. system.time(RNAreads <- readBamGappedAlignmentPairs(__file)) user system elapsed 1852.16 59.00 1939.11 system.time(RNAreads <- readBamGappedAlignments(file)) user system elapsed 192.80 8.12 214.97 ------------------------------__-------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia _________________________________________________ Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel> _________________________________________________ Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel> -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 <tel:%28206%29%20667-2793> _________________________________________________ Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list https://stat.ethz.ch/mailman/__listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel