The best way to process large files is in chunks, using BamFile(..., yieldSize = ...) and ScanBamParam() to select just the components of the BAM files that are of interest. The number of cores is basically irrelevant for input -- you'll be using just one -- so choose yieldSize to use a modest amount of memory for the primary data, e.g., 4 GB per file, and process each file separately.
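As a minimal sketch of such a restricted read (the file name, fields, and chunk size here are placeholders, not from the thread; for the methylation use case described below one would presumably want SEQ, QUAL, and the XM tag):

    library(Rsamtools)

    ## Hypothetical example: read one chunk, keeping only the fields needed,
    ## here the read sequence, base qualities, and the XM methylation tag.
    fl <- "sample.bam"                              # placeholder path
    param <- ScanBamParam(what = c("seq", "qual"), tag = "XM")
    bf <- open(BamFile(fl, yieldSize = 1000000))
    chunk <- scanBam(bf, param = param)[[1]]        # at most 'yieldSize' records
    close(bf)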
Figure out the iteration-by-chunk solution for one file; the simplest example is in ?Rsamtools::BamFile:

    ## Use 'yieldSize' to iterate through a file in chunks.
    bf <- open(BamFile(fl, yieldSize = 1000))
    while (nrec <- length(scanBam(bf)[[1]][[1]]))
        cat("records:", nrec, "\n")
    close(bf)

but you'd likely want the convenience of GenomicAlignments::readGAlignments() / readGAlignmentPairs(). Once this is working, write it as a proper function, specifying all packages required for it to run, e.g.,

    fun <- function(fl, yieldSize) {
        library(Rsamtools)
        nrec <- 0L
        bf <- open(BamFile(fl, yieldSize = yieldSize))
        repeat {
            len <- length(scanBam(bf)[[1]][[1]])
            if (len == 0L)
                break
            nrec <- nrec + len
        }
        close(bf)
        nrec
    }

Try to minimize the size of the inputs (here just the file name) and the outputs (nrec, a single integer), perhaps using the file system to temporarily store large results. Use BiocParallel::bplapply() to apply this to all files:

    bplapply(fls, fun, yieldSize = 1000000)

I would actually recommend BiocParallel::SnowParam() (separate processes) because (a) it enforces the discipline that the function does not rely implicitly on the state of the parent process and (b) it works on all operating systems and eases the transition to, e.g., an HPC cluster. The fixed cost of starting a separate process for each file is outweighed by the time spent processing the file in that process. GenomicFiles::reduceByYield() or reduceByFile() might be relevant.

I am not totally current (others on this list probably know more), but I don't think OpenMP is supported on macOS (https://mac.r-project.org/openmp/), so it would be a poor choice at the C level if cross-platform utility were important. If it were me, and again I do not have enough recent experience, I might aim for Intel Threading Building Blocks, using RcppParallel for inspiration.

Martin

From: Oleksii Nikolaienko <oleksii.nikolaie...@gmail.com>
Date: Tuesday, May 25, 2021 at 6:28 PM
To: Martin Morgan <mtmorgan.b...@gmail.com>
Cc: "bioc-devel@r-project.org" <bioc-devel@r-project.org>
Subject: Re: [Bioc-devel] C++ parallel computing

Hi Martin,
thanks for your answer. The goal is to speed up my package (epialleleR), where most of the functions are already written in C++, but the code is single-threaded. The tasks include: applying an analog of GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings; calculating per-read methylation beta values; and creating cytosine methylation reports with prefiltering of sequence reads. I could probably parallelize all of them at the level of R, but even then I might like to use OpenMP SIMD directives. And yes, the plan is to use Rhtslib. The current backend for reading BAM is Rsamtools; however, I believe I could speed things up significantly by avoiding unnecessary type conversions and cutting other corners. It doesn't hurt much when the BAM file is smaller than 1 GB, but for a 20-40 GB file loading takes more than an hour (on a 24-core, 378 GB RAM workstation).

Best,
Oleksii

On Tue, 25 May 2021 at 19:39, Martin Morgan <mtmorgan.b...@gmail.com> wrote:

If the BAM files are each processed independently, and each processing task takes a while, then it is probably 'good enough' to use R-level parallel evaluation with BiocParallel (currently the recommendation for Bioconductor packages; see the sketch below) or another evaluation framework. Also, presumably you will use Rhtslib, which provides C-level access to the hts library. This will require writing C / C++ code to interface between R and the hts library, and will of course be a significant undertaking.
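A minimal sketch of that R-level pattern, tying the per-file function 'fun' from above to a SnowParam back-end (the directory and worker count are hypothetical):

    library(BiocParallel)

    ## Hypothetical file set; SnowParam runs each worker in a separate
    ## process, so 'fun' must load its own packages, as above.
    fls <- dir("/path/to/bams", pattern = "\\.bam$", full.names = TRUE)
    res <- bplapply(fls, fun, yieldSize = 1000000,
                    BPPARAM = SnowParam(workers = 4))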
It might be worth outlining in a bit more detail what your task is and how (not in too much detail!) you've tried to implement this in Rsamtools.

Martin Morgan

On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko" <bioc-devel-boun...@r-project.org on behalf of oleksii.nikolaie...@gmail.com> wrote:

Dear Bioc team,
I'd like to ask for your advice on parallelization within a Bioc package. Please point me to a better place if this mailing list is not appropriate. After a bit of thinking, I decided that I'd like to parallelize processing at the level of the C++ code. Would you strongly recommend against that, in favour of an R approach (e.g. "future")? If parallel C++ is ok, what would be the best solution for all major OSes? My initial choice was OpenMP, but it seems that Apple has something against it (https://mac.r-project.org/openmp/). My own dev environment is mostly Big Sur/ARM64, but I wouldn't want to drop support for it anyway. (On the actual task: loading and specific processing of very large BAM files, ideally significantly faster than by means of Rsamtools as a backend.)

Best,
Oleksii Nikolaienko

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel