The best way to process large files is in chunks, using BamFile(..., yieldSize = ...) and ScanBamParam() to select just the components of the BAM files that are of interest. The number of cores is basically irrelevant for input -- you'll be using just one -- so choose yieldSize to use a modest amount of memory for the primary data, e.g., 4 GB per file, and process each file separately.
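As a minimal sketch of such a restricted read (the file name, fields, and chunk size here are placeholders, not from the thread; for the methylation use case described below one would presumably want SEQ, QUAL, and the XM tag):

    library(Rsamtools)

    ## Hypothetical example: read one chunk, keeping only the fields needed,
    ## here the read sequence, base qualities, and the XM methylation tag.
    fl <- "sample.bam"                              # placeholder path
    param <- ScanBamParam(what = c("seq", "qual"), tag = "XM")
    bf <- open(BamFile(fl, yieldSize = 1000000))
    chunk <- scanBam(bf, param = param)[[1]]        # at most 'yieldSize' records
    close(bf)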
Figure out the iteration-by-chunk solution for one file; the simplest example is in ?Rsamtools::BamFile:

    ## Use 'yieldSize' to iterate through a file in chunks.
    bf <- open(BamFile(fl, yieldSize = 1000))
    while (nrec <- length(scanBam(bf)[[1]][[1]]))
        cat("records:", nrec, "\n")
    close(bf)

but you'd likely want the convenience of GenomicAlignments::readGAlignments() / readGAlignmentPairs(). Once this is working, write it as a proper function, specifying all packages required for it to run, e.g.,

    fun <- function(fl, yieldSize) {
        library(Rsamtools)
        nrec <- 0L
        bf <- open(BamFile(fl, yieldSize = yieldSize))
        repeat {
            len <- length(scanBam(bf)[[1]][[1]])
            if (len == 0L)
                break
            nrec <- nrec + len
        }
        close(bf)
        nrec
    }

Try to minimize the size of the inputs (here just the file name) and the outputs (nrec, a single integer), perhaps using the file system to temporarily store large results. Use BiocParallel::bplapply() to apply this to all files:

    bplapply(fls, fun, yieldSize = 1000000)

I would actually recommend BiocParallel::SnowParam() (separate processes) because (a) it enforces the discipline that the function does not rely implicitly on the state of the parent process and (b) it works on all operating systems and eases the transition to, e.g., an HPC cluster. The fixed cost of starting a separate process for each file is outweighed by the time spent processing the file in that process. GenomicFiles::reduceByYield() or reduceByFile() might be relevant.

I am not totally current (others on this list probably know more), but I don't think OpenMP is supported on macOS (https://mac.r-project.org/openmp/), so it would be a poor choice at the C level if cross-platform utility were important. If it were me, and again I do not have enough recent experience, I might aim for Intel Threading Building Blocks, using RcppParallel for inspiration.

Martin

From: Oleksii Nikolaienko <oleksii.nikolaie...@gmail.com>
Date: Tuesday, May 25, 2021 at 6:28 PM
To: Martin Morgan <mtmorgan.b...@gmail.com>
Cc: "bioc-devel@r-project.org" <bioc-devel@r-project.org>
Subject: Re: [Bioc-devel] C++ parallel computing

Hi Martin,
thanks for your answer. The goal is to speed up my package (epialleleR), where most of the functions are already written in C++, but the code is single-threaded. The tasks include: applying an analog of GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings; calculating per-read methylation beta values; and creating cytosine methylation reports with prefiltering of sequence reads. I could probably parallelize all of them at the level of R, but even then I might like to use OpenMP SIMD directives. And yes, the plan is to use Rhtslib. The current backend for reading BAM is Rsamtools; however, I believe I could speed things up significantly by avoiding unnecessary type conversions and cutting other corners. It doesn't hurt much when the BAM file is smaller than 1 GB, but for a 20-40 GB file loading takes more than an hour (on a 24-core, 378 GB RAM workstation).

Best,
Oleksii

On Tue, 25 May 2021 at 19:39, Martin Morgan <mtmorgan.b...@gmail.com> wrote:

If the BAM files are each processed independently, and each processing task takes a while, then it is probably 'good enough' to use R-level parallel evaluation with BiocParallel (currently the recommendation for Bioconductor packages; see the sketch below) or another evaluation framework. Also, presumably you will use Rhtslib, which provides C-level access to the hts library. This will require writing C / C++ code to interface between R and the hts library, and will of course be a significant undertaking.
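A minimal sketch of that R-level pattern, tying the per-file function 'fun' from above to a SnowParam back-end (the directory and worker count are hypothetical):

    library(BiocParallel)

    ## Hypothetical file set; SnowParam runs each worker in a separate
    ## process, so 'fun' must load its own packages, as above.
    fls <- dir("/path/to/bams", pattern = "\\.bam$", full.names = TRUE)
    res <- bplapply(fls, fun, yieldSize = 1000000,
                    BPPARAM = SnowParam(workers = 4))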
It might be worth outlining in a bit more detail what your task is and how (not in too much detail!) you've tried to implement this in Rsamtools.

Martin Morgan

On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko" <bioc-devel-boun...@r-project.org on behalf of oleksii.nikolaie...@gmail.com> wrote:

Dear Bioc team,
I'd like to ask for your advice on parallelization within a Bioc package. Please point me to a better place if this mailing list is not appropriate. After a bit of thinking, I decided that I'd like to parallelize processing at the level of the C++ code. Would you strongly recommend against that, in favour of an R approach (e.g. "future")? If parallel C++ is ok, what would be the best solution for all major OSes? My initial choice was OpenMP, but it seems that Apple has something against it (https://mac.r-project.org/openmp/). My own dev environment is mostly Big Sur/ARM64, but I wouldn't want to drop support for it anyway. (On the actual task: loading and specific processing of very large BAM files, ideally significantly faster than by means of Rsamtools as a backend.)

Best,
Oleksii Nikolaienko

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel