That particular machine has been up for 40 days, although all the parameters are in the green right now.
Doing an rbind is as quick as what you reported:

> system.time(df2 <- rbind(df, df))
   user  system elapsed
  0.104   0.000   0.104

And now I do get the GAlignments warning:

> GappedAlignments()
GAlignments with 0 alignments and 0 metadata columns:
   seqnames strand       cigar    qwidth     start       end     width
      <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
       ngap
  <integer>
  ---
  seqlengths:
Warning message:
The GappedAlignments class, the GappedAlignments() constructor, and the
readGappedAlignments() function, have been renamed: GAlignments, GAlignments(),
and readGAlignments(), respectively. The old names are deprecated. Please use
the new names instead.

And the appending works as for you:

> library(Rsamtools)
> library(RNAseqData.HNRNPC.bam.chr14)
> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
> yieldSize(bamfile) <- 100000L
> open(bamfile)
> out <- GappedAlignments()
Warning message:
The GappedAlignments class, the GappedAlignments() constructor, and the
readGappedAlignments() function, have been renamed: GAlignments, GAlignments(),
and readGAlignments(), respectively. The old names are deprecated. Please use
the new names instead.
> chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
> system.time(out <- append(out, chunk))
   user  system elapsed
  0.092   0.000   0.091
> chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
> system.time(out <- append(out, chunk))
   user  system elapsed
  0.372   0.000   0.369
> chunk <- readBamGappedAlignments(bamfile)
Warning message:
'readBamGappedAlignments' is deprecated.
Use 'readGAlignmentsFromBam' instead.
See help("Deprecated")
> system.time(out <- append(out, chunk))
   user  system elapsed
  0.896   0.012   0.909

And the sessionInfo is as before:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomicRanges_1.13.26 Biostrings_2.29.12    XVector_0.1.0
[4] IRanges_1.19.15       BiocGenerics_0.7.2

loaded via a namespace (and not attached):
[1] stats4_3.0.1

So I'm not sure what happened; so far, I can only imagine an NFS- or
RAID-related issue. Doing it with my own data gives the same results as above.
Sorry for bothering you with that, and many thanks for the help.

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Jul 11, 2013, at 10:15 AM, Nicolas Delhomme wrote:

> Hej Hervé!
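For reference, the deprecation warnings quoted above already name the replacement API. A minimal sketch of the same chunked check written with the new names; this assumes a Bioc-devel session of that vintage, where readGAlignmentsFromBam() is exported by Rsamtools and GAlignments() replaces the GappedAlignments() constructor:

    library(Rsamtools)
    library(RNAseqData.HNRNPC.bam.chr14)

    bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
    yieldSize(bamfile) <- 100000L
    open(bamfile)

    out <- GAlignments()                      # new name for GappedAlignments()
    chunk <- readGAlignmentsFromBam(bamfile)  # new name for readBamGappedAlignments()
    system.time(out <- append(out, chunk))

    close(bamfile)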
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delho...@embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
> On Jul 10, 2013, at 8:54 PM, Hervé Pagès wrote:
>
>> Hi Nico,
>>
>> On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
>>> Hej Bioc Core!
>>>
>>> There was some discussion last year about implementing a BamStreamer (à la
>>> FastqStreamer), but I haven't seen anything like it in the current devel.
>>> I've implemented the following function that should do the job for me - I
>>> have many very large files, and I need to use a cluster with relatively
>>> little RAM per node and a restrictive time allocation, so I want to
>>> parallelize the reading of the BAM files to manage both. The example below
>>> obviously does not address the RAM issue, but I streamlined it to point
>>> out my problem.
>>>
>>> ".stream" <- function(bamFile, yieldSize = 100000, verbose = FALSE){
>>>
>>>     ## create a stream
>>>     stopifnot(is(bamFile, "BamFile"))
>>>
>>>     ## set the yieldSize if it is not set already
>>>     if(is.na(yieldSize(bamFile))){
>>>         yieldSize(bamFile) <- yieldSize
>>>     }
>>>
>>>     ## open it
>>>     open(bamFile)
>>>
>>>     ## verb
>>>     if(verbose){
>>>         message(paste("Streaming", basename(path(bamFile))))
>>>     }
>>>
>>>     ## create the output
>>>     out <- GappedAlignments()
>>>
>>>     ## process it
>>>     while(length(chunk <- readBamGappedAlignments(bamFile))){
>>>         if(verbose){
>>>             message(paste("Processed", length(chunk), "reads"))
>>>         }
>>>         out <- c(out, chunk)
>>>     }
>>
>> Note that regardless of the speed of c() on GappedAlignments objects,
>> growing an object in a loop is fundamentally inefficient (see Circle 2
>> of The R Inferno).
>> Also, keeping the chunks in memory kind of defeats the purpose of reading
>> the file one chunk at a time.
>
> Sure. What this function normally really does is a data reduction - basically
> getting a named vector back. I just came across the appending issue when
> preparing the code example above.
>
>>
>>>
>>>     ## close
>>>     close(bamFile)
>>>
>>>     ## return
>>>     return(out)
>>> }
>>>
>>> In the method above, the first iteration of combining the GappedAlignments:
>>>
>>>     out <- c(out, chunk)
>>>
>>> takes:
>>>
>>>     system.time(append(out, chunk))
>>>        user  system elapsed
>>>     123.704   0.060 124.011
>>
>> 2 minutes! Whaoo, that's really slow. I can't reproduce this on my
>> machine though:
>>
>
> OK, sounds more like a system issue then.
>
>> library(Rsamtools)
>> library(RNAseqData.HNRNPC.bam.chr14)
>> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
>> yieldSize(bamfile) <- 100000L
>> open(bamfile)
>> out <- GappedAlignments()
>>
>> Then:
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.284   0.000   0.286
>>
>> I wonder what's going on on your system. Are you sure it was not running
>> out of memory when you did this?
>
> Yes, that's a fat node with 0.2 TB of RAM and I was the only one on it at the
> time.
>
>> Try to check the load with uptime or
>> top in another terminal (e.g. start top right before you call append()).
>> If the system starts swapping, then your R process will become hundreds
>> or thousands of times slower!
>
> and there was no memory-intensive job running. It could still have been some
> NFS-related issue. I will retry with a fresh session and monitor the I/O as
> well.
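As a sketch of the reduce-as-you-go pattern described in the exchange above (the real function does a data reduction and returns a named vector), here is a hypothetical .streamCount() that collapses each chunk to per-chromosome read counts before the next chunk is read, so memory stays bounded by the yieldSize. The function name and the choice of reduction are made up for illustration; it keeps the then-current readBamGappedAlignments() to match the code above, and assumes table() on the seqnames Rle, which IRanges provides:

    ".streamCount" <- function(bamFile, yieldSize = 100000, verbose = FALSE){

        stopifnot(is(bamFile, "BamFile"))

        ## set the yieldSize if it is not set already
        if(is.na(yieldSize(bamFile))){
            yieldSize(bamFile) <- yieldSize
        }

        ## open it, and make sure it gets closed again
        open(bamFile)
        on.exit(close(bamFile))

        ## one counter per reference sequence, taken from the BAM header
        sn <- seqnames(seqinfo(bamFile))
        counts <- setNames(integer(length(sn)), sn)

        ## reduce every chunk to a small named vector, then let the chunk go
        while(length(chunk <- readBamGappedAlignments(bamFile))){
            if(verbose){
                message(paste("Processed", length(chunk), "reads"))
            }
            tab <- table(seqnames(chunk))
            counts[names(tab)] <- counts[names(tab)] + as.integer(tab)
        }

        ## the reduced result: a named vector, not the alignments themselves
        counts
    }

With this shape, the peak memory use is roughly one chunk of yieldSize alignments, whatever reduction is plugged in.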
>
>>
>>>
>>> whereas the second iteration (faked here) takes only (still long):
>>>
>>>     system.time(append(chunk, chunk))
>>>        user  system elapsed
>>>       2.708   0.044   2.758
>>
>> 2nd, 3rd and 4th iterations for me:
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.516   0.004   0.521
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.656   0.008   0.663
>>
>>> chunk <- readBamGappedAlignments(bamfile)
>>> system.time(out <- append(out, chunk))
>>    user  system elapsed
>>   0.796   0.004   0.801
>>
>> As expected, the time is growing (this is why the process
>> of growing an object in a loop is considered to be quadratic
>> in time).
>
> Quadratic! Wow, I knew it was slower, but still... Good to know.
>
>>
>>>
>>> I suppose this has to do with the way
>>> GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the
>>> objects and all the related sanity checks. For the first iteration, the
>>> seqlengths are different, so I suppose that is what explains the 60x lag
>>> compared to the second iteration.
>>
>> The seqinfo of the 2 objects to combine needs to be merged
>> and set back on each object before the 2 objects can actually
>> be combined. This operation is cheap and I wouldn't expect it
>> to slow down the first iteration significantly.
>
> Yes, that was very surprising.
>
>>
>>> Due to the implementation of GappedAlignments, I can't set the seqlengths
>>> programmatically in GappedAlignments(), which I imagine would have reduced
>>> the first-iteration lag; see the trials below:
>>>
>>>     out <- GappedAlignments(seqlengths=seqlengths(chunk))
>>>
>>>     Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
>>>       'names(seqlengths)' incompatible with 'levels(seqnames)'
>>>
>>>     out <- GappedAlignments(seqlengths=seqlengths(chunk),
>>>                             seqnames=seqnames(chunk))
>>>
>>>     Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames =
>>>       seqnames(chunk)) :
>>>       'strand' must be specified when 'seqnames' is not empty
>>>
>>>     out <- GappedAlignments(seqlengths=seqlengths(chunk),
>>>                             seqnames=seqnames(chunk), strand="+")
>>>
>>>     Error in validObject(.Object) :
>>>       invalid class “GappedAlignments” object: 1: invalid object for slot
>>>       "strand" in class "GappedAlignments": got class "character", should
>>>       be or extend class "Rle"
>>>       invalid class “GappedAlignments” object: 2: number of rows in
>>>       DataTable 'mcols(x)' must match length of 'x'
>>
>> The trick is to create an empty GappedAlignments object
>> with non-empty seqlevels so you can put seqlengths on the
>> seqlevels.
>>
>> Here are 2 ways to create an empty GappedAlignments object with
>> non-empty seqlevels:
>>
>> (1) Pass an empty factor with non-empty levels to the seqnames
>>     arg:
>>
>>       out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
>>
>> (2) The recommended way:
>>
>>       out <- GappedAlignments()
>>       seqinfo(out) <- seqinfo(chunk)
>>
>> Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
>> its seqlengths), not only its seqlevels.
>>
>> (1) could be adapted to also set the seqlengths:
>>
>>       out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
>>                               seqlengths=seqlengths(chunk))
>>
>> but (2) is really the preferred way.
>
> Thanks for the pointers!
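When the complete object really is needed at the end, the usual way around the quadratic growth discussed above is to collect the chunks in a list and concatenate once after the loop. A sketch, assuming c() accepts several GappedAlignments objects at once (the same method the per-iteration append dispatches to):

    chunks <- list()
    while(length(chunk <- readBamGappedAlignments(bamFile))){
        chunks[[length(chunks) + 1L]] <- chunk
    }

    ## a single concatenation instead of one per iteration
    out <- if(length(chunks)) do.call(c, chunks) else GappedAlignments()

This still keeps every chunk in memory at once, so it only pays off when the final object itself is the goal; for a per-chunk reduction, the streaming approach sketched earlier avoids holding more than one chunk.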
>
>>
>>>
>>> I completely approve of such sanity checks; it seems that I'm just trying
>>> to do something that it was not designed for :-) All I'm really interested
>>> in is a way to stream my BAM file, and I'm looking forward to any
>>> suggestion. I especially don't want to re-invent the wheel if you have
>>> already planned something. If you haven't, I'd be glad to get some insight
>>> into how I can work around that problem.
>>>
>>> My sessionInfo:
>>>
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12
>>> [4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15
>>> [7] BiocGenerics_0.7.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0
>>
>> Looks like you are using Bioc-devel. Did you get all the
>> warnings about GappedAlignments, readBamGappedAlignments(),
>> and GappedAlignments() being deprecated?
>
> I thought I did, but indeed I didn't get the warnings then. This is very
> strange.
>
>> I thought you were using the release, so that's what I used:
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
>> [3] Biostrings_2.28.0                 GenomicRanges_1.12.4
>> [5] IRanges_1.18.1                    BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5   stats4_3.0.0   zlibbioc_1.6.0
>>
>> The timings I get with Bioc-devel are pretty much the same though.
>>
>> Something doesn't seem to be quite right with your cluster.
>
> I agree, I'll check that out.
>
>> What happens
>> if you try to rbind() 2 data.frames of 100000 rows each in a fresh
>> session?
>>
>>> df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
>>> system.time(df2 <- rbind(df, df))
>>    user  system elapsed
>>   0.204   0.000   0.206
>>
>
> Good point. I'll try that out and let you know.
>
> Thanks for the very detailed answer!
>
> Cheers,
>
> Nico
>
>
>> Thanks,
>> H.
>>
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delho...@embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>>
>>> _______________________________________________
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpa...@fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel