On Fri, Mar 7, 2014 at 8:05 AM, Damian Plichta <damian.plic...@gmail.com> wrote:
> Hi Henrik,
>
> Lowering memory helped - it's drinks on me when we meet.
>
> It has been running for approximately 7 days now ("ETA for unit type
> 'expression': 20140320 23:20:26"). With the updated affxparser can I speed
> it up by increasing the memory burden? The current memory allocation is
> approximately 3 Gbytes. I don't want to cancel the run if it means losing
> the progress though.

PLM fitting is done in chunks of units.  When a new chunk starts, the
estimates of the previous one are guaranteed to have been saved to
disk.  After that point you can interrupt the processing at any time
and safely restart.  All previously processed chunks will be skipped;
only the chunk that was interrupted will have to be redone from
scratch.
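For example (a minimal sketch; 'csPLM' is the ExonRmaPlm model from your
own script below):

  fit(csPLM, verbose=verbose)   # interrupted, e.g. by Ctrl-C
  ## ...restart R, rebuild 'csPLM' exactly as in your script...
  fit(csPLM, verbose=verbose)   # resumes; completed chunks are skipped and
                                # only the interrupted chunk is redone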

When you increase option "memory/ram", the chunks will be bigger, that
is, more units will be processed per chunk.  Given a fixed "memory/ram"
setting, the number of units per chunk goes down as the number of
arrays increases, e.g. doubling the number of arrays will halve the
number of units processed per chunk.
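For instance, your script below sets setOption(aromaSettings,
"memory/ram", 500.0); doubling that value should roughly double the
number of units fitted per chunk.  The value below is only an example -
pick it based on how much RAM you actually have available:

  setOption(aromaSettings, "memory/ram", 1000.0)  # example value only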

Increasing "memory/ram" makes a big difference particularly if there
are only a small number of units per chunk.  The there is a relatively
larger disk I/O overhead of reading probe intensities and storing
parameter estimates.  This is mainly because the file system can
impossibly cache the content of all 1000's arrays, i.e. it reads a few
units of one array, then goes to the next array and so on.  Also, the
more the probes are scattered on the array the more they are also
scattered in the CEL files, meaning when reading those units from one
file, the file system has to "skip" through a large portion of the
file (skipping is cheap, but it is still more efficient to read things
nearby rather than scattered and it is more likely that the file cache
will be successful).  When you increase "memory/ram" you read more
units and therefore you lower the fraction of skipped bytes versus
read ones.  This is what I believe brings the most speedup when
increasing "memory/ram".  Eventually I *think* this payoff will be
relative small and there is little/no longer a need to increase
"memory/ram".

So, yes, you can safely interrupt your script, update affxparser,
increase "memory/ram", restart R and restart your script.  After each
chunk is completed, some timing statistics on read, write, and fitting
overhead are reported in addition to the ETA estimate.  Have a look at
those to see whether changing the settings makes a difference.  Please
report back to share your experience - there are some other user
benchmarks related to this on
http://aroma-project.org/howtos/ImproveProcessingTime
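
For completeness, the restart procedure could look roughly like this (a
sketch only; the Bioconductor install call is the standard one for your
R version, and the new "memory/ram" value is just an example):

  ## In a fresh R session, after interrupting the running script:
  source("http://bioconductor.org/biocLite.R")
  biocLite("affxparser")                      # pick up the updated version
  ## Then, before calling fit() in your script:
  setOption(aromaSettings, "memory/ram", 1000.0)   # example value
  ## Re-run your script as before; completed chunks will be skipped.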

/Henrik


>
> Best,
> Damian
>
>
> On Tuesday, March 4, 2014 8:32:17 PM UTC-5, Henrik Bengtsson wrote:
>>
>> Did lowering "memory/ram" solve your problem?
>>
>> Also, an updated version of affxparser that no longer should overflow
>> by the integer multiplication is available (on Bioconductor).
>>
>> Cheers,
>>
>> Henrik
>>
>> On Thu, Feb 27, 2014 at 12:36 PM, Henrik Bengtsson
>> <henrik.b...@aroma-project.org> wrote:
>> > Congratulations Damian,
>> >
>> > I think you're the first one to hit a limit of the Aroma Framework
>> > (remind me to buy you a drink whenever you see me in person).
>> >
>> > I narrowed it down to the affxparser(*) package and I'll investigate
>> > further on how to fix this.  It should not occur and I'm confident
>> > that it can be avoided internally.  In the meanwhile, try to lower
>> > your 'memory/ram' setting, e.g. setOption(aromaSettings, "memory/ram",
>> > 10.0) or less.  I'm not 100% sure it'll help, but if it does, that's a
>> > good clue (for me) on what's causing it.
>> >
>> > /Henrik
>> >
>> > DETAILS: The below illustrates the issue in affxparser::readCelUnits():
>> >
>> >> .Machine$integer.max
>> > [1] 2147483647
>> >> nbrOfArrays <- 5622L
>> >> .Machine$integer.max / nbrOfArrays
>> > [1] 381978.6
>> >> nbrOfCells <- 381978L
>> >> nbrOfCells * nbrOfArrays
>> > [1] 2147480316
>> >> nbrOfCells <- 381979L
>> >> nbrOfCells * nbrOfArrays
>> > [1] NA
>> > Warning message:
>> > In nbrOfCells * nbrOfArrays : NAs produced by integer overflow
>> >
>> > By decreasing 'memory/ram' I *hope* that 'nbrOfCells' effectively
>> > becomes smaller.
>> >
>> >
>> > On Wed, Feb 26, 2014 at 9:15 PM, Damian Plichta
>> > <damian....@gmail.com> wrote:
>> >> Hi Henrik,
>> >>
>> >> Thank you, that was helpful.
>> >>
>> >> I ran into another problem though. I am trying to perform
>> >> ExonRmaPlm(csQN,
>> >> mergeGroups=TRUE) but this produces the following error:
>> >>
>> >> 20140226 23:25:33|       Identifying CDF cell indices...done
>> >> Error in vector("double", nbrOfCells * nbrOfArrays) :
>> >>   vector size cannot be NA
>> >> In addition: Warning message:
>> >> In nbrOfCells * nbrOfArrays : NAs produced by integer overflow
>> >> 20140226 23:28:35|      Reading probe intensities from 5622
>> >> arrays...done
>> >> 20140226 23:28:35|     Fitting chunk #1 of 1 of 'expression' units
>> >> (code=1)
>> >> with various dimensions...done
>> >> 20140226 23:28:35|    Unit dimension #3 (various dimensions) of
>> >> 3...done
>> >> 20140226 23:28:35|   Fitting the model by unit dimensions (at least for
>> >> the
>> >> large classes)...done
>> >> 20140226 23:28:35|  Unit type #1 ('expression') of 1...done
>> >> 20140226 23:28:35| Fitting ExonRmaPlm for each unit type
>> >> separately...done
>> >> 20140226 23:28:35|Fitting model of class ExonRmaPlm...done
>> >>
>> >> I tested whether it worked anyway, but the expression is zero across
>> >> all
>> >> arrays when I access it.
>> >>
>> >> Do you know what could be causing the problem?
>> >>
>> >> Best,
>> >> Damian
>> >>
>> >>
>> >> The code I run is below:
>> >>
>> >> library(aroma.affymetrix)
>> >>
>> >> library(aroma.core)
>> >>
>> >> setOption(aromaSettings, "memory/ram", 500.0);
>> >>
>> >> verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
>> >>
>> >> chipType <- "HuEx-1_0-st-v2-core"
>> >>
>> >> cdf <- AffymetrixCdfFile$byChipType(chipType)
>> >>
>> >> #print(cdf)
>> >>
>> >> cs <- AffymetrixCelSet$byName("experiment1", cdf=cdf)
>> >>
>> >> bc <- RmaBackgroundCorrection(cs)
>> >>
>> >> csBC <- process(bc,verbose=verbose)
>> >>
>> >> qn <- QuantileNormalization(csBC, typesToUpdate="pm")
>> >>
>> >> target <- getTargetDistribution(qn, verbose=verbose)
>> >>
>> >> qn <- QuantileNormalization(csBC, typesToUpdate="pm",
>> >> targetDistribution=target)
>> >>
>> >> csQN <- process(qn, verbose=verbose)
>> >>
>> >> csPLM <- ExonRmaPlm(csQN, mergeGroups=TRUE)
>> >>
>> >> fit(csPLM, verbose=verbose)
>> >>
>> >> date()
>> >>
>> >> ces <- getChipEffectSet(csPLM)
>> >>
>> >> gExprs <- extractDataFrame(ces, units=1:3, addNames=TRUE)
>> >>
>> >>
>> >>> sessionInfo()
>> >> R version 3.0.2 (2013-09-25)
>> >> Platform: x86_64-unknown-linux-gnu (64-bit)
>> >>
>> >> locale:
>> >>  [1] LC_CTYPE=C                 LC_NUMERIC=C
>> >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> >>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> >>
>> >> attached base packages:
>> >> [1] stats     graphics  grDevices utils     datasets  methods   base
>> >>
>> >> other attached packages:
>> >>  [1] preprocessCore_1.23.0   aroma.light_1.31.8      matrixStats_0.8.14
>> >>  [4] aroma.affymetrix_2.11.1 aroma.core_2.11.0       R.devices_2.8.2
>> >>  [7] R.filesets_2.3.0        R.utils_1.29.8          R.oo_1.17.0
>> >> [10] affxparser_1.34.0       R.methodsS3_1.6.1
>> >>
>> >> loaded via a namespace (and not attached):
>> >> [1] aroma.apd_0.4.0 base64enc_0.1-1 digest_0.6.4    DNAcopy_1.35.1
>> >> [5] PSCBS_0.40.4    R.cache_0.9.2   R.huge_0.6.0    R.rsp_0.9.28
>> >> [9] tools_3.0.2
>> >>
>> >> On Thursday, February 20, 2014 1:21:25 PM UTC-5, Henrik Bengtsson
>> >> wrote:
>> >>>
>> >>> On Tue, Feb 18, 2014 at 7:30 PM, Damian Plichta
>> >>> <damian....@gmail.com> wrote:
>> >>> > Thanks, that helped a lot. It took me less than 3 hours to perform
>> >>> > the
>> >>> > background correction.
>> >>> >
>> >>> > Now I'm wondering if for the next step, quantile normalization, I
>> >>> > could
>> >>> > do a
>> >>> > similar trick. Is there a way to precompute the target empirical
>> >>> > distribution based on all arrays and then do the normalization on
>> >>> > chunks
>> >>> > of
>> >>> > data (thus in an independent manner)? I can see the option
>> >>> > targetDistribution under QuantileNormalization.
>> >>>
>> >>> # Calculate the target distribution based on *all* arrays [not
>> >>> parallelized]
>> >>> qn <- QuantileNormalization(dsC, typesToUpdate="pm")
>> >>> target <- getTargetDistribution(qn, verbose=verbose)
>> >>>
>> >>> # Normalize array by array toward the same target distribution [in
>> >>> chunks]
>> >>> dsCs <- extract(dsC, 1:100)
>> >>> qn <- QuantileNormalization(dsCs, typesToUpdate="pm",
>> >>> targetDistribution=target)
>> >>> csNs <- process(qn, verbose=verbose)
>> >>>
>> >>> Hope this helps
>> >>>
>> >>> /Henrik
>> >>>
>> >>> >
>> >>> > Kind regards,
>> >>> >
>> >>> > Damian Plichta
>> >>> >
>> >>> > On Monday, February 17, 2014 4:03:54 PM UTC-5, Henrik Bengtsson
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi.
>> >>> >>
>> >>> >> On Sun, Feb 16, 2014 at 6:53 PM, Damian Plichta
>> >>> >> <damian....@gmail.com> wrote:
>> >>> >> > Hi,
>> >>> >> >
>> >>> >> > I'm processing around 5500 affymetrix exon arrays. The
>> >>> >> > RmaBackgroundCorrection() is pretty slow, 1-2 minutes/array. I
>> >>> >> > played
>> >>> >> > with
>> >>> >> > setOption(aromaSettings, "memory/ram", X) and increased X up to
>> >>> >> > 100
>> >>> >> > but
>> >>> >> > it
>> >>> >> > didn't have any effect on this stage of analysis.
>> >>> >>
>> >>> >> If you don't notice any difference in processing time by changing
>> >>> >> "memory/ram" from the default (1.0) to 100, then the memory is not
>> >>> >> your bottleneck.
>> >>> >> >
>> >>> >> > Any way to speed the process up?
>> >>> >>
>> >>> >> If you haven't already, make sure to read "How to: Improve
>> >>> >> processing
>> >>> >> time":
>> >>> >>
>> >>> >>   http://aroma-project.org/howtos/ImproveProcessingTime
>> >>> >>
>> >>> >> If you have access to multiple machines on the same file system,
>> >>> >> you
>> >>> >> can do poor man's parallel processing for the *background
>> >>> >> correction*,
>> >>> >> because each array is corrected independently of the others.  You
>> >>> >> can
>> >>> >> do this by processing a subset of arrays per computer, e.g.
>> >>> >>
>> >>> >> dsR <- AffymetrixCelSet$byName("MyDataSet",
>> >>> >> chipType="HuEx-1_0-st-v2")
>> >>> >> dsR <- extract(dsR, 1:100)
>> >>> >> bg <- RmaBackgroundCorrection(dsR)
>> >>> >> dsC <- process(bg, verbose=verbose)
>> >>> >>
>> >>> >> Repeat on another machine with 101:200, and so on.
>> >>> >>
>> >>> >> When all arrays have been background corrected, you can move back
>> >>> >> to
>> >>> >> your original script - all arrays background corrected are already
>> >>> >> saved to file and will therefore not be redone.
>> >>> >>
>> >>> >> /Henrik
>> >>> >>
>> >>> >> >
>> >>> >> > Kind regards,
>> >>> >> >
>> >>> >> > Damian Plichta
>> >>> >> >
>> >>> >
>> >>
>
