[aroma.affymetrix] Re: Problem with GLAD on linux cluster

cstratowa Wed, 04 Aug 2010 02:17:04 -0700

Dear Henrik,

Thank you, this discussion was very helpful for me.


Your suggestion to compute getAverageFile() first is a good idea; I
will try it when I find time to change my code again. It is also good
to know that I can change the location of ".Rcache" by using
setCacheRootPath().
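
A minimal sketch of that change, assuming the R.cache package (which
provides setCacheRootPath() and getCacheRootPath()) is installed. The
directory below is a hypothetical node-local location; tempdir() only
stands in for a local scratch disk:

```r
library(R.cache)

# Hypothetical node-local cache directory (assumption: tempdir() is on a
# local disk, not on the shared file system).
localCache <- file.path(tempdir(), "Rcache")
dir.create(localCache, recursive = TRUE, showWarnings = FALSE)

# Point the cache root away from the default ~/.Rcache.
setCacheRootPath(localCache)

# Show which directory memoized results will now be written to.
print(getCacheRootPath())
```

Each node would run this before any aroma.* call, so that memoized
results go to its own disk rather than to a shared ~/.Rcache.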

Best regards
Christian


On Aug 4, 10:21 am, Henrik Bengtsson <henrik.bengts...@gmail.com>
wrote:
> Hi Christian,
>
> On Wed, Aug 4, 2010 at 9:04 AM, cstratowa
>
> <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > Dear Henrik,
>
> > Thank you for your suggestion to use ceRef directly.
>
> > Regarding your explanation of getAverageFile() the question is where
> > the generated output will be saved.
>
> > As I have mentioned, each node creates first a plmData subdirectory,
> > e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> > normalized CEL-files located in "Prostate/plmData". Thus the output of
> > getAverageFile() should be stored for each node separately.
>
> Ah, now I see; I had been reading it as if you were linking the
> directories, not the individual CEL files.
>
>
>
> > This seems indeed to be the case, since e.g. the subdirectory
> > "Prostate/Prostate21/plmData/Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY/
> > Mapping250K_Nsp" contains the file ".average-intensities-median-
> > mad,a1c33926939ee43fbed83ae69301d215.CEL" created at a certain time
> > while subdirectory "Prostate/Prostate8/plmData/Prostate,ACC,-
> > XY,QN,RMA,A+B,FLN,-XY/Mapping250K_Nsp" contains a file with the same
> > name, i.e. ".average-intensities-median-
> > mad,a1c33926939ee43fbed83ae69301d215.CEL" created at a different
> > time.
>
> Yes.
>
> As I understand it now, you preprocess all of the data, and wait for
> everything to be done (all *,chipEffects.CEL files to be generated)
> before continuing with the above, correct?  If so, I'd suggest that
> you also wait for getAverageFile() to finish first.  Then that
> average/results file will be available to all your cluster nodes as
> well.  I even think you don't have to link each CEL file separately,
> because nothing else should be written back to the data set.  It
> should be enough to link each data set directory, or even just
> plmData/ itself (I'm not even sure there is a need to split it up
> anymore).
>
>
>
> > As far as I understand these are the files created by getAverageFile()
> > and thus each node creates its own file saved in its own subdirectory,
> > so there will be no problem.
>
> Yes.  Now I agree with you.
>
>
>
> > It seems that the problem was indeed the result of saveObject() stored
> > in ".Rcache", which caused the race conditions. Since the removal of
> > saveObject() I have until now experienced no problems.
>
> Yes.  You are correct.
>
> Since caching is mainly done for memoization purposes, that is, to
> load from file already calculated results that are computationally
> expensive to obtain, it is recommended to store the cache in a fast
> place.  In other words, it is better if the .Rcache directory is on
> the local drive of the machine, rather than on a shared file system.
> If you had done that, then each machine would have had to do those
> calculations itself once, but when done the memoization would be
> faster and you would not have had any race conditions accessing the
> memoized results.  The default ~/.Rcache/ can be changed,
> cf. http://www.aroma-project.org/archive/GoogleGroups/web/caching.
>
> This was a useful conversation to me; it made me see other ways for
> (unnecessary) race conditions to occur, and reminded me how important
> it is to not overlook the smallest details in scientific
> communication, since they can make big differences.
>
> Cheers,
>
> Henrik
>
>
>
> > Thank you for your help.
> > Best regards
> > Christian
>
> > On Aug 2, 2:54 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>
> > > Hi.
>
> > > On Mon, Jul 26, 2010 at 12:00 PM, cstratowa
>
> > > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > > Dear Henrik,
>
> > > > Maybe, my explanation was not clear enough:
>
> > > > I have created my own package based on S4 classes, where one subclass
> > > > is "AromaSNP" with slots celset, normset, plmset, effectset as lists,
> > > > and methods readSNPData(), normalizeSNPData(), computeCN(),
> > > > computeRawCN(), among others. Furthermore, the package includes
> > > > scripts batch.aroma.norm.R, batch.aroma.model.R,
> > > > batch.aroma.combine.R, and a perl script which distributes these
> > > > scripts to the different cluster nodes.
>
> > > > 1, Normalization: Script batch.aroma.norm.R creates first the
> > > > subdirectory structure which I have already described, and then does
> > > > the normalization. All normalization steps run on one server and the
> > > > results are saved as AromaSNP object "aroma" in "Prostate/
> > > > Prostate.Rdata". Furthermore, subdirectories "Prostate/probeData" and
> > > > "Prostate/plmData" are created.
>
> > > > 2, GLAD: Script batch.aroma.norm.R is called from each node
> > > > separately. For each node it creates first a plmData subdirectory,
> > > > e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> > > > normalized CEL-files located in "Prostate/plmData". Then it loads
> > > > object "aroma" from "Prostate/Prostate.Rdata", whereby each node has a
> > > > separate RAM of 2GB. Slot "ar...@effectset" contains the normalized
> > > > data and is called from computeCN() as "cesList <- ar...@effectset".
> > > > This "cesList" (which is in the RAM of each node) is passed to "model
> > > > <- GladModel(cesList, refList)", and is thus used to compute
> > > > getAverageFile(), if refList=NULL (which is the default). Since each
> > > > node calls the same "cesList" object, function saveObject() writes to
> > > > identical filenames which causes the clash.
>
> > > > Here is the relevant code for each cluster node:
>
> > > > cesList <- ar...@effectset
> > > > refList <- NULL  #but see below
> > > > model <- GladModel(cesList, refList)
> > > > ce  <- ChromosomeExplorer(model, tags=tags)
> > > > setParallelSafe(ce, status=TRUE) #suggested by you long time ago
> > > > tmp <- process(ce,...)
> > > > cnrs <- getRegions(model, ...)
> > > > ar...@cnregion <- cnrs
>
> > > > Each node saves the result, i.e. object "aroma", in a different Rdata
> > > > file, e.g. "Prostate/Prostate.GLAD.21.Rdata" and saves the images in
> > > > its own reports subdirectory, e.g. "Prostate/Prostate21/reports".
>
> > > > 3, Script batch.aroma.combine.R combines the results of each Rdata
> > > > file in a final file "Prostate/Prostate.GLAD.All.Rdata", combines all
> > > > images from the different subdirectories in "Prostate/reports", and
> > > > finally deletes all temporary Rdata files and subdirectories.
>
> > > > Since you mention that I should call getAverageFile() before calling
> > > > GladModel(), here is one more piece of information:
> > > > All necessary information is stored in a text-file "pheno.txt", which
> > > > is read initially and contains the names of the CEL-files, the
> > > > location of each CEL-file, an alias name for each CEL-file, and
> > > > optionally columns "Reference" or "Pairs". By default "refList <-
> > > > NULL", however, when I activate option "Reference" then all CEL-files
> > > > with "1" in column "Reference" are used as reference. Here is the
> > > > relevant code:
>
> > > > cesList <- ar...@effectset
> > > > refList <- list();
> > > > for (chiptype in names(cesList)) {
> > > >   ces <- cesList[[chiptype]];
> > > >   refcol <- which(pheno[,reference] == 1);
> > > >   datcol <- which(pheno[,reference] == 0);
> > > >   if (reference == "Pairs") {
> > > >      cesList[[chiptype]] <- extract(ces, datcol);
> > > >      refList[[chiptype]] <- extract(ces, refcol);
> > > >   } else {
> > > >      cesRef <- extract(ces, refcol);
> > > >      ceRef  <- getAverageFile(cesRef);
> > > >      ## convert single ceRef file into a set of identical files
> > > >      ## of same size/length as ces
> > > >      numarr <- nbrOfArrays(ces);
> > > >      cesRef <- rep(list(ceRef), numarr);
> > > >      cesRef <- newInstance(ces, cesRef, mergeStrands=ces$mergeStrands,
> > > >                            combineAlleles=ces$combineAlleles);
>
> > > FYI, aroma.* has been updated, so I think you can drop the latter
> > > three rows, and simply use:
>
> > > >      refList[[chiptype]] <- ceRef;
>
> > > It detects this and does that "trick" internally.
>
> > > >      refList[[chiptype]] <- cesRef;
> > > >   }#if
> > > > }#for
> > > > model <- GladModel(cesList, refList);
>
> > > > Thus, if really necessary, I can always create "pheno.txt" with
> > > > Reference column "1" for all CEL-files in order to call
> > > > getAverageFile(). However, I must admit that it is still not clear to
> > > > me what the advantage would be, especially now that you have removed
> > > > function saveObject().
>
> > > So, what I am trying to say is that:
>
> > > 1. GladModel(cesList, refList) uses a reference 'refList'.
> > > 2. If refList=NULL, then it corresponds to using refList <-
> > > lapply(cesList, FUN=getAverageFile).
> > > 3. You can also build your own refList.  Unless you do a paired
> > > analysis, refList will probably be calculated by using
> > > getAverageFile() on some data set.  In your example, you extract a
> > > subset (of cesList <- ar...@effectset) that are reference samples
> > > ('cesRef').
>
> > > In all cases, whatever set of reference samples you use, they must be
> > > available at the time of the calculation getAverageFile().  That will
> > > always be the case.  However, the output/result file generated by
> > > getAverageFile() is not always available.  The first time one of your
> > > processes completes a getAverageFile() call, a new file will be
> > > created and stored on your file system.  Its name will be an md5
> > > checksum that is generated from the names of the arrays in the set
> > > that you call getAverageFile() on.  If you do it twice for the same
> > > set of arrays, the second time you will get the results stored on
> > > file, because they have already been calculated.
>
> > > So far so good.  The race condition occurs when you have two
> > > processes A and B that operate on the same data set 'cesList'.
> > > Process A runs the script; it requests the reference, which is
> > > missing, and starts running getAverageFile(cesList[[1]]).  While
> > > this is done, Process B starts doing the same thing, and since the
> > > *result file* of getAverageFile(cesList[[1]]) is not available, it
> > > starts
>
> ...

