[aroma.affymetrix] Re: Problem with GLAD on linux cluster

Henrik Bengtsson Wed, 04 Aug 2010 01:21:15 -0700

Hi Christian,

On Wed, Aug 4, 2010 at 9:04 AM, cstratowa
<christian.strat...@vie.boehringer-ingelheim.com> wrote:
> Dear Henrik,
>
> Thank you for your suggestion to use ceRef directly.
>
> Regarding your explanation of getAverageFile() the question is where
> the generated output will be saved.
>
> As I have mentioned, each node creates first a plmData subdirectory,
> e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> normalized CEL-files located in "Prostate/plmData". Thus the output of
> getAverageFile() should be stored for each node separately.


Ah, now I see; I had been reading it as if you were linking the
directories, not the individual CEL files.

>
> This seems indeed to be the case, since e.g. the subdirectory
> "Prostate/Prostate21/plmData/Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY/
> Mapping250K_Nsp" contains the file ".average-intensities-median-
> mad,a1c33926939ee43fbed83ae69301d215.CEL" created at a certain time
> while subdirectory "Prostate/Prostate8/plmData/Prostate,ACC,-
> XY,QN,RMA,A+B,FLN,-XY/Mapping250K_Nsp" contains a file with the same
> name, i.e. ".average-intensities-median-
> mad,a1c33926939ee43fbed83ae69301d215.CEL" created at a different
> time.

Yes.

As I understand it now, you preprocess all of the data, and wait for
everything to be done (all *,chipEffects.CEL files to be generated)
before continuing with the above, correct?  If so, I'd suggest that
you also wait for getAverageFile() to finish first.  Then that average/
result file will be available to all your cluster nodes as well.  I even
think you don't have to link each CEL file separately, because nothing
else should be written back to the data set.  It should be enough to
link each data set directory, or even just plmData/ itself (I'm not even
sure there is a need to split it up anymore).
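
A minimal sketch of that ordering, in plain R and without any aroma.* calls:
the shared result is computed once in the single-server step and only read
afterwards.  The helper publishOnce() and the file name "reference.rds" are
made up for illustration; the write-then-rename step assumes a file system
where renaming within one directory is atomic, which is true for most local
ones but worth checking on a shared one.

```r
# Hypothetical helper (not part of aroma.*): compute a shared result once and
# publish it under its final name only when it is complete.
publishOnce <- function(pathname, compute) {
  # If the result file is already there, reuse it and compute nothing.
  if (file.exists(pathname)) return(invisible(pathname))

  # Write to a process-specific temporary file in the same directory ...
  tmp <- sprintf("%s.tmp.%d", pathname, Sys.getpid())
  saveRDS(compute(), file = tmp)

  # ... and then rename it, so readers do not see a half-written file.
  if (!file.rename(tmp, pathname)) {
    file.remove(tmp)
    stop("Could not rename ", tmp, " to ", pathname)
  }
  invisible(pathname)
}

# Stand-in for the single preprocessing step; in the real pipeline 'compute'
# would be whatever produces the reference, e.g. a call to getAverageFile().
pathname <- file.path(tempdir(), "reference.rds")
publishOnce(pathname, function() c(a = 1, b = 2))

# What each node would then do: only read, never write.
ref <- readRDS(pathname)
```

Run this way, the nodes never need write access to the shared result, so
there is nothing left for them to clash over.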

>
> As far as I understand these are the files created by getAverageFile()
> and thus each node creates its own file saved in its own subdirectory,
> so there will be no problem.

Yes.  Now I agree with you.

>
> It seems that the problem was indeed the result of saveObject() stored
> in ".Rcache", which caused the race conditions. Since the removal of
> saveObject() I have until now experienced no problems.

Yes.  You are correct.

Since caching is mainly done for memoization purposes, that is, to
load from file already calculated results that are computationally
expensive to obtain, it is recommended to store the cache in a fast
place.  In other words, it is better if the .Rcache directory is on
the local drive of the machine, rather than on a shared file system.
If you had done that, then each machine would have had to do those
calculations by itself once, but when done the memoization would
be faster and you would not have had any race conditions accessing the
memoized results.  The default ~/.Rcache/ can be changed, cf.
http://www.aroma-project.org/archive/GoogleGroups/web/caching.
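
As a rough sketch of what that could look like on each node, assuming the
R.cache package is the one doing the caching and that every node has its own
writable scratch space (tempdir() is only a stand-in for such a path here):

```r
# Assumption: every node has its own writable scratch space.  tempdir() is
# used here as a stand-in; on a real cluster this would be a node-local path.
localCache <- file.path(tempdir(), "Rcache")
dir.create(localCache, recursive = TRUE, showWarnings = FALSE)

# Point R.cache at that directory, if the package is installed.
if (requireNamespace("R.cache", quietly = TRUE)) {
  R.cache::setCacheRootPath(localCache)
  print(R.cache::getCacheRootPath())
}
```

This would have to run at the top of each node's script, before anything is
memoized, so that no node writes to the shared ~/.Rcache/.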

This was a useful conversation to me; it made me see other ways for
(unnecessary) race conditions to occur, and reminded me how important it
is not to overlook the smallest details in scientific communication,
since they can make big differences.

Cheers,

Henrik

>
> Thank you for your help.
> Best regards
> Christian
>
> On Aug 2, 2:54 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>
>
>
> > Hi.
>
> > On Mon, Jul 26, 2010 at 12:00 PM, cstratowa
>
> > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > Dear Henrik,
>
> > > Maybe, my explanation was not clear enough:
>
> > > I have created my own package based on S4 classes, where one subclass
> > > is "AromaSNP" with slots celset, normset, plmset, effectset as lists,
> > > and methods readSNPData(), normalizeSNPData(), computeCN(),
> > > computeRawCN(), among others. Furthermore, the package includes
> > > scripts batch.aroma.norm.R, batch.aroma.model.R,
> > > batch.aroma.combine.R, and a perl script which distributes these
> > > scripts to the different cluster nodes.
>
> > > 1, Normalization: Script batch.aroma.norm.R creates first the
> > > subdirectory structure which I have already described, and then does
> > > the normalization. All normalization steps run on one server and the
> > > results are saved as AromaSNP object "aroma" in "Prostate/
> > > Prostate.Rdata". Furthermore, subdirectories "Prostate/probeData" and
> > > "Prostate/plmData" are created.
>
> > > 2, GLAD: Script batch.aroma.norm.R is called from each node
> > > separately. For each node it creates first a plmData subdirectory,
> > > e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> > > normalized CEL-files located in "Prostate/plmData". Then it loads
> > > object "aroma" from "Prostate/Prostate.Rdata", whereby each node has a
> > > separate RAM of 2GB. Slot "ar...@effectset" contains the normalized
> > > data and is called from computeCN() as "cesList <- ar...@effectset".
> > > This "cesList" (which is in the RAM of each node) is passed to "model
> > > <- GladModel(cesList, refList)", and is thus used to compute
> > > getAverageFile(), if refList=NULL (which is the default). Since each
> > > node calls the same "cesList" object, function saveObject() writes to
> > > identical filenames which causes the clash.
>
> > > Here is the relevant code for each cluster node:
>
> > > cesList <- ar...@effectset
> > > refList <- NULL  #but see below
> > > model <- GladModel(cesList, refList)
> > > ce  <- ChromosomeExplorer(model, tags=tags)
> > > setParallelSafe(ce, status=TRUE) #suggested by you long time ago
> > > tmp <- process(ce,...)
> > > cnrs <- getRegions(model, ...)
> > > ar...@cnregion <- cnrs
>
> > > Each node saves the result, i.e. object "aroma", in a different Rdata
> > > file, e.g. "Prostate/Prostate.GLAD.21.Rdata" and saves the images in
> > > its own reports subdirectory, e.g. "Prostate/Prostate21/reports".
>
> > > 3, Script batch.aroma.combine.R combines the results of each Rdata
> > > file in a final file "Prostate/Prostate.GLAD.All.Rdata", combines all
> > > images from the different subdirectories in "Prostate/reports", and
> > > finally deletes all temporary Rdata files and subdirectories.
>
> > > Since you mention that I should call getAverageFile() before calling
> > > GladModel(), here is one more piece of information:
> > > All necessary information is stored in a text-file "pheno.txt", which
> > > is read initially and contains the names of the CEL-files, the
> > > location of each CEL-file, an alias name for each CEL-file, and
> > > optionally columns "Reference" or "Pairs". By default "refList <-
> > > NULL", however, when I activate option "Reference" then all CEL-files
> > > with "1" in column "Reference" are used as reference. Here is the
> > > relevant code:
>
> > > cesList <- ar...@effectset
> > > refList <- list();
> > > for (chiptype in names(cesList)) {
> > >   ces <- cesList[[chiptype]];
> > >   refcol <- which(pheno[,reference] == 1);
> > >   datcol <- which(pheno[,reference] == 0);
> > >   if (reference == "Pairs") {
> > >      cesList[[chiptype]] <- extract(ces, datcol);
> > >      refList[[chiptype]] <- extract(ces, refcol);
> > >   } else {
> > >      cesRef <- extract(ces, refcol);
> > >      ceRef  <- getAverageFile(cesRef);
> > >      ## convert single ceRef file into a set of identical files
> > >      ## of same size/length as ces
> > >      numarr <- nbrOfArrays(ces);
> > >      cesRef <- rep(list(ceRef), numarr);
> > >      cesRef <- newInstance(ces, cesRef,
> > >                            mergeStrands=ces$mergeStrands,
> > >                            combineAlleles=ces$combineAlleles);
>
> > FYI, aroma.* has been updated, so I think you can drop the latter
> > three rows, and simply use:
>
> > >      refList[[chiptype]] <- ceRef;
>
> > It detects this and does that "trick" internally.
>
> > >      refList[[chiptype]] <- cesRef;
> > >   }#if
> > > }#for
> > > model <- GladModel(cesList, refList);
>
> > > Thus, if really necessary, I can always create "pheno.txt" with
> > > Reference column "1" for all CEL-files in order to call
> > > getAverageFile(). However, I must admit that it is still not clear to
> > > me what the advantage would be, especially now that you have removed
> > > function saveObject().
>
> > So, what I am trying to say is that:
>
> > 1. GladModel(cesList, refList) uses a reference 'refList'.
> > 2. If refList=NULL, then it corresponds to using refList <-
> > lapply(cesList, FUN=getAverageFile).
> > 3. You can also build your own refList.  Unless you do a paired
> > analysis, refList will probably be calculated by using
> > getAverageFile() on some data set.  In your example, you extract a
> > subset (of cesList <- ar...@effectset) that are reference samples
> > ('cesRef').
>
> > In all cases, whatever set of reference samples you use, they must be
> > available at the time of the calculation getAverageFile().  That will
> > always be the case.  However, the output/result file generated by
> > getAverageFile() is not always available.  The first time one of your
> > processes completes a getAverageFile() call, a new file will be
> > created and stored on your file system.  Its name will include an MD5
> > checksum that is generated from the names of the arrays in the set
> > that you call getAverageFile() on.  If you do it twice for the same
> > set of arrays, the second time you will get the results stored on
> > file, because they have already been calculated.
>
> > So far so good; the race condition occurs when you have two processes
> > A and B that operate on the same data set 'cesList'.  Process A runs
> > the script, requests the reference, which is missing, and starts
> > running getAverageFile(cesList[[1]]).  While this is being done,
> > Process B starts, and since the *result file* of
> > getAverageFile(cesList[[1]]) is not yet available, it starts doing the
> > same calculation.  Now Process A finishes and writes its result file.
> > Later Process B writes its results to the same result file, because they
> > process the same data set; more precisely, getNames(cesList[[1]]) are
> > the same.  If Process B starts writing at the same time as Process A
> > writes, there is a potential problem.
>
> > From my troubleshooting, as far as I understand it, the only way you
> > could have gotten that error message was when two or more processes
> > did getAverageFile(cesList[[1]]) where getNames(cesList[[1]]) were
> > identical.  Are you 100% sure that is not the case? Are you saying
> > that is not the case?  If not, I am really puzzled how there could be
> > a clash in the first place.  Thus, the key point is to make sure that
> > multiple processes are not trying to calculate getAverageFile() on
> > the same array set at the same time.
>
> > /Henrik
>
> > > I hope that this explanation could explain better what the different
> > > steps are.
>
> > > Best regards
> > > Christian
>
> > > On Jul 23, 4:35 pm, Henrik Bengtsson <henrik.bengts...@gmail.com>
> > > wrote:
> > >> Hi.
>
> > >> On Jul 22, 10:24 am, cstratowa <christian.strat...@vie.boehringer-
>
> > >> ingelheim.com> wrote:
> > >> > Dear Henrik,
>
> > >> > Thank you very much for changing the code for getAverageFile(), I will
> > >> > try it and let you know.
>
> > >> > Thank you also for the explanation of writing to a temporary file, now
> > >> > I understand your intention.
>
> > >> > Regarding race conditions: No, I do not assume that aroma.* takes care
> > >> > of potential race conditions. Here is what I do:
>
> > >> > Assume that I have downloaded from GEO a prostate cancer dataset
> > >> > consisting of 40 CEL-files. Then I create a directory "Prostate" and
> > >> > subdirectories "Prostate/annotationData" and "Prostate/rawData"
> > >> > following your required file structure.
>
> > >> >  However, starting with the 2nd CEL-file I create subdirectories
> > >> > "Prostate/Prostate2",...,"Prostate/Prostate40", each containing a
> > >> > symbolic link to "../annotationData" and "../rawData" from "Prostate".
>
> ...
>

-- 
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group with website http://www.aroma-project.org/.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe and other options, go to http://www.aroma-project.org/forum/
