Dear Henrik,

Thank you for your suggestion to use ceRef directly.
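With that change, the else-branch of my reference-building loop (quoted
further down) reduces to something like the following sketch, where
pheno, reference, extract() and aroma@effectset are as in my script:

  cesList <- aroma@effectset
  refList <- list()
  for (chiptype in names(cesList)) {
    ces    <- cesList[[chiptype]]
    refcol <- which(pheno[, reference] == 1)         # reference samples from pheno.txt
    cesRef <- extract(ces, refcol)
    refList[[chiptype]] <- getAverageFile(cesRef)    # ceRef used directly, as you suggest
  }
  model <- GladModel(cesList, refList)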

Regarding your explanation of getAverageFile(), the question is where
the generated output will be saved.
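If it helps, one way I can check this directly is to ask the returned
file object for its path (assuming getPathname() applies to the
chip-effect file returned by getAverageFile()):

  ceRef <- getAverageFile(cesRef)
  print(getPathname(ceRef))  # full path of the .average-intensities-*.CEL file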

As I have mentioned, each node first creates a plmData subdirectory,
e.g. "Prostate/Prostate21/plmData", and makes symbolic links to the
normalized CEL-files located in "Prostate/plmData". Thus the output of
getAverageFile() should be stored separately for each node.
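For completeness, the per-node setup amounts to roughly the following
(node, data-set and chip-type names are examples taken from the paths
below; in practice my perl script drives this, so it is only a sketch):

  node    <- "Prostate21"
  dataSet <- "Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY"
  chip    <- "Mapping250K_Nsp"
  srcDir  <- file.path("Prostate", "plmData", dataSet, chip)
  dstDir  <- file.path("Prostate", node, "plmData", dataSet, chip)
  dir.create(dstDir, recursive=TRUE)
  ## one symbolic link per normalized CEL-file
  celFiles <- list.files(srcDir, pattern="[.](cel|CEL)$", full.names=TRUE)
  file.symlink(from=normalizePath(celFiles),
               to=file.path(dstDir, basename(celFiles)))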

This does indeed seem to be the case. For example, the subdirectory
"Prostate/Prostate21/plmData/Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY/Mapping250K_Nsp"
contains the file
".average-intensities-median-mad,a1c33926939ee43fbed83ae69301d215.CEL"
created at a certain time, while the subdirectory
"Prostate/Prostate8/plmData/Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY/Mapping250K_Nsp"
contains a file with the same name created at a different time.

As far as I understand, these are the files created by
getAverageFile(), so each node creates its own file in its own
subdirectory and there should be no problem.
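For the record, this is how I compared the nodes' files (a quick check
on the head node, not part of the pipeline itself):

  ## list every node's hidden average file and compare sizes and timestamps
  files <- Sys.glob("Prostate/Prostate*/plmData/*/Mapping250K_Nsp/.average-intensities-*.CEL")
  print(file.info(files)[, c("size", "mtime")])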

It seems that the problem was indeed caused by saveObject() writing to
".Rcache", which led to the race conditions. Since the removal of
saveObject() I have not experienced any problems.

Thank you for your help.
Best regards
Christian


On Aug 2, 2:54 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
> Hi.
>
> On Mon, Jul 26, 2010 at 12:00 PM, cstratowa
> <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > Dear Henrik,
>
> > Maybe, my explanation was not clear enough:
>
> > I have created my own package based on S4 classes, where one subclass
> > is "AromaSNP" with slots celset, normset, plmset, effectset as lists,
> > and methods readSNPData(), normalizeSNPData(), computeCN(),
> > computeRawCN(), among others. Furthermore, the package includes
> > scripts batch.aroma.norm.R, batch.aroma.model.R,
> > batch.aroma.combine.R, and a perl script which distributes these
> > scripts to the different cluster nodes.
>
> > 1, Normalization: Script batch.aroma.norm.R creates first the
> > subdirectory structure which I have already described, and then does
> > the normalization. All normalization steps run on one server and the
> > results are saved as AromaSNP object "aroma" in "Prostate/
> > Prostate.Rdata". Furthermore, subdirectories "Prostate/probeData" and
> > "Prostate/plmData" are created.
>
> > 2, GLAD: Script batch.aroma.model.R is called from each node
> > separately. For each node it first creates a plmData subdirectory,
> > e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> > normalized CEL-files located in "Prostate/plmData". Then it loads
> > object "aroma" from "Prostate/Prostate.Rdata", whereby each node has a
> > separate RAM of 2GB. Slot "ar...@effectset" contains the normalized
> > data and is called from computeCN() as "cesList <- ar...@effectset".
> > This "cesList" (which is in the RAM of each node) is passed to "model
> > <- GladModel(cesList, refList)", and is thus used to compute
> > getAverageFile(), if refList=NULL (which is the default). Since each
> > node uses the same "cesList" object, function saveObject() writes to
> > identical filenames, which causes the clash.
>
> > Here is the relevant code for each cluster node:
>
> > cesList <- aroma@effectset
> > refList <- NULL  #but see below
> > model <- GladModel(cesList, refList)
> > ce  <- ChromosomeExplorer(model, tags=tags)
> > setParallelSafe(ce, status=TRUE) #suggested by you long time ago
> > tmp <- process(ce,...)
> > cnrs <- getRegions(model, ...)
> > aroma@cnregion <- cnrs
>
> > Each node saves the result, i.e. object "aroma", in a different Rdata
> > file, e.g. "Prostate/Prostate.GLAD.21.Rdata" and saves the images in
> > its own reports subdirectory, e.g. "Prostate/Prostate21/reports".
>
> > 3, Script batch.aroma.combine.R combines the results of each Rdata
> > file in a final file "Prostate/Prostate.GLAD.All.Rdata", combines all
> > images from the different subdirectories in "Prostate/reports", and
> > finally deletes all temporary Rdata files and subdirectories.
>
> > Since you mention that I should call getAverageFile() before calling
> > GladModel(), here is one more piece of information:
> > All necessary information is stored in a text-file "pheno.txt", which
> > is read initially and contains the names of the CEL-files, the
> > location of each CEL-file, an alias name for each CEL-file, and
> > optionally columns "Reference" or "Pairs". By default "refList <-
> > NULL", however, when I activate option "Reference" then all CEL-files
> > with "1" in column "Reference" are used as reference. Here is the
> > relevant code:
>
> > cesList <- aroma@effectset
> > refList <- list();
> > for (chiptype in names(cesList)) {
> >   ces <- cesList[[chiptype]];
> >   refcol <- which(pheno[,reference] == 1);
> >   datcol <- which(pheno[,reference] == 0);
> >   if (reference == "Pairs") {
> >      cesList[[chiptype]] <- extract(ces, datcol);
> >      refList[[chiptype]] <- extract(ces, refcol);
> >   } else {
> >      cesRef <- extract(ces, refcol);
> >      ceRef  <- getAverageFile(cesRef);
> >      ## convert single ceRef file into a set of identical files of
> >      ## the same size/length as ces
> >      numarr <- nbrOfArrays(ces);
> >      cesRef <- rep(list(ceRef), numarr);
> >      cesRef <- newInstance(ces, cesRef, mergeStrands=ces$mergeStrands,
> >                            combineAlleles=ces$combineAlleles);
>
> FYI, aroma.* has been updated, so I think you can drop the last three
> lines and simply use:
>
> >      refList[[chiptype]] <- ceRef;
>
> It detects this and does that "trick" internally.
>
> >      refList[[chiptype]] <- cesRef;
> >   }#if
> > }#for
> > model <- GladModel(cesList, refList);
>
> > Thus, if really necessary, I can always create "pheno.txt" with
> > Reference column "1" for all CEL-files in order to call
> > getAverageFile(). However, I must admit that it is still not clear to
> > me what the advantage would be, especially now that you have removed
> > function saveObject().
>
> So, what I am trying to say is that:
>
> 1. GladModel(cesList, refList) uses a reference 'refList'.
> 2. If refList=NULL, then it corresponds to using refList <-
> lapply(cesList, FUN=getAverageFile).
> 3. You can also build your own refList.  Unless you do a paired
> analysis, refList will probably be calculated by using
> getAverageFile() on some data set.  In your example, you extract a
> subset (of cesList <- aroma@effectset) containing the reference
> samples ('cesRef').
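Just to confirm I follow point 2: with refList=NULL my call is then
effectively equivalent to the following, writing out explicitly the
lapply() that, as I understand it, is otherwise done for me:

  cesList <- aroma@effectset
  refList <- lapply(cesList, FUN=getAverageFile)  # what refList=NULL amounts to
  model   <- GladModel(cesList, refList)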
>
> In all cases, whatever set of reference samples you use, they must be
> available at the time getAverageFile() is calculated.  That will
> always be the case.  However, the output/result file generated by
> getAverageFile() is not always available.  The first time one of your
> processes completes a getAverageFile() call, a new file will be
> created and stored on your file system.  Its name will be an MD5
> checksum that is generated from the names of the arrays in the set
> that you call getAverageFile() on.  If you do it twice for the same
> set of arrays, the second time you will get the results already
> stored on file, because they have already been calculated.
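That matches what I see: the hidden file has the identical name on
every node because the array names are identical. Roughly, the idea is
something like the following, where digest::digest() is only my
stand-in for whatever aroma.affymetrix actually uses internally:

  library(digest)
  key   <- digest(getNames(cesList[[1]]))  # same array names on every node => same key
  fname <- sprintf(".average-intensities-median-mad,%s.CEL", key)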
>
> So far so good; the race condition occurs when you have two processes
> A and B that operate on the same data set 'cesList'.  Process A runs
> the script, it requests the reference, which is missing, and starts
> running getAverageFile(cesList[[1]]).  While this is running, Process B
> starts doing the same thing, and since the *result file* of
> getAverageFile(cesList[[1]]) is not available, it starts doing the
> same thing.  Now Process A finishes and writes its result file.  Later
> Process B writes its results to the same result file, because they
> process the same data set, more precisely getNames(cesList[[1]]) are
> the same.  If Process B starts writing at the same time as Process A
> writes, there is a potential problem.
>
> From my troubleshooting, as far as I understand it, the only way you
> could have gotten that error message was when two or more processes
> did getAverageFile(cesList[[1]]) where getNames(cesList[[1]]) were
> identical.  Are you 100% sure that is not the case? Are you saying
> that is not the case?  If not, I am really puzzled how there could be
> a clash in the first place.  Thus, the key point is to make sure that
> multiple processes are not trying to calculate getAverageFile() on
> the same array set at the same time.
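Understood. Should I need the pooled reference again in a parallel
setting, I will compute it once on the head node before dispatching the
cluster jobs, so that the memoized result file already exists when the
nodes start:

  ## run once, up front, before the cluster jobs are launched
  cesList <- aroma@effectset
  refList <- lapply(cesList, FUN=getAverageFile)
  ## the nodes can then call GladModel(cesList, refList) (or refList=NULL),
  ## since the average file is already stored on the file system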
>
> /Henrik
>
> > I hope this explanation makes the different steps clearer.
>
> > Best regards
> > Christian
>
> > On Jul 23, 4:35 pm, Henrik Bengtsson <henrik.bengts...@gmail.com>
> > wrote:
> >> Hi.
>
> >> On Jul 22, 10:24 am, cstratowa
> >> <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> >> > Dear Henrik,
>
> >> > Thank you very much for changing the code for getAverageFile(), I will
> >> > try it and let you know.
>
> >> > Thank you also for the explanation of writing to a temporary file, now
> >> > I understand your intention.
>
> >> > Regarding race conditions: No, I do not assume that aroma.* takes care
> >> > of potential race conditions. Here is what I do:
>
> >> > Assume that I have downloaded from GEO a prostate cancer dataset
> >> > consisting of 40 CEL-files. Then I create a directory "Prostate" and
> >> > subdirectories "Prostate/annotationData" and "Prostate/rawData"
> >> > following your required file structure.
>
> >> >  However, starting with the 2nd CEL-file I create subdirectories
> >> > "Prostate/Prostate2",...,"Prostate/Prostate40", each containing a
> >> > symbolic link to "../annotationData" and "../rawData" from "Prostate".
>
> >> Do I understand you correctly that you use a separate "project"
> >> directory for each CEL file, so that when you process the data you get
> >> separate subdirectories probeData/ and plmData/ in each of these
> >> project directories?
>
> >> > Thus when running GLAD each cluster node has its own directory to
> >> > write to, e.g. "Prostate/Prostate21/reports" for creating the images.
>
> >> This is where I get lost.  In order to do CN segmentation (here GLAD),
> >> you need to calculate CN ratios relative to a reference.  Looking at
> >> your error message, that reference is calculated from the pool of
> >> samples, i.e. getAverageFile() is done on the pool of references.
> >> Thus, for this to make sense you need a *pool of samples*, but if I
> >> understood you correctly above, you don't have that, but only one
> >> array per project directory.  I guess I misunderstood you, because
> >> your error indicates something else.
>
> >> The only way the error you got occurred was because multiple R
> >> sessions tried to run getAverageFile(ces) on data sets that contain
> >> arrays with the same names and in the same order (more precisely
> >> getNames(ces)).  If they would contain different array names, there
> >> would be no clash, because that saveObject() statement (that I just
> >> removed) would write to different filenames.  This makes me suspect
> >> that you indeed use the same pool of reference samples.
>
> >> > Only after all nodes have finished their computations, then I move the
> >> > relevant files to the main directory, e.g. all images are moved to
> >> > "Prostate/reports". Afterwards I delete the subdirectories
> >> > "Prostate2",...,"Prostate40" and their contents.
>
> >> > As you can see, using this setup there should not be any race
> >> > conditions. The only remaining problem is the temporary files which
> >> > you store in ".Rcache" in my home directory.
>
> >> So, there is something I don't understand above.  Can you post your
> >> full script,
>
> ...
>
