[aroma.affymetrix] Re: Problem with GLAD on linux cluster

cstratowa Mon, 26 Jul 2010 03:00:50 -0700

Dear Henrik,

Maybe, my explanation was not clear enough:


I have created my own package based on S4 classes, where one subclass
is "AromaSNP" with slots celset, normset, plmset, effectset as lists,
and methods readSNPData(), normalizeSNPData(), computeCN(),
computeRawCN(), among others. Furthermore, the package includes
scripts batch.aroma.norm.R, batch.aroma.model.R,
batch.aroma.combine.R, and a perl script which distributes these
scripts to the different cluster nodes.


1, Normalization: Script batch.aroma.norm.R creates first the
subdirectory structure which I have already described, and then does
the normalization. All normalization steps run on one server and the
results are saved as AromaSNP object "aroma" in "Prostate/
Prostate.Rdata". Furthermore, subdirectories "Prostate/probeData" and
"Prostate/plmData" are created.


2, GLAD: Script batch.aroma.norm.R is called from each node
separately. For each node it creates first a plmData subdirectory,
e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
normalized CEL-files located in "Prostate/plmData". Then it loads
object "aroma" from "Prostate/Prostate.Rdata", whereby each node has a
separate RAM of 2GB. Slot "ar...@effectset" contains the normalized
data and is called from computeCN() as "cesList <- ar...@effectset".
This "cesList" (which is in the RAM of each node) is passed to "model
<- GladModel(cesList, refList)", and is thus used to compute
getAverageFile(), if refList=NULL (which is the default). Since each
node calls the same "cesList" object, function saveObject() writes to
identical filenames which causes the clash.

Here is the relevant code for each cluster node:

cesList <- ar...@effectset
refList <- NULL  #but see below
model <- GladModel(cesList, refList)
ce  <- ChromosomeExplorer(model, tags=tags)
setParallelSafe(ce, status=TRUE) #suggested by you long time ago
tmp <- process(ce,...)
cnrs <- getRegions(model, ...)
ar...@cnregion <- cnrs

Each node saves the result, i.e. object "aroma", in a different Rdata
file, e.g. "Prostate/Prostate.GLAD.21.Rdata" and saves the images in
its own reports subdirectory, e.g. "Prostate/Prostate21/reports".


3, Script batch.aroma.combine.R combines the results of each Rdata
file in a final file "Prostate/Prostate.GLAD.All.Rdata", combines all
images from the different subdirectories in "Prostate/reports", and
finally deletes all temporary Rdata files and subdirectories.


Since you mention that I should call getAverageFile() before calling
GladModel(), here is one more information:
All necessary information is stored in a text-file "pheno.txt", which
is read initially and contains the names of the CEL-files, the
location of each CEL-file, an alias name for each CEL-file, and
optionally columns "Reference" or "Pairs". By default "refList <-
NULL", however, when I activate option "Reference" then all CEL-files
with "1" in column "Reference" are used as reference. Here is the
relevant code:

cesList <- ar...@effectset
refList <- list();
for (chiptype in names(cesList)) {
   ces <- cesList[[chiptype]];
   refcol <- which(pheno[,reference] == 1);
   datcol <- which(pheno[,reference] == 0);
   if (reference == "Pairs") {
      cesList[[chiptype]] <- extract(ces, datcol);
      refList[[chiptype]] <- extract(ces, refcol);
   } else {
      cesRef <- extract(ces, refcol);
      ceRef  <- getAverageFile(cesRef);
      ## convert single ceRef file into a set of identical files of
same size/length as ces
      numarr <- nbrOfArrays(ces);
      cesRef <- rep(list(ceRef), numarr);
      cesRef <- newInstance(ces, cesRef, mergeStrands=ces
$mergeStrands, combineAlleles=ces$combineAlleles);
      refList[[chiptype]] <- cesRef;
   }#if
}#for
model <- GladModel(cesList, refList);

Thus, if really necessary, I can always create "pheno.txt" with
Reference column "1" for all CEL-files in order to call
getAverageFile(). However, I must admit that it is still not clear to
me what the advantage would be, especially now that you have removed
function saveObject().

I hope that this explanation could explain better what the different
steps are.

Best regards
Christian


On Jul 23, 4:35 pm, Henrik Bengtsson <henrik.bengts...@gmail.com>
wrote:
> Hi.
>
> On Jul 22, 10:24 am, cstratowa <christian.strat...@vie.boehringer-
>
>
>
> ingelheim.com> wrote:
> > Dear Henrik,
>
> > Thank you very much for changing the code for getAverageFile(), I will
> > try it and let you know.
>
> > Thank you also for the explanation of writing to a temporary file, now
> > I understand your intention.
>
> > Regarding race conditions: No, I do not assume that aroma.* takes care
> > of potential race conditions. Here is what I do:
>
> > Assume that I have downloaded from GEO a prostate cancer dataset
> > consisting of 40 CEL-files. Then I create a directory "Prostate" and
> > subdirectories "Prostate/annotationData" and "Prostate/rawData"
> > following your required file structure.
>
> >  However, starting with the 2nd CEL-file I create subdirectories
> > "Prostate/Prostate2",...,"Prostate/Prostate40", each containing a
> > symbolic link to "../annotationData" and "../rawData" from "Prostate".
>
> Do I understand you correctly that you use a separate "project"
> directory for each CEL file, so that when you process the data you get
> separate subdirectories probeData/ and plmData/ in each of these
> project directories?
>
> > Thus when running GLAD each cluster node has its own directory to
> > write to, e.g. "Prostate/Prostate21/reports" for creating the images.
>
> This is where I get lost.  In order to do CN segmentation (here GLAD),
> you need to calculate CN ratios relative to a reference.  Looking at
> your error message, that reference is calculated from the pool of
> samples, i.e. getAverageFile() is done on the pool of references.
> Thus, for this to make sense you need a *pool of samples*, but if I
> understood you correctly above, you don't have that, but only one
> array per project directory.  I guess I misunderstood you, because
> your error indicates something else.
>
> The only way the error you got occurred was because multiple R
> sessions tried to run getAverageFile(ces) on data sets that contain
> arrays with the same names and in the same order (more precisely
> getNames(ces)).  If they would contain different array names, there
> would be no clash, because that saveObject() statement (that I just
> removed) would write to different filenames.  This makes me suspect
> that you indeed use the same pool of reference samples.
>
> > Only after all nodes have finished their computations, then I move the
> > relevant files to the main directory, e.g. all images are moved to
> > "Prostate/reports". Afterwards I delete the subdirectories
> > "Prostate2",...,"Prostate40" and their contents.
>
> > As you can see, using this setup there should not be any race
> > conditions. The only remaining problem are the temporary files which
> > you store in ".Rcache" in my home directory.
>
> So, there is something I don't understand above.  Can you post you
> full script, because that would certainly remove some of the
> ambiguities.
>
> Also, it helps if change your script to be explicit about the
> getAverageFile() calculation, i.e.
>
> print(cesN);
> ceR <- getAverageFile(cesN);
> print(ceR);
> seg <- GladModel(cesN, ceR);
> print(seg);
>
> instead of letting GladModel() do it implicitly:
>
> seg <- GladModel(cesN);
> print(seg);
>
> As explained above, if your parallelized R sessions calculate ceR <-
> getAverageFile(cesN) on the same 'cesN" data set they will try to
> generated the same 'ceR' result file, and you have a race condition.
>
>
>
> > I know that you store the monocell files in ".Rcache/
> > aroma.affymetrix", so that the monocell files have to be created only
> > once.
>
> Actually, the monocell *CDF* is stored in the corresponding
> annotationData/chipTypes/<chipType>/ directory.
>
> What is stored in .Rcache/ is main for performance purpose, i.e. we
> use it for memoization [http://en.wikipedia.org/wiki/Memoization].
> Moreover, we mostly use it for memoization of annotation data, because
> that type of information is likely to be requested multiple times for
> the same chip types regardless of data set.  In order for memoization
> to work well across R sessions and hosts, the .Rcache/ directory need
> to be accessed globally.  We rarely use memoization for experimental
> data, because that is typically only requested once (in the data sets
> life time).
>
> > However, for the temporary files please allow me to suggest that
> > you create a temporary directory in your file structure, e.g.
> > "Prostate/tmp", where these files are stored. In my case this would
> > definitely solve my problem since each subdirectory would contain its
> > own temporary directory, e.g. "Prostate/Prostate21/tmp". I do not know
> > if this change would break any code or cause any problems, it is only
> > a naive suggestion. What is your  opinion?
>
> Your suggestion makes sense for dataset specific temporary files etc,
> but again, I don't think that is the case here.  Instead I think we
> are misunderstanding each other.  You script will help.
>
> /Henrik
>
>
>
> > Best regards
> > Christian
>
> > On Jul 21, 6:46 pm, Henrik Bengtsson <henrik.bengts...@gmail.com>
> > wrote:
>
> > > Hi Christian.
>
> > > On Wed, Jul 21, 2010 at 2:59 PM, cstratowa
>
> > > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > > Dear Henrik,
>
> > > > Thank you for this extensive explanation and sorry for the late reply
> > > > but I was pretty busy.
>
> > > > Yes, it did work before! As I mentioned with versions
> > > > aroma.affymetrix_1.1.0 and earlier I have never had a  problem doing
> > > > the analyses on cluster nodes.
>
> > > > Looking at the source code of different versions of saveObject() I
> > > > realize that using "saveObject(..,safe=FALSE)" would be the same as
> > > > using saveObject() from R.utils_0.9.1. Thus in principle this could
> > > > solve my problem. Is this correct?
>
> > > > Sadly, method AffymetrixCelSet::getAverageFile() in
> > > > aroma.affymetrix_1.6.2 does not allow to pass parameter "safe=FALSE"
> > > > to saveObject(). Is it possible for you to change it?
>
> > > I have decided to remove that debug code that calls saveObject(),
> > > because it is not really needed anymore.  The main reason why I remove
> > > it is because it is obsolete code.  The intention of that code snippet
> > > in getAverageFile() was never to protect against race conditions (it
> > > was just an unplanned side effect).
>
> > > Until next release, you can get a patched version as:
>
> > > library("aroma.affymetrix");
> > > downloadPackagePatch("aroma.affymetrix");
>
> > > Note, as I said in my previous reply, by processing (=here calling
> > > getAverageFile() on) the same data set on multiple hosts, you are
> > > potentially running into race conditions resulting in corrupt data.
> > > You should at least be aware of it and understand why this is the
> > > case.
>
> > > > It is still not clear to me why you create first a temporary file
> > > > which you then rename (although you mention power failures etc).
> > > > However, would it be possible to add a random number to the temporary
> > > > filename, e.g. "*.tmp.1948234", so that the problem with the existing
> > > > temporary file could be avoided?
>
> > > The main purpose of writing to a temporary file and then renaming is
> > > to make sure that the file is complete.  If something happens while
> > > writing the temporary file, the final file will not exist/be created.
> > > If one would write to the final file from the beginning, there is no
> > > way for us to know if the file was correctly created or not.  So,
> > > writing via a temporary file, we effectively have a way of creating
> > > files in one atomic action.
>
> > > > Probably you only need to change line 59 to:
>
> > > > pathnameT <- sprintf("%s.tmp.%i", pathname,
> > > > as.integer(runif(1,1,99999999)))
>
> > > In order not to corrupt the temporary file, we check if it already
> > > exist as a protection for being overwritten/added to by another
> > > process.  Yes, you could randomize the name of the temporary file,
> > > lowering the risk of two hosts writing to the same temporary file.
> > > However, when done, both hosts will try to rename their temporary
> > > files to the same pathname.  If done at the same time, we still may
> > > have problems.
>
> > > > Regarding your suggestion to wrap getAverageFile() in Mutex calls I
> > > > have no idea if there exists an R-package for this purpose. Neither
> > > > Rmpi nor snow seem to be suitable for this purpose (at least  not
> > > > without a complete re-write of my package).
>
> > > Yes, I neither know of a functional mutex implementation in R.  You
> > > can achieve some by utilizing the lock mechanisms of data base servers
> > > (not SqlLite), but nothing ready is available to my knowledge.
>
> > > Again, you seem to assume that aroma.* takes care of potential race
> > > conditions for you - it does not.  It only tries to detect them
> > > without warranty - and indeed, the reason why got the error in the
> > > first place indicates that you are pushing the system and that race
> > > conditions may very well happen.  If you run things in parallel and
> > > you are updating/writing the *same data resource*, you should really
> > > have protection against race conditions.  This is a generic problem
> > > unrelated to aroma.*.
>
> > > /Henrik
>
> > > > One other question:
> > > > Is it allowed to delete the contents of directory .Rcache/
> > > > aroma.affymetrix/idChecks?
>
> > > Yes, it should be safe to delete any .Rcache/ as long as no R session
> > > is in the process of writing to it.  It's a cache containing redundant
> > > information.
>
> > > > Best regards
> > > > Christian
>
> > > > On Jul 2, 12:47 am, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>
> > > > > Hi Christian.
>
> > > > > On Tue, Jun 29, 2010 at 3:39 PM, cstratowa
>
> > > > > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > > > > Dear Henrik,
>
> > > > > > Until now I have used aroma.affymetrix_1.1.0 with R-2.8.1 and could
> > > > > > run my analysis on our sge-cluster w/o any problems.
>
> > > > > > Now I have upgraded to R-2.11.1 and to
>
> ...
>
> read more »

-- 
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group with website http://www.aroma-project.org/.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe and other options, go to http://www.aroma-project.org/forum/

[aroma.affymetrix] Re: Problem with GLAD on linux cluster

Reply via email to