[aroma.affymetrix] Re: Problem with GLAD on linux cluster

Henrik Bengtsson Fri, 23 Jul 2010 07:35:29 -0700

Hi.

On Jul 22, 10:24 am, cstratowa
<christian.strat...@vie.boehringer-ingelheim.com> wrote:
> Dear Henrik,
>
> Thank you very much for changing the code for getAverageFile(), I will
> try it and let you know.
>
> Thank you also for the explanation of writing to a temporary file, now
> I understand your intention.
>
> Regarding race conditions: No, I do not assume that aroma.* takes care
> of potential race conditions. Here is what I do:
>
> Assume that I have downloaded from GEO a prostate cancer dataset
> consisting of 40 CEL-files. Then I create a directory "Prostate" and
> subdirectories "Prostate/annotationData" and "Prostate/rawData"
> following your required file structure.
>
>  However, starting with the 2nd CEL-file I create subdirectories
> "Prostate/Prostate2",...,"Prostate/Prostate40", each containing a
> symbolic link to "../annotationData" and "../rawData" from "Prostate".


Do I understand you correctly that you use a separate "project"
directory for each CEL file, so that when you process the data you get
separate subdirectories probeData/ and plmData/ in each of these
project directories?

> Thus when running GLAD each cluster node has its own directory to
> write to, e.g. "Prostate/Prostate21/reports" for creating the images.

This is where I get lost.  In order to do CN segmentation (here GLAD),
you need to calculate CN ratios relative to a reference.  Looking at
your error message, that reference is calculated from the pool of
samples, i.e. getAverageFile() is done on the pool of references.
Thus, for this to make sense you need a *pool of samples*, but if I
understood you correctly above, you don't have that, but only one
array per project directory.  I guess I misunderstood you, because
your error indicates something else.

The only way the error you got could have occurred is if multiple R
sessions tried to run getAverageFile(ces) on data sets that contain
arrays with the same names and in the same order (more precisely, the
same getNames(ces)).  If they contained different array names, there
would be no clash, because that saveObject() statement (that I just
removed) would write to different filenames.  This makes me suspect
that you do indeed use the same pool of reference samples.

> Only after all nodes have finished their computations, then I move the
> relevant files to the main directory, e.g. all images are moved to
> "Prostate/reports". Afterwards I delete the subdirectories
> "Prostate2",...,"Prostate40" and their contents.
>
> As you can see, using this setup there should not be any race
> conditions. The only remaining problem are the temporary files which
> you store in ".Rcache" in my home directory.

So, there is something I don't understand above.  Can you post your
full script?  That would certainly remove some of the ambiguities.

Also, it helps if you change your script to be explicit about the
getAverageFile() calculation, i.e.

print(cesN);
ceR <- getAverageFile(cesN);
print(ceR);
seg <- GladModel(cesN, ceR);
print(seg);

instead of letting GladModel() do it implicitly:

seg <- GladModel(cesN);
print(seg);

As explained above, if your parallelized R sessions calculate ceR <-
getAverageFile(cesN) on the same 'cesN' data set, they will all try to
generate the same 'ceR' result file, and you have a race condition.
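
If several sessions really must make that call on the same data set,
one crude way to serialize them is a lock directory.  The following is
only a sketch in base R, assuming that all nodes see the same file
system and that creating a directory is atomic there (which I would
not take for granted on every network file system).  withDirLock() and
the lock path are made-up names for illustration; nothing like this is
part of aroma.* or R.utils.

```r
# Hypothetical helper (NOT part of aroma.* or R.utils): evaluate 'expr'
# while holding a lock that is represented by a directory.
withDirLock <- function(lockDir, expr, timeout = 60, interval = 0.5) {
  waited <- 0
  # dir.create() returns FALSE if the directory already exists, i.e.
  # if another session currently holds the lock.
  while (!dir.create(lockDir, showWarnings = FALSE)) {
    if (waited >= timeout) stop("Timed out waiting for lock: ", lockDir)
    Sys.sleep(interval)
    waited <- waited + interval
  }
  # Release the lock when done, also if 'expr' throws an error.
  on.exit(unlink(lockDir, recursive = TRUE), add = TRUE)
  force(expr)  # 'expr' is evaluated lazily, i.e. only here
}

# Stand-in demo; in a real script the body would be something like
# ceR <- getAverageFile(cesN)
lockDir <- file.path(tempdir(), "getAverageFile.lock")
result <- withDirLock(lockDir, {
  Sys.getpid()
})
```

Note that a session that dies while holding the lock leaves the
directory behind, so this is no substitute for a proper mutex.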

>
> I know that you store the monocell files in ".Rcache/
> aroma.affymetrix", so that the monocell files have to be created only
> once.

Actually, the monocell *CDF* is stored in the corresponding
annotationData/chipTypes/<chipType>/ directory.

What is stored in .Rcache/ is mainly there for performance purposes,
i.e. we use it for memoization [http://en.wikipedia.org/wiki/Memoization].
Moreover, we mostly use it for memoization of annotation data, because
that type of information is likely to be requested multiple times for
the same chip types regardless of data set.  In order for memoization
to work well across R sessions and hosts, the .Rcache/ directory needs
to be globally accessible.  We rarely use memoization for experimental
data, because that is typically only requested once (in the data set's
lifetime).
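
To illustrate the idea of file-based memoization, here is a minimal
sketch in base R.  It is not how aroma.* stores things in .Rcache/;
memoizeToFile(), slowLookup(), the cache layout and the chip type name
"ChipTypeA" are all made up for the example, and the key is assumed to
be a string that is safe to use as a file name.

```r
# Hypothetical helper: wrap 'fcn' so that its result for a given key is
# computed once and afterwards read back from a file in 'cacheDir'.
memoizeToFile <- function(fcn, cacheDir) {
  dir.create(cacheDir, showWarnings = FALSE, recursive = TRUE)
  function(key) {
    pathname <- file.path(cacheDir, sprintf("%s.rds", key))
    if (file.exists(pathname)) return(readRDS(pathname))  # cache hit
    value <- fcn(key)                                     # cache miss
    saveRDS(value, file = pathname)
    value
  }
}

nCalls <- 0
slowLookup <- function(chipType) {
  nCalls <<- nCalls + 1  # count how often the real work is done
  nchar(chipType)        # stand-in for an expensive annotation lookup
}

cacheDir <- file.path(tempdir(), "memo-demo")
lookup <- memoizeToFile(slowLookup, cacheDir)
lookup("ChipTypeA")  # computed and written to the cache directory
lookup("ChipTypeA")  # read back from the cache file; no recomputation
```

Because the cached result lives in a file, any other R session that
points at the same directory gets the cache hit as well, which is why
the cache directory has to be shared to be useful.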

> However, for the temporary files please allow me to suggest that
> you create a temporary directory in your file structure, e.g.
> "Prostate/tmp", where these files are stored. In my case this would
> definitely solve my problem since each subdirectory would contain its
> own temporary directory, e.g. "Prostate/Prostate21/tmp". I do not know
> if this change would break any code or cause any problems, it is only
> a naive suggestion. What is your  opinion?

Your suggestion makes sense for data-set-specific temporary files etc.,
but again, I don't think that is the case here.  Instead, I think we
are misunderstanding each other.  Your script will help.

/Henrik

>
> Best regards
> Christian
>
> On Jul 21, 6:46 pm, Henrik Bengtsson <henrik.bengts...@gmail.com>
> wrote:
>
>
>
> > Hi Christian.
>
> > On Wed, Jul 21, 2010 at 2:59 PM, cstratowa
>
> > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > Dear Henrik,
>
> > > Thank you for this extensive explanation and sorry for the late reply
> > > but I was pretty busy.
>
> > > Yes, it did work before! As I mentioned with versions
> > > aroma.affymetrix_1.1.0 and earlier I have never had a  problem doing
> > > the analyses on cluster nodes.
>
> > > Looking at the source code of different versions of saveObject() I
> > > realize that using "saveObject(..,safe=FALSE)" would be the same as
> > > using saveObject() from R.utils_0.9.1. Thus in principle this could
> > > solve my problem. Is this correct?
>
> > > Sadly, method AffymetrixCelSet::getAverageFile() in
> > > aroma.affymetrix_1.6.2 does not allow to pass parameter "safe=FALSE"
> > > to saveObject(). Is it possible for you to change it?
>
> > I have decided to remove that debug code that calls saveObject(),
> > because it is not really needed anymore.  The main reason why I remove
> > it is because it is obsolete code.  The intention of that code snippet
> > in getAverageFile() was never to protect against race conditions (it
> > was just an unplanned side effect).
>
> > Until next release, you can get a patched version as:
>
> > library("aroma.affymetrix");
> > downloadPackagePatch("aroma.affymetrix");
>
> > Note, as I said in my previous reply, by processing (=here calling
> > getAverageFile() on) the same data set on multiple hosts, you are
> > potentially running into race conditions resulting in corrupt data.
> > You should at least be aware of it and understand why this is the
> > case.
>
> > > It is still not clear to me why you create first a temporary file
> > > which you then rename (although you mention power failures etc).
> > > However, would it be possible to add a random number to the temporary
> > > filename, e.g. "*.tmp.1948234", so that the problem with the existing
> > > temporary file could be avoided?
>
> > The main purpose of writing to a temporary file and then renaming is
> > to make sure that the file is complete.  If something happens while
> > writing the temporary file, the final file will not exist/be created.
> > If one would write to the final file from the beginning, there is no
> > way for us to know if the file was correctly created or not.  So,
> > writing via a temporary file, we effectively have a way of creating
> > files in one atomic action.
>
> > > Probably you only need to change line 59 to:
>
> > > pathnameT <- sprintf("%s.tmp.%i", pathname,
> > > as.integer(runif(1,1,99999999)))
>
> > In order not to corrupt the temporary file, we check if it already
> > exist as a protection for being overwritten/added to by another
> > process.  Yes, you could randomize the name of the temporary file,
> > lowering the risk of two hosts writing to the same temporary file.
> > However, when done, both hosts will try to rename their temporary
> > files to the same pathname.  If done at the same time, we still may
> > have problems.
>
> > > Regarding your suggestion to wrap getAverageFile() in Mutex calls I
> > > have no idea if there exists an R-package for this purpose. Neither
> > > Rmpi nor snow seem to be suitable for this purpose (at least  not
> > > without a complete re-write of my package).
>
> > Yes, I neither know of a functional mutex implementation in R.  You
> > can achieve some by utilizing the lock mechanisms of data base servers
> > (not SqlLite), but nothing ready is available to my knowledge.
>
> > Again, you seem to assume that aroma.* takes care of potential race
> > conditions for you - it does not.  It only tries to detect them
> > without warranty - and indeed, the reason why got the error in the
> > first place indicates that you are pushing the system and that race
> > conditions may very well happen.  If you run things in parallel and
> > you are updating/writing the *same data resource*, you should really
> > have protection against race conditions.  This is a generic problem
> > unrelated to aroma.*.
>
> > /Henrik
>
> > > One other question:
> > > Is it allowed to delete the contents of directory .Rcache/
> > > aroma.affymetrix/idChecks?
>
> > Yes, it should be safe to delete any .Rcache/ as long as no R session
> > is in the process of writing to it.  It's a cache containing redundant
> > information.
>
> > > Best regards
> > > Christian
>
> > > On Jul 2, 12:47 am, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>
> > > > Hi Christian.
>
> > > > On Tue, Jun 29, 2010 at 3:39 PM, cstratowa
>
> > > > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > > > Dear Henrik,
>
> > > > > Until now I have used aroma.affymetrix_1.1.0 with R-2.8.1 and could
> > > > > run my analysis on our sge-cluster w/o any problems.
>
> > > > > Now I have upgraded to R-2.11.1 and to aroma.affymetrix_1.6.2 and are
> > > > > curently testing with 8 chips whether my package based on
> > > > > aroma.affymetrix still works on the cluster. The normalization step on
> > > > > a server did run fine, howeever, distributing the 8 samples on the
> > > > > cluster to run GladModel() resulted in the problem that 3 of 8 cluster
> > > > > nodes did stop with the following error message:
>
> > > > > Loading required package: GLAD
> > > > > ...
> > > > > Loading required package: RColorBrewer
> > > > > Loading required package: Cairo
> > > > > Error in list(`computeCN(aroma, model = model, arrays = arrays[i],
> > > > > chromosomes = 1:23, ref` = <environment>,  :
>
> > > > > [2010-06-29 15:08:49] Exception: Cannot save to file. Temporary file
> > > > > already exists: ~/.Rcache/aroma.affymetrix/idChecks/
> > > > > a1c33926939ee43fbed83ae69301d215.tmp
> > > > >  at throw(Exception(...))
> > > > >  at throw.default("Cannot save to file. Temporary file already
> > > > > exists: ", pathn
> > > > >  at throw("Cannot save to file. Temporary file already exists: ",
> > > > > pathnameT)
> > > > >  at saveObject.default(list(key = key, keyIds = lapply(key, digest2),
> > > > > id = id),
> > > > >  at saveObject(list(key = key, keyIds = lapply(key, digest2), id =
> > > > > id), idPathn
> > > > >  at getAverageFile.AffymetrixCelSet(ces, force = force, verbose =
> > > > > less(verbose)
> > > > >  at NextMethod(generic = "getAverageFile", object = this, indices =
> > > > > indices, ..
> > > > >  at getAverageFile.ChipEffectSet(ces, force = force, verbose =
> > > > > less(verbose))
> > > > >  at NextMethod(generic = "getAverageFile", object = this, ...)
> > > > >  at getAverageFile.SnpChipEffectSet(ces, force = force, verbose =
> > > > > less(verbose)
> > > > >  at NextMethod(generic = "getAverageFile", object = this, ...)
> > > > >  at getAverageFile.CnChipEffectS
> > > > > Calls: computeCN ... saveObject.default -> throw -> throw.default ->
> > > > > throw -> throw.Exception
> > > > > Execution halted
>
> > > > > Interestingly, on the other 5 nodes GladModel() seems to run fine.
>
> > > > > Do you have any idea what the reason for this problem might be?
>
> > > > This seems to be due to a race condition, because several processes
> > > > calls getAverageFile() on the same data set (set of data files).  It
> > > > has nothing to do with the GladModel - that is only calling
> > > > getAverageFile() in order to calculate the average signal across all
> > > > samples in the data set.
>
> > > > More precisely, in this particular case it is saveObject() of R.utils
> > > > that detects that there already exist a temporary file (added file
> > > > name extension *.tmp) that is currently being created and written to
> > > > by another process.  This temporary file is renamed to its final name
> > > > when done.  The reason why didn't observe it before is most likely
> > > > because this additional feature was added to saveObject() in R.utils
> > > > v1.2.4:
>
> > > > Version: 1.2.4 [2009-10-30]
> > > > o ROBUSTIFICATION: Lowered the risk for saveObject() to leave an
> > > >   imcomplete
>
> ...

