Dear Henrik, Thank you, this discussion was very helpful for me.
Your suggestion to compute getAverageFile() first is a good idea; I will try it when I find time to change my code again. It is also good to know that I can change the location of ".Rcache" by using setCacheRootPath().

Best regards
Christian

On Aug 4, 10:21 am, Henrik Bengtsson <henrik.bengts...@gmail.com> wrote:
> Hi Christian,
>
> On Wed, Aug 4, 2010 at 9:04 AM, cstratowa
> <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > Dear Henrik,
> >
> > Thank you for your suggestion to use ceRef directly.
> >
> > Regarding your explanation of getAverageFile() the question is where
> > the generated output will be saved.
> >
> > As I have mentioned, each node creates first a plmData subdirectory,
> > e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> > normalized CEL-files located in "Prostate/plmData". Thus the output of
> > getAverageFile() should be stored for each node separately.
>
> Ah, now I see; I've been reading it as you were linking the
> directories, not the individual CEL files.
>
> > This seems indeed to be the case, since e.g. the subdirectory
> > "Prostate/Prostate21/plmData/Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY/Mapping250K_Nsp"
> > contains the file
> > ".average-intensities-median-mad,a1c33926939ee43fbed83ae69301d215.CEL"
> > created at a certain time while subdirectory
> > "Prostate/Prostate8/plmData/Prostate,ACC,-XY,QN,RMA,A+B,FLN,-XY/Mapping250K_Nsp"
> > contains a file with the same name, i.e.
> > ".average-intensities-median-mad,a1c33926939ee43fbed83ae69301d215.CEL"
> > created at a different time.
>
> Yes.
>
> As I understand it now, you preprocess all of the data, and wait for
> everything to be done (all *,chipEffects.CEL files to be generated)
> before continuing with the above, correct? If so, I'd suggest that
> you also wait for getAverageFile() to finish first. Then that
> average/results file will be available to all your cluster nodes as well.
> I even think you don't have to link each CEL file separately, because
> nothing else should be written back to the data set. It should be
> enough to link each data set directory, or even just plmData/ itself
> (I am not even sure there is a need to split it up anymore).
>
> > As far as I understand these are the files created by getAverageFile()
> > and thus each node creates its own file saved in its own subdirectory,
> > so there will be no problem.
>
> Yes. Now I agree with you.
>
> > It seems that the problem was indeed the result of saveObject() stored
> > in ".Rcache", which caused the race conditions. Since the removal of
> > saveObject() I have until now experienced no problems.
>
> Yes. You are correct.
>
> Since caching is mainly done for memoization purposes, that is, to
> load from file already calculated results that are computationally
> expensive to obtain, it is recommended to store the cache in a fast
> place. In other words, it is better if the .Rcache directory is on
> the local drive of the machine, rather than on a shared file system.
> If you had done that, then each machine would have had to do those
> calculations by itself once, but when done the memoization would
> be faster and you would not have had any race conditions accessing the
> memoized results. The default ~/.Rcache/ can be changed, cf.
> http://www.aroma-project.org/archive/GoogleGroups/web/caching.
>
> This was a useful conversation to me; it made me see other ways for
> (unnecessary) race conditions to occur, and reminded me how important
> it is to not overlook the smallest details in scientific communication
> since they can make big differences.
>
> Cheers,
>
> Henrik
>
> > Thank you for your help.
> >
> > Best regards
> > Christian
> >
> > On Aug 2, 2:54 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
> > > Hi.
> > >
> > > On Mon, Jul 26, 2010 at 12:00 PM, cstratowa
> > > <christian.strat...@vie.boehringer-ingelheim.com> wrote:
> > > > Dear Henrik,
> > > >
> > > > Maybe, my explanation was not clear enough:
> > > >
> > > > I have created my own package based on S4 classes, where one subclass
> > > > is "AromaSNP" with slots celset, normset, plmset, effectset as lists,
> > > > and methods readSNPData(), normalizeSNPData(), computeCN(),
> > > > computeRawCN(), among others. Furthermore, the package includes
> > > > scripts batch.aroma.norm.R, batch.aroma.model.R,
> > > > batch.aroma.combine.R, and a perl script which distributes these
> > > > scripts to the different cluster nodes.
> > > >
> > > > 1, Normalization: Script batch.aroma.norm.R creates first the
> > > > subdirectory structure which I have already described, and then does
> > > > the normalization. All normalization steps run on one server and the
> > > > results are saved as AromaSNP object "aroma" in
> > > > "Prostate/Prostate.Rdata". Furthermore, subdirectories
> > > > "Prostate/probeData" and "Prostate/plmData" are created.
> > > >
> > > > 2, GLAD: Script batch.aroma.norm.R is called from each node
> > > > separately. For each node it creates first a plmData subdirectory,
> > > > e.g. "Prostate/Prostate21/plmData" and makes symbolic links to the
> > > > normalized CEL-files located in "Prostate/plmData". Then it loads
> > > > object "aroma" from "Prostate/Prostate.Rdata", whereby each node has a
> > > > separate RAM of 2GB. Slot "ar...@effectset" contains the normalized
> > > > data and is called from computeCN() as "cesList <- ar...@effectset".
> > > > This "cesList" (which is in the RAM of each node) is passed to "model
> > > > <- GladModel(cesList, refList)", and is thus used to compute
> > > > getAverageFile(), if refList=NULL (which is the default). Since each
> > > > node calls the same "cesList" object, function saveObject() writes to
> > > > identical filenames which causes the clash.
> > > >
> > > > Here is the relevant code for each cluster node:
> > > >
> > > >    cesList <- ar...@effectset
> > > >    refList <- NULL   # but see below
> > > >    model <- GladModel(cesList, refList)
> > > >    ce <- ChromosomeExplorer(model, tags=tags)
> > > >    setParallelSafe(ce, status=TRUE)   # suggested by you long time ago
> > > >    tmp <- process(ce,...)
> > > >    cnrs <- getRegions(model, ...)
> > > >    ar...@cnregion <- cnrs
> > > >
> > > > Each node saves the result, i.e. object "aroma", in a different Rdata
> > > > file, e.g. "Prostate/Prostate.GLAD.21.Rdata" and saves the images in
> > > > its own reports subdirectory, e.g. "Prostate/Prostate21/reports".
> > > >
> > > > 3, Script batch.aroma.combine.R combines the results of each Rdata
> > > > file in a final file "Prostate/Prostate.GLAD.All.Rdata", combines all
> > > > images from the different subdirectories in "Prostate/reports", and
> > > > finally deletes all temporary Rdata files and subdirectories.
> > > >
> > > > Since you mention that I should call getAverageFile() before calling
> > > > GladModel(), here is one more information:
> > > > All necessary information is stored in a text-file "pheno.txt", which
> > > > is read initially and contains the names of the CEL-files, the
> > > > location of each CEL-file, an alias name for each CEL-file, and
> > > > optionally columns "Reference" or "Pairs". By default "refList <-
> > > > NULL", however, when I activate option "Reference" then all CEL-files
> > > > with "1" in column "Reference" are used as reference.
> > > > Here is the relevant code:
> > > >
> > > >    cesList <- ar...@effectset
> > > >    refList <- list();
> > > >    for (chiptype in names(cesList)) {
> > > >      ces <- cesList[[chiptype]];
> > > >      refcol <- which(pheno[,reference] == 1);
> > > >      datcol <- which(pheno[,reference] == 0);
> > > >      if (reference == "Pairs") {
> > > >        cesList[[chiptype]] <- extract(ces, datcol);
> > > >        refList[[chiptype]] <- extract(ces, refcol);
> > > >      } else {
> > > >        cesRef <- extract(ces, refcol);
> > > >        ceRef <- getAverageFile(cesRef);
> > > >        ## convert single ceRef file into a set of identical files of
> > > >        ## same size/length as ces
> > > >        numarr <- nbrOfArrays(ces);
> > > >        cesRef <- rep(list(ceRef), numarr);
> > > >        cesRef <- newInstance(ces, cesRef, mergeStrands=ces$mergeStrands,
> > > >                              combineAlleles=ces$combineAlleles);
> > >
> > > FYI, aroma.* has been updated, so I think you can drop the latter
> > > three rows, and simply use:
> > >
> > >    refList[[chiptype]] <- ceRef;
> > >
> > > It detects this and does that "trick" internally.
> > >
> > > >        refList[[chiptype]] <- cesRef;
> > > >      }#if
> > > >    }#for
> > > >    model <- GladModel(cesList, refList);
> > > >
> > > > Thus, if really necessary, I can always create "pheno.txt" with
> > > > Reference column "1" for all CEL-files in order to call
> > > > getAverageFile(). However, I must admit that it is still not clear to
> > > > me what the advantage would be, especially now that you have removed
> > > > function saveObject().
> > >
> > > So, what I am trying to say is that:
> > >
> > > 1. GladModel(cesList, refList) uses a reference 'refList'.
> > > 2. If refList=NULL, then it corresponds to using refList <-
> > >    lapply(cesList, FUN=getAverageFile).
> > > 3. You can also build your own refList. Unless you do a paired
> > >    analysis, refList will probably be calculated by using
> > >    getAverageFile() on some data set. In your example, you extract a
> > >    subset (of cesList <- ar...@effectset) that are reference samples
> > >    ('cesRef').
> > >
> > > In all cases, whatever set of reference samples you use, they must be
> > > available at the time of the calculation getAverageFile(). That will
> > > always be the case. However, the output/result file generated by
> > > getAverageFile() is not always available. The first time one of your
> > > processes completes a getAverageFile() call, a new file will be
> > > created and stored on your file system. Its name will include an md5
> > > checksum that is generated from the names of the arrays in the set
> > > that you call getAverageFile() on. If you do it twice for the same
> > > set of arrays, you will the second time get the results stored on
> > > file, because they have already been calculated.
> > >
> > > So far so good. The race condition occurs when you have two processes
> > > A and B that operate on the same data set 'cesList'. Process A runs
> > > the script, requests the reference, which is missing, and starts
> > > running getAverageFile(cesList[[1]]). While this is being done,
> > > Process B starts doing the same thing, and since the *result file* of
> > > getAverageFile(cesList[[1]]) is not available, it starts
>
> ...

--
When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example.

You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group with website http://www.aroma-project.org/.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe and other options, go to http://www.aroma-project.org/forum/
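
The race discussed in this thread (two processes both find the result file missing, and both compute and write it) can be made harmless by publishing the file in a single rename step. What follows is only a minimal base-R sketch of that idea, not how aroma.affymetrix or R.cache implement their caching. The helper cache_once(), its arguments and the demo key "average" are hypothetical names introduced here for illustration. The sketch assumes the temporary file and the target sit on the same POSIX file system, where a rename replaces the target in one step.

```r
# Hypothetical helper (not part of aroma.affymetrix or R.cache): return a
# cached value if one has been published, otherwise compute and publish it.
cache_once <- function(key, compute, cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  target <- file.path(cache_dir, paste0(key, ".rds"))

  # Fast path: a complete result file already exists.
  if (file.exists(target)) return(readRDS(target))

  value <- compute()

  # Write to a temporary file in the same directory; the process id in the
  # name keeps concurrent writers from sharing one temporary file.
  tmp <- tempfile(pattern = sprintf("%s-%d-", key, Sys.getpid()),
                  tmpdir = cache_dir, fileext = ".tmp")
  saveRDS(value, tmp)

  # Publish under the final name. Readers see either no file or a complete
  # one (assumption: rename within one POSIX file system).
  if (!file.rename(tmp, target)) unlink(tmp)
  value
}

# Demo with a stand-in for an expensive calculation.
cache_dir <- file.path(tempdir(), "Rcache-demo")
unlink(cache_dir, recursive = TRUE)   # start from an empty demo cache
calls <- 0
slow_average <- function() {
  calls <<- calls + 1                 # count how often we really compute
  mean(c(2, 4, 6))
}
first  <- cache_once("average", slow_average, cache_dir)
second <- cache_once("average", slow_average, cache_dir)
```

In the demo the second call reads the published file instead of recomputing. Two processes that race may still both compute the value, but neither can read a half-written file. Computing the result once before dispatching the nodes, as suggested in the thread, avoids the duplicate work as well.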