Re: [aroma.affymetrix] problem using CRMA v2

Henrik Bengtsson Fri, 30 Nov 2012 13:09:46 -0800

For the record, here's an updated on this thread, which we took the
discussion and troubleshooting off line:


The problems was identified and solved.  The short story is that it
was an unfortunate and somewhat unlikely set of actions (using a
different CDF with the exact same name and then changing to use the
official one) that caused it to happen, but it could still happen.
The outcome of this report (thanks!), is that I've made updates so
that in the next release this will not happen again.

The long story is that the Aroma Framework utilizes so called
memoization (aka caching) to speed up the processing of repetitive
tasks.  For instance, when you query a CDF to identify all cell
indices for PM probes in all SNPs, it takes quite some time to scan
through the CDF (which you notice the first time you work with a new
chip type), but the second time you ask Aroma to do the same thing,
Aroma is clever enough to recognize that you've done this before, and
instead loads the results from a cache (on your file system).  In
order lookup results in the cache, Aroma looks at the input arguments,
where the CDF file is one of them, and generate a unique hash key.  In
the current version, this key is generated only from the filename of
the CDF (which caused the about key clash), but in the upcoming
version of the package it will be generated from the filename, the
filesize and the file MD5 checksum, which makes the key unique.

So, case closed.

/Henrik

On Tue, Nov 27, 2012 at 12:15 PM, Ioana Cutcutache
<ioana.cutcuta...@gmail.com> wrote:
> Hi Henrik,
>
> I just did what you suggested and seems it's all ok, the checksum is fine,
> output below.
> Any idea what else might be causing the problem?
>
>> cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6", tags="Full");
>> # ASSERT CORRECTNESS
>> stopifnot(identical(getChecksum(cdf), "3fbe0f6e7c8a346105238a3f3d10d4ec"))
>> cdfM <- getMonocellCdf(cdf);
>> print(cdfM);
> AffymetrixCdfFile:
> Path: annotationData/chipTypes/GenomeWideSNP_6
> Filename: GenomeWideSNP_6,Full,monocell.CDF
> Filesize: 416.27MB
> Chip type: GenomeWideSNP_6,Full,monocell
> RAM: 0.00MB
> File format: v4 (binary; XDA)
> Dimension: 1681x1681
> Number of cells: 2825761
> Number of units: 1881415
> Cells per unit: 1.50
> Number of QC units: 4
>> stopifnot(identical(getChecksum(cdfM),
>> "3a797e24a62f4f2e5214a6bf819151da"));
>> getChecksum(cdfM)
> [1] "3a797e24a62f4f2e5214a6bf819151da"
>
> Ioana
>
> On Tuesday, 27 November 2012 21:40:54 UTC+2, Henrik Bengtsson wrote:
>>
>> Hi,
>>
>> this certainly looks weird and it's very puzzling.  My last guess it
>> that your so called monocell CDF is corrupt (but I cannot see how that
>> happened).  Verify that you get the same:
>>
>> library("aroma.affymetrix");
>>
>> cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6", tags="Full");
>> # ASSERT CORRECTNESS
>> stopifnot(identical(getChecksum(cdf), "3fbe0f6e7c8a346105238a3f3d10d4ec"))
>>
>> cdfM <- getMonocellCdf(cdf);
>> print(cdfM);
>> ## AffymetrixCdfFile:
>> ## Path: annotationData/chipTypes/GenomeWideSNP_6
>> ## Filename: GenomeWideSNP_6,Full,monocell.CDF
>> ## File size: 416.27 MB (436492121 bytes)
>> ## Chip type: GenomeWideSNP_6,Full,monocell
>> ## RAM: 0.00MB
>> ## File format: v4 (binary; XDA)
>> ## Dimension: 1681x1681
>> ## Number of cells: 2825761
>> ## Number of units: 1881415
>> ## Cells per unit: 1.50
>> ## Number of QC units: 4
>>
>> # ASSERT CORRECTNESS
>> stopifnot(identical(getChecksum(cdfM),
>> "3a797e24a62f4f2e5214a6bf819151da"));
>>
>>
>> If you get an error at the latter step, then I recommend that you
>> remove that file by doing:
>>
>> pathnameM <- getPathname(cdfM);
>> file.remove(pathnameM);
>> stopifnot(!isFile(pathnameM));
>>
>> and then also clear the file cache for aroma.affymetrix by doing:
>>
>> path <- getCachePath(dirs="aroma.affymetrix");
>> removeDirectory(path, recursive=TRUE);
>>
>> Then regenerate the monocell CDF (it will take a long time):
>>
>> cdfM <- getMonocellCdf(cdf, verbose=-10);
>>
>> and again compare with the output I get, particularly:
>>
>> stopifnot(identical(getChecksum(cdfM),
>> "3a797e24a62f4f2e5214a6bf819151da"));
>>
>> If get to this point, it means that was the problem and you should be
>> able to start your analysis.
>>
>>
>> I'm really sorry that you get hit by this - it's really odd and it's
>> extremely rare that someone has to got through all this mess to get
>> things working - it should just work out of the box and hopefully it
>> will do after this.
>>
>> /Henrik
>>
>> On Tue, Nov 27, 2012 at 2:39 AM, Ioana Cutcutache
>> <ioana.cu...@gmail.com> wrote:
>> > I am still not able to get same as you for units (so if I continue doing
>> > units2 <- fit(plm, verbose=-10) I am not getting the right thing
>> > either).
>> > I ran everything again and here is the output of the last part:
>> >
>> >> plm <- AvgCnPlm(csN, mergeStrands=TRUE, combineAlleles=TRUE)
>> >> units1 <- fitCnProbes(plm, verbose=verbose);
>> > Estimating single-probe CN units...
>> >  Getting chip-effect set...
>> >   Getting chip-effect set from data set...
>> >   Getting chip-effect set from data set...done
>> >  Getting chip-effect set...done
>> >  Identifying CN units...
>> >    int [1:945826] 935590 935591 935592 935593 935594 935595 935596
>> > 935597
>> > 935598 935599 ...
>> >  Identifying CN units...done
>> >  Identifying subset to fit...
>> >   Identifying non-fitted units in chip-effect file...
>> >    Pathname:
>> >
>> > plmData/test,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6/91228050T,chipEffects.CEL
>> >    Found indices cached on file
>> >    Reading data for these 1881415 cells...
>> >    Reading data for these 1881415 cells...done
>> >    Looking for stdvs <= 0 indicating non-estimated units:
>> >     int [1:1881415] 1 2 3 4 5 6 7 8 9 10 ...
>> >   Identifying non-fitted units in chip-effect file...done
>> >    int [1:945826] 935590 935591 935592 935593 935594 935595 935596
>> > 935597
>> > 935598 935599 ...
>> >  Identifying subset to fit...done
>> >  Identifying cell indices for CN units...
>> >  Identifying cell indices for CN units...done
>> >  Keeping only single-cell units...
>> >   Single-cell units:
>> >    int [1:945826] 935590 935591 935592 935593 935594 935595 935596
>> > 935597
>> > 935598 935599 ...
>> >   Cell indices:
>> >    int [1:945826] 3435872 3403913 422133 2456228 4388533 5792832 3788106
>> > 3660159 3524469 3264141 ...
>> >  Keeping only single-cell units...done
>> >  Identifying cell indices for estimates...
>> >    int [1:945826] 1876054 1876055 1876056 1876057 1876058 1876059
>> > 1876060
>> > 1876061 1876062 1876063 ...
>> >  Identifying cell indices for estimates...done
>> >  Fitting 1 arrays...
>> >   Array #1 ('91228050T') of 1...
>> >    Reading signals...
>> >      num [1:945826] 6448 7260 7878 6151 12230 ...
>> >    Reading signals...done
>> >    Transforming signals to estimates...
>> >     'data.frame':       945826 obs. of  4 variables:
>> >      $ cell    : int  1876054 1876055 1876056 1876057 1876058 1876059
>> > 1876060 1876061 1876062 1876063 ...
>> >      $ theta   : num  6448 7260 7878 6151 12230 ...
>> >      $ sdTheta : num  1.49e-08 1.49e-08 1.49e-08 1.49e-08 1.49e-08 ...
>> >      $ outliers: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
>> >    Transforming signals to estimates...done
>> >    Writing estimates...
>> >     Storing flat data to file...
>> >      Updating file...
>> >      Updating file...done
>> >     Storing flat data to file...done
>> >    Writing estimates...done
>> >   Array #1 ('91228050T') of 1...done
>> >              used  (Mb) gc trigger  (Mb)  max used  (Mb)
>> >   Ncells   483738  25.9    1733216  92.6  12913475 689.7
>> >   Vcells 28397870 216.7   83594476 637.8 130588565 996.4
>> >  Fitting 1 arrays...done
>> > Estimating single-probe CN units...done
>> >> str(units1);
>> >  int [1:945826] 935590 935591 935592 935593 935594 935595 935596 935597
>> > 935598 935599 ...
>> >> units <- findUnitsTodo(plm);
>> >> str(units);
>> >  int [1:935589] 1 2 3 4 5 6 7 8 9 10 ...
>> >
>> > On Monday, 26 November 2012 23:08:47 UTC+2, Henrik Bengtsson wrote:
>> >>
>> >> On Mon, Nov 26, 2012 at 12:46 PM, Ioana Cutcutache
>> >> <ioana.cu...@gmail.com> wrote:
>> >> > Ok, so the first part ran without any errors, output below:
>> >> >
>> >> >> print(csN);
>> >> > AffymetrixCelSet:
>> >> > Name: test
>> >> > Tags: ACC,ra,-XY,BPN,-XY
>> >> > Path: probeData/test,ACC,ra,-XY,BPN,-XY/GenomeWideSNP_6
>> >> > Platform: Affymetrix
>> >> > Chip type: GenomeWideSNP_6,Full
>> >> > Number of arrays: 1
>> >> > Names: 91228050T [1]
>> >> > Time period: 2011-07-19 05:08:51 -- 2011-07-19 05:08:51
>> >> > Total file size: 65.83MB
>> >> > RAM: 0.01MB
>> >> >> print(getFullNames(csN));
>> >> > [1] "91228050T"
>> >> >> cdf2 <- getCdf(csN);
>> >> >> stopifnot(identical(getChecksum(cdf2), getChecksum(cdf))); #
>> >> >> IMPORTANT
>> >> >>
>> >> >
>> >> > However I am not getting same as you for the units:
>> >> >
>> >> >> plm <- AvgCnPlm(csN, mergeStrands=TRUE, combineAlleles=TRUE);
>> >> >> units <- findUnitsTodo(plm);
>> >> >> str(units);
>> >> >  int [1:1881415] 1 2 3 4 5 6 7 8 9 10 ...
>> >>
>> >> Alright (but has nothing to do with the error you first reported).  If
>> >> so,
>> >> do:
>> >>
>> >> units1 <- fitCnProbes(plm, verbose=verbose);
>> >> str(units1);
>> >> ##  int [1:945826] 935590 935591 935592 935593 935594 935595 935596
>> >> 935597 935598 935599 ...
>> >>
>> >> and you should get the same as:
>> >>
>> >> units <- findUnitsTodo(plm);
>> >> str(units);
>> >> ##  int [1:934968] 622 623 624 625 626 627 628 629 630 631 ...
>> >>
>> >> If so, the try with
>> >>
>> >> units2 <- fit(plm, verbose=-10);
>> >>
>> >> and compare to the output I already pasted in my previous message
>> >> (below).
>> >>
>> >> /Henrik
>> >>
>> >> >
>> >> >
>> >> > On Monday, 26 November 2012 21:15:32 UTC+2, Henrik Bengtsson wrote:
>> >> >>
>> >> >> Hmm... first run this in a fresh R session:
>> >> >>
>> >> >> library("aroma.affymetrix");
>> >> >> cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6", tags="Full");
>> >> >> csR <- AffymetrixCelSet$byName("test", cdf=cdf);
>> >> >> acc <- AllelicCrosstalkCalibration(csR, model="CRMAv2");
>> >> >> csC <- process(acc, verbose=verbose);
>> >> >> bpn <- BasePositionNormalization(csC, target="zero");
>> >> >> csN <- process(bpn, verbose=verbose);
>> >> >>
>> >> >> then what's the output of:
>> >> >> print(csN);
>> >> >> print(getFullNames(csN));
>> >> >> cdf2 <- getCdf(csN);
>> >> >> stopifnot(identical(getChecksum(cdf2), getChecksum(cdf))); #
>> >> >> IMPORTANT
>> >> >>
>> >> >> If you get an error above, what is the output of print(cdf2) and
>> >> >> getChecksum(cdf2)?  If not, continue:
>> >> >>
>> >> >> plm <- AvgCnPlm(csN, mergeStrands=TRUE, combineAlleles=TRUE);
>> >> >> units <- findUnitsTodo(plm);
>> >> >> str(units);
>> >> >> ##  int [1:934968] 622 623 624 625 626 627 628 629 630 631 ...
>> >> >>
>> >> >> Do you get the same?  If so, continue:
>> >> >>
>> >> >> units2 <- fit(plm, verbose=-10);
>> >> >>
>> >> >> You should get something like the following, except from the
>> >> >> dataset/sample names (if not, let me know):
>> >> >>
>> >> >> ## Fitting model of class AvgCnPlm...
>> >> >> ##  AvgCnPlm:
>> >> >> ##  Data set: GSE13372
>> >> >> ##  Chip type: GenomeWideSNP_6,Full
>> >> >> ##  Input tags: testset,ACC,ra,-XY,BPN,-XY
>> >> >> ##  Output tags: testset,ACC,ra,-XY,BPN,-XY,AVG,A+B
>> >> >> ##  Parameters: {probeModel: chr "pm", shift: num 0, flavor: chr
>> >> >> "median", mergeStrands: logi TRUE, combineAlleles: logi TRUE}
>> >> >> ##  Path:
>> >> >> plmData/GSE13372,testset,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6
>> >> >> ##  RAM: 0.01MB
>> >> >> ##  Identifying non-estimated units...
>> >> >> ##   Identifying non-fitted units in chip-effect file...
>> >> >> ##    Pathname:
>> >> >>
>> >> >>
>> >> >>
>> >> >> plmData/GSE13372,testset,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6/GSM337708,chipEffects.CEL
>> >> >> ##    Found indices cached on file
>> >> >> ##    Reading data for these 1881415 cells...
>> >> >> ##    Reading data for these 1881415 cells...done
>> >> >> ##    Looking for stdvs <= 0 indicating non-estimated units:
>> >> >> ##     int [1:934968] 622 623 624 625 626 627 628 629 630 631 ...
>> >> >> ##   Identifying non-fitted units in chip-effect file...done
>> >> >> ##  Identifying non-estimated units...done
>> >> >> ##  Getting model fit for 934968 units.
>> >> >> ##  Identifying unit types:...
>> >> >> ##   Units:
>> >> >> ##    int [1:934968] 622 623 624 625 626 627 628 629 630 631 ...
>> >> >> ##   Unit types:
>> >> >> ##    atomic [1:934968] 2 2 2 2 2 2 2 2 2 2 ...
>> >> >> ##    - attr(*, "types")= Named int [1:8] 0 1 2 3 4 5 6 7
>> >> >> ##     ..- attr(*, "names")= chr [1:8] "unknown" "expression"
>> >> >> "genotyping" "resequencing" ...
>> >> >> ##   unitTypes
>> >> >> ##        2
>> >> >> ##   934968
>> >> >> ##   Unique unit types:
>> >> >> ##   genotyping
>> >> >> ##            2
>> >> >> ##  Identifying unit types:...done
>> >> >> ##  Setting up parameter sets...
>> >> >> ##   Creating CEL file...
>> >> >> ##    Chip type: GenomeWideSNP_6,Full
>> >> >> ##    Pathname:
>> >> >>
>> >> >>
>> >> >>
>> >> >> plmData/GSE13372,testset,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6/probeAffinities.CEL
>> >> >> ##    Returning already existing file.
>> >> >> ##   Creating CEL file...done
>> >> >> ##             used (Mb) gc trigger  (Mb) max used  (Mb)
>> >> >> ##   Ncells  549305 29.4    1606180  85.8 13015638 695.2
>> >> >> ##   Vcells 7692698 58.7   23004541 175.6 40154683 306.4
>> >> >> ##  Setting up parameter sets...done
>> >> >> ##  Fitting AvgCnPlm for each unit type separately...
>> >> >> ##   Unit types:
>> >> >> ##   genotyping
>> >> >> ##            2
>> >> >> ##   Unit type #1 ('genotyping') of 1...
>> >> >> ##    Unit type: genotyping (code=2)
>> >> >> ##    Number of units of this type: 934968
>> >> >> ##     int [1:934968] 622 623 624 625 626 627 628 629 630 631 ...
>> >> >> ##    Fitting the model by unit dimensions (at least for the large
>> >> >> classes)...
>> >> >> ##     Grouping units into equivalent (unit,group,cell)
>> >> >> dimensions...
>> >> >> [...]
>> >> >>
>> >> >> What is particular interesting are the outputs of your unit types
>> >> >> and
>> >> >> unit indices.
>> >> >>
>> >> >> /Henrik
>> >> >>
>> >> >> On Mon, Nov 26, 2012 at 10:45 AM, Ioana Cutcutache
>> >> >> <ioana.cu...@gmail.com> wrote:
>> >> >> >> library("aroma.affymetrix")
>> >> >> >> log <- verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
>> >> >> >> options(digits=4)
>> >> >> >> cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6",
>> >> >> >> tags="Full")
>> >> >> >> print(cdf)
>> >> >> >> gi <- getGenomeInformation(cdf)
>> >> >> >> print(gi)
>> >> >> >> si <- getSnpInformation(cdf)
>> >> >> >> print(si)
>> >> >> >> acs <- AromaCellSequenceFile$byChipType(getChipType(cdf,
>> >> >> >> fullname=FALSE))
>> >> >> >> print(acs)
>> >> >> >> cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6",
>> >> >> >> tags="Full")
>> >> >> >> csR <- AffymetrixCelSet$byName("test", cdf=cdf)
>> >> >> >> print(csR)
>> >> >> >> acc <- AllelicCrosstalkCalibration(csR, model="CRMAv2")
>> >> >> >> print(acc)
>> >> >> >> csC <- process(acc, verbose=verbose)
>> >> >> >> print(csC)
>> >> >> >> bpn <- BasePositionNormalization(csC, target="zero")
>> >> >> >> print(bpn)
>> >> >> >> csN <- process(bpn, verbose=verbose)
>> >> >> >> print(csN)
>> >> >
>> >> > --
>> >> > When reporting problems on aroma.affymetrix, make sure 1) to run the
>> >> > latest
>> >> > version of the package, 2) to report the output of sessionInfo() and
>> >> > traceback(), and 3) to post a complete code example.
>> >> >
>> >> >
>> >> > You received this message because you are subscribed to the Google
>> >> > Groups
>> >> > "aroma.affymetrix" group with website http://www.aroma-project.org/.
>> >> > To post to this group, send email to aroma-af...@googlegroups.com
>> >> > To unsubscribe and other options, go to
>> >> > http://www.aroma-project.org/forum/
>> >
>> > --
>> > When reporting problems on aroma.affymetrix, make sure 1) to run the
>> > latest
>> > version of the package, 2) to report the output of sessionInfo() and
>> > traceback(), and 3) to post a complete code example.
>> >
>> >
>> > You received this message because you are subscribed to the Google
>> > Groups
>> > "aroma.affymetrix" group with website http://www.aroma-project.org/.
>> > To post to this group, send email to aroma-af...@googlegroups.com
>> > To unsubscribe and other options, go to
>> > http://www.aroma-project.org/forum/
>
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest
> version of the package, 2) to report the output of sessionInfo() and
> traceback(), and 3) to post a complete code example.
>
>
> You received this message because you are subscribed to the Google Groups
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-affymetrix@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/

-- 
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group with website http://www.aroma-project.org/.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe and other options, go to http://www.aroma-project.org/forum/

Re: [aroma.affymetrix] problem using CRMA v2

Reply via email to