Hi Yifang

I have discussed this issue with Gordon and Mark R---briefly, this is not 
something to worry about. We do not expect the library sizes of the 
pseudocounts necessarily to be approximately equal when TMM normalization (or 
another type of normalization factor) is applied. That comment in the User's 
Guide dates back to the pre-TMM normalization days and is no longer relevant. I 
will update the User's Guide for the next release to remove that confusing 
statement. Sorry for the angst that note in the Guide has caused.

In answer to your specific questions:
1) Best way to check that the normalization has performed well is to look at 
MA-plots (plotSmear in edgeR). 

2) It does not matter that the normalized library size of the conditions 
differs from the common.lib.size. 

3) Yes, the results are still meaningful even if the library sizes of the 
pseudocounts are different.

4) I suspect that the differences in library sizes of the pseudocounts are 
caused by the same thing that causes differences in normalization factors: sets 
of genes that are highly expressed in a given sample but not in others. For 
some reassurance, you could look at:
colSums(d$pseudo.alt)*d$samples$norm.factors
These values should be quite similar (do not worry too much if they differ), 
which shows the effect of the normalization factors on the effective library 
size. 

5) We routinely remove genes with very low expression levels in our analysis 
and I would recommend it generally. In your experiment you have a minimum group 
size of two, so I would recommend keeping genes that are expressed at a level 
of 0.5 counts per million or greater in at least two samples. This is almost 
exactly what you did in your earlier email. Removing genes with <=40 counts in 
>= 6 samples is probably overdoing it.

Example code (in the next release there will be a convenience function to 
calculate counts per million easily from a DGEList object):
cpm <- 1e06*t( t(d$counts) / d$samples$lib.size )
d <- d[ rowSums( cpm > 0.5 ) >= 2, ]

Mark provided me with a really nice reproducible example (thanks Mark!) that 
illustrates how and why TMM normalization works and also why we do not expect 
the library sizes of the pseudocounts to be equal after normalization factors 
are applied. Try running the code below for yourself:

========================
library(edgeR)

f <- url("http://bioinf.wehi.edu.au/folders/tmm_rnaseq/LK_data.RData";)
load(f)
close(f)

D <- as.matrix(MA.subsetA$M)
g <- as.character(MA.subsetA$genes$EnsemblGeneID)

source("http://bioinf.wehi.edu.au/folders/tmm_rnaseq/functions.R";)

libSizes <- colSums(D)
foldDiff <- 3
pUp <- .8
pDifferential=0.1

xx <- generateDataset(commonTags=2e4, uniqueTags=c(3000,100), 
foldDifference=foldDiff, 
                     pUp=pUp, pDifferential=pDifferential, empiricalDist=D[,1], 
libLimits=rep(1e6,2))

d <- DGEList(counts=xx$DATA, group=c(1,2))
d1 <- calcNormFactors(d)

par(mfrow=c(1,2))
plotSmear(d, ylim=c(-5,5))
plotSmear(d1, ylim=c(-5,5))

d <- estimateCommonDisp(d)
colSums(d$pseudo.alt)
d1 <- estimateCommonDisp(d1)
colSums(d1$pseudo.alt)
==========================

Hope that clarifies a few things for you.

Best wishes
Davis

P.S. In general it is good form not to re-post the same question again on the 
list. We receive and read questions on the BioC mailing lists, but sometimes it 
can be a few days before we can make time to respond.



On Jun 22, 2011, at 7:50 AM, yifang tan wrote:

> 
> Hi Davis/Gordon:
> I posted my question here again hope you can see it. 
> When I tried edgeR and met a problem with the number of pseudocounts 
> for each library after normalization, which should come to close numbers. 
> This have been addressed in edgeR 
> several times that "the total counts in each libray of the pseudocounts 
> agrees well with the common library size" (page 27 & 44 of the 
> user's guide), but my result are quite different between treatments although 
> for the replicates within treatment the  pseudocounts are very similar.  I 
> can't get to the common.lib.size for each treatment after I tried several 
> methods (TMM, RLE and quantile).
> 1) Did I miss anything during my run with edgeR? How can I assure the 
> normalization went well?
> 
> 2) Does the normalized library size of the conditions matter or NOT, if they 
> are different from the common.lib.size?
> 
> 
> 3) Is the result still meaningful even the library sizes of  pseudocounts are 
> different? 
> 
> 
> 4) What could probably be the reason(s) to cause the library sizes of  
> pseudocounts so different?
> 
> 
> 5) Should I remove the smaller number reads as some other people do? 
> After I removed the smaller numbers of counts (<=40 in >=6 out of 
> 14 samples), the normalized library sizes become very similar.
> 
> 
> I can feel my lack of mathematics for the packages. I attach part of my code 
> here. 
> 
> ---------------------------------------------------------------------
> d$samples$lib.size
> #"Zygote1",  21012147
> "Zygote2",  19924212 
> "Octant1",   9660245    
> "Octant2",  26002900 
> "Globular1",17139388      
> "Globular2", 7649319   
> "Heart1",   16430105  
> "Heart2",   20101956   
> "Torpedo1", 12920266  
> "Torpedo2",  6306742 
> "Bent1",    44241095 
> "Bent2",    20094409 
> "Mature1",  15166090 
> "Mature2",  23203758
> 
> d$common.lib.size                   
> [1] 16554344.47
> 
> colSums(d$pseudo.alt)
> # Zygote1  21523774.62
> Zygote2    21638415.63
> Octant1    14533481.82
> Octant2    12046955.46
> Globular1  18920316.62
> Globular2  18439528.30
> Heart1    11754608.30
> Heart2    12759230.11
> Torpedo1  11248245.52 
> Torpedo2  11410667.92 
> Bent1     16101723.65
> Bent2     17980670.24 
> Mature1   26785396.02 
> Mature2   27067289.80 
> #   
> 
>> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-pc-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8 
>        LC_COLLATE=en_CA.UTF-8    
> [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8    
> LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C             
> LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] ALL_1.4.7      Biobase_2.12.1 limma_3.8.2    edgeR_2.2.5   
> loaded via a namespace (and not attached):
> [1] tools_2.13.0
> ---------------------------------------------------------------------
> [[elided Hotmail spam]]
> 
> 
> Yifang
>               
> 
> Yifang Tan
>                                         
>       [[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

------------------------------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
[email protected]
http://www.wehi.edu.au




______________________________________________________________________
The information in this email is confidential and intend...{{dropped:8}}

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to