Yifang, It really is not important, and there is little to be gained by looking at these values. Once normalization has been applied (or when it is necessary), you cannot draw reliable conclusions from the raw counts, the pseudocounts, or the library sizes.
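To see concretely what the normalization factors do and do not touch, here is a minimal sketch with simulated, made-up counts (not data from this thread); the only edgeR calls used are the ones already discussed in the thread:

```r
# Minimal sketch (simulated counts) of how TMM factors rescale the
# *effective* library sizes rather than the counts themselves.
library(edgeR)

set.seed(1)
counts <- matrix(rnbinom(2000, mu = 10, size = 2), ncol = 4,
                 dimnames = list(NULL, paste0("sample", 1:4)))

d <- DGEList(counts = counts)
d <- calcNormFactors(d)  # TMM by default; fills d$samples$norm.factors

# The factors adjust the effective library sizes; the counts are untouched.
# These effective sizes need not be (and usually are not) equal.
eff.lib <- d$samples$lib.size * d$samples$norm.factors

# Counts per million against the effective library sizes
# (the second formula quoted in this thread):
cpm.tmm <- 1e06 * t(t(d$counts) / eff.lib)
```

Later edgeR releases provide a `cpm()` convenience function that performs this calculation directly from a DGEList object.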
We use the normalization factors to adjust the library sizes so that we can compare the expression level of a gene across different samples, whose compositions may be quite different. When we allow different compositions across samples (e.g. sets of genes that are very highly expressed in one sample but not in others), there is absolutely no requirement or expectation for the new library sizes to be approximately equal across all samples. The DE testing is still valid---in fact, even more valid than if we did not use the normalization factors to adjust the effective library size. In the context of the mailing list I'm not sure that I can offer any more explanation. I could perhaps provide a little example to demonstrate further (although I had hoped that by working through Mark R's example, which I posted previously, you would get a feel for how and why TMM normalization works and is useful). However, I don't really want to distract you with more material that is only tangential to your real question of interest, which I'm sure is finding out which genes in your dataset show significant evidence of differential expression. The library sizes, adjusted or not, are not important for this task.

Cheers,
Davis

On Jun 25, 2011, at 8:13 AM, yifang tan wrote:

> Thanks Davis.
> The problems are clear now. Just curious about the two formulas:
>
> >>> norm_counts.table <- t(t(d$pseudo.alt)*(d$samples$norm.factors))
> >>> This is *not* correct. The TMM normalization factors refer to library
> >>> sizes only and should not be used to normalize counts in this way.
>
> However, I checked the sums of each library, which are VERY close to the
> common.lib.size.
> as:
> #  Zygote1  Zygote2  Octant1  Octant2 Globular1 Globular2   Heart1   Heart2 Torpedo1 Torpedo2    Bent1    Bent2  Mature1  Mature2
> # 16538549 16542204 16541874 16540999  16544275  16548353 16536743 16539129 16541883 16546470 16545841 16547124 16544220 16541251
>
> > d$common.lib.size
> [1] 16554344.47
>
> While using your formula:
> >>> norm_cpm.table <- 1e06*t(t(d$counts) / d$samples$lib.size)
> I got colSums(norm_cpm.table) as:
> #  Zygote1  Zygote2  Octant1  Octant2 Globular1 Globular2   Heart1   Heart2 Torpedo1 Torpedo2    Bent1    Bent2  Mature1  Mature2
> # 999024.9 999414.7 999240.4 999097.4  999444.2  999516.7 998947.1 999007.6 999268.1   999407 999502.6   999476 999255.3 999224.1
>
> Or, if I use
> >>> cpm <- 1e06*t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors) )
> I got colSums as follows:
>   Zygote1   Zygote2   Octant1   Octant2 Globular1 Globular2    Heart1    Heart2  Torpedo1  Torpedo2     Bent1     Bent2   Mature1   Mature2
> 1300191.3 1307096.6  877903.3  727683.3 1142933.3 1113715.6  710072.8  770758.1  679471.8  689203.1  972693.7 1086163.1 1617750.8 1635113.6
>
> The last set is similar to the one that brought up my question in my first email. The point is with the other two. The first formula gave numbers close to common.lib.size, which made sense to me, and that's why I thought it was the right formula in my correction email, but it is wrong! The second formula gave similar summed counts for each library, which are quite different from the common.lib.size, but it is the right one! Maybe this is not important to me anymore, but it still bugs me, as it was not addressed in the paper (Robinson & Oshlack, 2010) or in the manuals. Do you by chance have any explanation for this?
> Thank you!
>
> Yifang
>
> Yifang Tan
>
> From: [email protected]
> Subject: Re: [Bioc-sig-seq] edgeR common library size question.
> Date: Fri, 24 Jun 2011 12:21:03 +1000
> To: [email protected]
>
> Hi Yifang
>
> I've received a few emails with edgeR questions from you overnight, two of which look the same, so I'll try to answer all of your questions here and hope that I haven't missed anything important.
>
> (1)
> Normalized counts are a useful way visually to compare relative expression levels across samples for a gene---but they should not (of course) be used for DE analysis in edgeR directly.
>
> In a later email you proposed this formula for computing normalized counts:
> norm_counts.table <- t(t(d$pseudo.alt)*(d$samples$norm.factors))
>
> This is *not* correct. The TMM normalization factors refer to library sizes only and should not be used to normalize counts in this way. At the moment we favour looking at counts per million as a normalized measure---it is easy to understand how they are calculated and the scale is a convenient one for interpretation. As I wrote in my previous email, these can be computed as follows:
> cpm <- 1e06*t( t(d$counts) / d$samples$lib.size )
> If you wish to include TMM normalization to adjust the library sizes when getting counts per million (a good idea), then this will do the job:
> cpm <- 1e06*t( t(d$counts) / (d$samples$lib.size*d$samples$norm.factors) )
> In principle, the pseudocounts are an appropriate normalized version of the counts for visual comparison, but as we've seen they become less transparent once normalization factors are used in the analysis.
>
> The TMM normalization factors change the *effective* library size; they are not used to normalize the *count data*. That is what is meant by normalization in this context. We still make a library size adjustment, but we normalize the library sizes with TMM to deal with compositional bias etc. It really is beyond the limits of my time and desire to explain here in greater detail the hows and whys.
> The TMM paper (Robinson & Oshlack, 2010) has the full explanations, so I would refer you there.
>
> (2)
> You are losing your rownames because of the way you are manipulating your data in R---this is not an edgeR problem. Your code:
>
> countsTable <- read.delim("Deep_seq_data.csv", header=T, stringsAsFactors=T)
>
> # Remove the first column which is FeatureID as AGI number
> d<-countsTable[, -1]
>
> conditions<-rep(c("Zygote", "Octant", "Globular", "Heart", "Torpedo", "Bent", "Mature"), each=2)
>
> #set the row name for database, believed to be mySQL
> rownames(d)<-countsTable[,1]
>
> You assign rownames to the object "d" here.
>
> # make a DGE object
> d<-DGEList(counts=countsTable[,-1], group=conditions)
>
> but when you define your DGEList, you use countsTable[,-1], which (I can only assume) has no rownames.
>
> If you simply call the line rownames(d)<-countsTable[,1] *after* you have defined your DGEList object d, then you will have all of your rownames throughout the rest of your analysis.
>
> This problem:
> detags.com.ZO <- rownames(topTags(de.com.ZO, n=20592)$table)  #n=20592 is the total row number in the analysis
>
> #Always got a problem with this line even if I declare the total number of rows in full as above.
>
> d$counts[detags.com.ZO, ]
> Error: subscript out of bounds
>
> should disappear once you have your rownames correctly defined as suggested above. I believe the problem here is that rownames(d$counts) is NULL, and rownames(topTags(de.com.ZO, n=Inf)$table) are "tag.1", "tag.2" etc., so they do not match up. If you want to include all genes in your data set, then the neatest way is to set n=Inf in your call to topTags as shown here.
>
> Hope all of that gets you running smoothly.
>
> Best wishes
> Davis
>
> On Jun 24, 2011, at 6:31 AM, yifang tan wrote:
>
> > Thanks Davis!
> It went very well. After I checked the numbers with colSums(d$pseudo.alt)*d$samples$norm.factors, the pseudocount sums of each library are very similar to each other, as expected.
> To follow up on this point, I have another question about the meaning of the above formula. What does it mean in mathematics and/or biology, if anything?
> If I call the pseudo-counts (from d$pseudo.alt) the library-size-adjusted counts (right?), can I call the new numbers d$pseudo.alt*d$samples$norm.factors the normalized counts? The calculation of pseudocounts is explained in the User's Guide, but not this one, except for one sentence that might be related in the man page of calcNormFactors(): "For symmetry, normalization factors are adjusted to multiply to 1". It is still not very clear to me.
>
> The reason I want to get this "normalized number" is to compare the relative expression level of each gene across the conditions. I am somewhat confused about the difference between the library size adjustment and the normalization (this is mentioned on page 3 of the user's guide). I was trying to convince my colleague that the pseudo-counts of a gene should be used to compare relative expression levels, but this formula integrates both the library size and the normalization factor, which seems somewhat redundant---am I right?
>
> Another general question is about the row names of the data. I found that after the DGEList object was created, the row names were gone. Before the analysis I assigned the row names to the IDs, but the results are labeled with tag.1, tag.2, tag.53 etc. when the analysis is done:
>
> > head(de.com.ZO$table)
>             logConc         logFC         p.value
> tag.1  -17.48306247  2.1215798358 2.723289267e-05
> tag.2  -17.06332866  0.6656604697 1.705717572e-01
> tag.3  -11.99482641 -0.3162543327 4.915230723e-01
> tag.4  -15.76854867 -0.5882280375 2.121440211e-01
> tag.5  -15.22523652  1.9007994967 7.434382812e-05
>
> How do I keep the rownames always in position?
> This may be related to R data manipulation, but with edgeR I felt lost, especially after I filtered out the genes with extremely low cpm, as the dataset is then different from the original one.
>
> ################ This is what I used, as copied from the guide ###############
> countsTable <- read.delim("Deep_seq_data.csv", header=T, stringsAsFactors=T)
>
> # Remove the first column which is FeatureID as AGI number
> d<-countsTable[, -1]
>
> conditions<-rep(c("Zygote", "Octant", "Globular", "Heart", "Torpedo", "Bent", "Mature"), each=2)
>
> #set the row name for database, believed to be mySQL
> rownames(d)<-countsTable[,1]
>
> # make a DGE object
> d<-DGEList(counts=countsTable[,-1], group=conditions)
>
> #check the counts
> dim(d)
> head(d$counts)
> head(d$samples)
>
> # Filter genes with >=1 counts per million, in at least 2 samples
> d<-d[rowSums(1e+06*d$counts/expandAsMatrix(d$samples$lib.size, dim(d))>=1)>=2,]
>
> #get the gene number satisfying the above condition, and for display use of topTags
> dim(d)
> d
>
> d
> An object of class "DGEList"
> $samples
>              group lib.size norm.factors
> Zygote1     Zygote 21012147            1
> Zygote2     Zygote 19924212            1
> Octant1     Octant 26002900            1
> Octant2     Octant  9660245            1
> Globular1 Globular 17139388            1
> 9 more rows ...
>
> $counts
>      Zygote1 Zygote2 Octant1 Octant2 Globular1 Globular2 Heart1 Heart2 Torpedo1 Torpedo2 Bent1 Bent2 Mature1 Mature2
> [1,]      40      42     322     158       616       402   1511   1897     1122      640  1121   351       0      36
> [2,]     115      68     270     123        94        37    403    517       83       24   166    66       2      22
> [3,]    4228    4340    4794    3678      1527      1146    960   1491      747      465  2831  2138   11880   24451
> [4,]     242     441     369     223        72        40    628    374      223      111  1796   426     106     133
> [5,]     220     204    1427     699      1233       332   1108   1011      474      284   499   129      96     196
> 20587 more rows ...
>
> $all.zeros
> [1] FALSE FALSE FALSE FALSE FALSE
> 20587 more elements ...
>
> ......
> all other steps went well without any problem, except the following one
> ......
> detags.com.ZO <- rownames(topTags(de.com.ZO, n=20592)$table)  #n=20592 is the total row number in the analysis
>
> #Always got a problem with this line, even if I declare the total number of rows in full as above.
>
> d$counts[detags.com.ZO, ]
> Error: subscript out of bounds
>
> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-pc-linux-gnu (64-bit)
> locale:
> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] ALL_1.4.7 Biobase_2.12.1 limma_3.8.2 edgeR_2.2.5
> loaded via a namespace (and not attached):
> [1] tools_2.13.0
>
> ##############################################################
>
> My questions are long, but I feel that once these two questions are cleared up, I will be ready to go with edgeR.
>
> Thank you!
>
> Yifang
>
> Yifang Tan
>
> Subject: Re: [Bioc-sig-seq] edgeR common library size question.
> From: [email protected]
> Date: Wed, 22 Jun 2011 12:28:56 +1000
> CC: [email protected]
> To: [email protected]
>
> Hi Yifang
>
> I have discussed this issue with Gordon and Mark R---briefly, this is not something to worry about. We do not expect the library sizes of the pseudocounts necessarily to be approximately equal when TMM normalization (or another type of normalization factor) is applied. That comment in the User's Guide dates back to the pre-TMM normalization days and is no longer relevant. I will update the User's Guide for the next release to remove that confusing statement. Sorry for the angst that note in the Guide has caused.
>
> In answer to your specific questions:
> 1) The best way to check that the normalization has performed well is to look at MA-plots (plotSmear in edgeR).
>
> 2) It does not matter that the normalized library size of the conditions differs from the common.lib.size.
>
> 3) Yes, the results are still meaningful even if the library sizes of the pseudocounts are different.
>
> 4) I suspect that the differences in library sizes of the pseudocounts are caused by the same thing that causes differences in normalization factors: sets of genes that are highly expressed in a given sample but not in others. For some reassurance, you could look at:
> colSums(d$pseudo.alt)*d$samples$norm.factors
> These values should be quite similar (do not worry too much if they differ), which shows the effect of the normalization factors on the effective library size.
>
> 5) We routinely remove genes with very low expression levels in our analysis and I would recommend it generally. In your experiment you have a minimum group size of two, so I would recommend keeping genes that are expressed at a level of 0.5 counts per million or greater in at least two samples. This is almost exactly what you did in your earlier email. Removing genes with <=40 counts in >=6 samples is probably overdoing it.
>
> Example code (in the next release there will be a convenience function to calculate counts per million easily from a DGEList object):
> cpm <- 1e06*t( t(d$counts) / d$samples$lib.size )
> d <- d[ rowSums( cpm > 0.5 ) >= 2, ]
>
> Mark provided me with a really nice reproducible example (thanks Mark!) that illustrates how and why TMM normalization works and also why we do not expect the library sizes of the pseudocounts to be equal after normalization factors are applied.
> Try running the code below for yourself:
>
> ========================
> library(edgeR)
>
> f <- url("http://bioinf.wehi.edu.au/folders/tmm_rnaseq/LK_data.RData")
> load(f)
> close(f)
>
> D <- as.matrix(MA.subsetA$M)
> g <- as.character(MA.subsetA$genes$EnsemblGeneID)
>
> source("http://bioinf.wehi.edu.au/folders/tmm_rnaseq/functions.R")
>
> libSizes <- colSums(D)
> foldDiff <- 3
> pUp <- .8
> pDifferential <- 0.1
>
> xx <- generateDataset(commonTags=2e4, uniqueTags=c(3000,100),
>                       foldDifference=foldDiff,
>                       pUp=pUp, pDifferential=pDifferential,
>                       empiricalDist=D[,1], libLimits=rep(1e6,2))
>
> d <- DGEList(counts=xx$DATA, group=c(1,2))
> d1 <- calcNormFactors(d)
>
> par(mfrow=c(1,2))
> plotSmear(d, ylim=c(-5,5))
> plotSmear(d1, ylim=c(-5,5))
>
> d <- estimateCommonDisp(d)
> colSums(d$pseudo.alt)
> d1 <- estimateCommonDisp(d1)
> colSums(d1$pseudo.alt)
> ========================
>
> Hope that clarifies a few things for you.
>
> Best wishes
> Davis
>
> P.S. In general it is good form not to re-post the same question on the list. We receive and read questions on the BioC mailing lists, but sometimes it can be a few days before we can make time to respond.
>
> On Jun 22, 2011, at 7:50 AM, yifang tan wrote:
>
> > Hi Davis/Gordon:
> > I posted my question here again in the hope that you can see it.
> > When I tried edgeR, I met a problem with the numbers of pseudocounts for each library after normalization, which should come out close to each other. This has been addressed in edgeR several times---"the total counts in each library of the pseudocounts agree well with the common library size" (pages 27 & 44 of the user's guide)---but my results are quite different between treatments, although for the replicates within a treatment the pseudocounts are very similar. I can't get to the common.lib.size for each treatment after trying several methods (TMM, RLE and quantile).
> > 1) Did I miss anything during my run with edgeR? How can I be sure the normalization went well?
> >
> > 2) Do the normalized library sizes of the conditions matter or not, if they are different from the common.lib.size?
> >
> > 3) Is the result still meaningful even if the library sizes of the pseudocounts are different?
> >
> > 4) What could be the reason(s) that the library sizes of the pseudocounts are so different?
> >
> > 5) Should I remove the low-count reads, as some other people do? After I removed the low counts (<=40 in >=6 out of 14 samples), the normalized library sizes became very similar.
> >
> > I can feel my lack of mathematical background for these packages. I attach part of my code here.
> >
> > ---------------------------------------------------------------------
> > d$samples$lib.size
> > # "Zygote1",   21012147
> > "Zygote2",   19924212
> > "Octant1",    9660245
> > "Octant2",   26002900
> > "Globular1", 17139388
> > "Globular2",  7649319
> > "Heart1",    16430105
> > "Heart2",    20101956
> > "Torpedo1",  12920266
> > "Torpedo2",   6306742
> > "Bent1",     44241095
> > "Bent2",     20094409
> > "Mature1",   15166090
> > "Mature2",   23203758
> >
> > d$common.lib.size
> > [1] 16554344.47
> >
> > colSums(d$pseudo.alt)
> > # Zygote1   21523774.62
> > Zygote2   21638415.63
> > Octant1   14533481.82
> > Octant2   12046955.46
> > Globular1 18920316.62
> > Globular2 18439528.30
> > Heart1    11754608.30
> > Heart2    12759230.11
> > Torpedo1  11248245.52
> > Torpedo2  11410667.92
> > Bent1     16101723.65
> > Bent2     17980670.24
> > Mature1   26785396.02
> > Mature2   27067289.80
> >
> > sessionInfo()
> > R version 2.13.0 (2011-04-13)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > locale:
> > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
> > [5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> > other attached packages:
> > [1] ALL_1.4.7 Biobase_2.12.1 limma_3.8.2 edgeR_2.2.5
> > loaded via a namespace (and not attached):
> > [1] tools_2.13.0
> > ---------------------------------------------------------------------
> >
> > Yifang
> >
> > Yifang Tan
------------------------------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
[email protected]
http://www.wehi.edu.au

______________________________________________________________________
The information in this email is confidential and intended solely for the addressee.
You must not disclose, forward, print or use it without the permission of the sender.
______________________________________________________________________

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
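The rownames fix discussed in this thread can be sketched as follows. This is a minimal sketch with made-up counts and hypothetical AGI identifiers, not the poster's actual data; the key point is the order of the two assignments:

```r
# Minimal sketch of the rownames fix: assign rownames *after* the DGEList
# is created, so they survive into the downstream results (made-up data).
library(edgeR)

counts <- matrix(c(40, 42,
                   115, 68,
                   4228, 4340), ncol = 2, byrow = TRUE)
ids <- c("AT1G01010", "AT1G01020", "AT1G01030")  # hypothetical AGI IDs
conditions <- c("Zygote", "Zygote")

d <- DGEList(counts = counts, group = conditions)

# Assigning rownames to the count matrix inside the DGEList (not to the
# original data frame) keeps them attached throughout the analysis:
rownames(d$counts) <- ids
```

With rownames in place, `rownames(topTags(..., n = Inf)$table)` returns gene IDs rather than "tag.1", "tag.2", etc., so subsetting `d$counts` by those names no longer fails with "subscript out of bounds".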
