Thanks Davis.
The problems are clear now. I am just curious about the two formulas:
>>> norm_counts.table <- t(t(d$pseudo.alt)*(d$samples$norm.factors))
>>> This is *not* correct. The TMM normalization factors refer to library sizes
>>> only and should not be used to normalize counts in this way.
However, I checked the column sum of each library from that formula, and they
are VERY close to the common.lib.size:
#   Zygote1   Zygote2   Octant1   Octant2 Globular1 Globular2    Heart1
#  16538549  16542204  16541874  16540999  16544275  16548353  16536743
#    Heart2  Torpedo1  Torpedo2     Bent1     Bent2   Mature1   Mature2
#  16539129  16541883  16546470  16545841  16547124  16544220  16541251
> d$common.lib.size
[1] 16554344.47
While using your formula:
>>> norm_cpm.table <- 1e06*t(t(d$counts) / d$samples$lib.size)
I got colSums(norm_cpm.table) as:
#   Zygote1   Zygote2   Octant1   Octant2 Globular1 Globular2    Heart1
#  999024.9  999414.7  999240.4  999097.4  999444.2  999516.7  998947.1
#    Heart2  Torpedo1  Torpedo2     Bent1     Bent2   Mature1   Mature2
#  999007.6  999268.1  999407.0  999502.6  999476.0  999255.3  999224.1
Or, if I use:
>>> cpm <- 1e06*t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors) )
I got colSums as follows:
  Zygote1   Zygote2   Octant1   Octant2 Globular1 Globular2    Heart1
1300191.3 1307096.6  877903.3  727683.3 1142933.3 1113715.6  710072.8
   Heart2  Torpedo1  Torpedo2     Bent1     Bent2   Mature1   Mature2
 770758.1  679471.8  689203.1  972693.7 1086163.1 1617750.8 1635113.6
The last set is similar to the one that prompted my question in my first email.
My point concerns the other two. The first formula gave numbers close to the
common.lib.size, which made sense to me and is why, in my correction email, I
thought it was the right formula, but it is wrong! The second formula gave
similar column sums for each library, which are quite different from the
common.lib.size, but it is the right one! Maybe this is no longer important for
my analysis, but it still bugs me, as it was not addressed in the paper
(Robinson & Oshlack, 2010) or in the manuals. Do you by any chance have an
explanation for this?
Thank you!
Yifang
Yifang Tan
From: [email protected]
Subject: Re: [Bioc-sig-seq] edgeR common library size question.
Date: Fri, 24 Jun 2011 12:21:03 +1000
To: [email protected]
Hi Yifang
I've received a few emails with edgeR questions from you overnight, two of
which look the same, so I'll try to answer all of your questions here and hope
that I haven't missed anything important.
(1) Normalized counts are a useful way to visually compare relative expression
levels across samples for a gene---but of course they should not be used
directly for DE analysis in edgeR.
In a later email you proposed this formula for computing normalized counts:

norm_counts.table <- t(t(d$pseudo.alt)*(d$samples$norm.factors))

This is *not* correct. The TMM normalization factors refer to library sizes
only and should not be used to normalize counts in this way. At the moment we
favour looking at counts per million as a normalized measure---it is easy to
understand how they are calculated and the scale is a convenient one for
interpretation. As I wrote in my previous email, these can be computed as
follows:

cpm <- 1e06*t( t(d$counts) / d$samples$lib.size )

If you wish to include TMM normalization to adjust the library sizes when
getting counts per million (a good idea), then this will do the job:

cpm <- 1e06*t( t(d$counts) / (d$samples$lib.size*d$samples$norm.factors) )

In principle, the pseudocounts are an appropriate normalized version of the
counts for visual comparison, but as we've seen they become less transparent
once normalization factors are used in the analysis.
The TMM normalization factors change the *effective* library size; they are
not used to normalize the *count data*. That is what is meant by normalization
in this context. We still make a library size adjustment, but we normalize the
library sizes with TMM to deal with compositional bias etc. It really is
beyond the limits of my time and desire to explain the hows and whys here in
greater detail. The TMM paper (Robinson & Oshlack, 2010) has the full
explanations, so I would refer you there.
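
To see the arithmetic behind this concretely, here is a minimal sketch
(assuming d is a DGEList on which calcNormFactors has been run, so
d$samples$norm.factors holds the TMM factors) of why the column sums of the
TMM-adjusted counts per million are not expected to be equal:

# TMM rescales the library sizes, not the counts themselves
eff.lib.size <- d$samples$lib.size * d$samples$norm.factors

# Plain CPM: column j sums to 1e6*colSums(d$counts)[j]/lib.size[j],
# which is ~1e6 (slightly less if low-count genes were filtered out)
cpm.plain <- 1e06 * t( t(d$counts) / d$samples$lib.size )
colSums(cpm.plain)

# TMM-adjusted CPM: column j sums to ~1e6/norm.factors[j], so the
# column sums *should* differ from library to library
cpm.tmm <- 1e06 * t( t(d$counts) / eff.lib.size )
colSums(cpm.tmm)
1e06 / d$samples$norm.factors   # approximately matches colSums(cpm.tmm)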
(2) You are losing your rownames because of the way you are manipulating your
data in R---this is not an edgeR problem. Your code:

countsTable <- read.delim("Deep_seq_data.csv", header=T, stringsAsFactors=T)
# Remove the first column which is FeatureID as AGI number
d<-countsTable[, -1]
conditions<-rep(c("Zygote", "Octant", "Globular", "Heart", "Torpedo", "Bent",
"Mature"), each=2)
#set the row name for database, believed to be mySQL
rownames(d)<-countsTable[,1]

You assign rownames to the object "d" here.

# make a DGE object
d<-DGEList(counts=countsTable[,-1], group=conditions)

but when you define your DGEList, you use countsTable[,-1], which (I can only
assume) has no rownames.
If you simply call the line rownames(d)<-countsTable[,1] *after* you have
defined your DGEList object d, then you will have all of your rownames
throughout the rest of your analysis.
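
A minimal sketch of the corrected ordering, using the same objects as in your
code:

# Build the DGEList first ...
d <- DGEList(counts=countsTable[,-1], group=conditions)
# ... and only then attach the FeatureIDs as rownames, so they live on
# the DGEList itself and are carried through the analysis
rownames(d) <- countsTable[,1]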
This problem:

detags.com.ZO <- rownames(topTags(de.com.ZO, n=20592)$table)
# n=20592 is the total row number in the analysis
# Always got a problem with this line even though I declare the total number
# of rows in full as above.
> d$counts[detags.com.ZO, ]
Error: subscript out of bounds

should disappear once you have your rownames correctly defined as suggested
above. I believe the problem here is that rownames(d$counts) is NULL, and
rownames(topTags(de.com.ZO, n=Inf)$table) are "tag.1", "tag.2" etc., so they do
not match up. If you want to include all genes in your data set, then the
neatest way is to set n=Inf in your call to topTags as shown here.
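
In code, the fixed version of that step might look like this (a sketch,
assuming de.com.ZO is the result of your DE test as in your code):

# n=Inf returns the full table, so the row count never needs hard-coding
detags.com.ZO <- rownames(topTags(de.com.ZO, n=Inf)$table)
# with matching rownames in place, the subsetting works
head(d$counts[detags.com.ZO, ])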
Hope all of that gets you running smoothly.
Best wishes
Davis
On Jun 24, 2011, at 6:31 AM, yifang tan wrote:
Thanks Davis!
It went very well: after I checked the numbers with
colSums(d$pseudo.alt)*d$samples$norm.factors, the pseudocount sums of each
library are very similar to each other, as expected.
Following on from this point, I have another question about the meaning of the
above formula. What does it mean mathematically and/or biologically, if
anything?
If I call the pseudocounts (from d$pseudo.alt) the library-size-adjusted
counts (right?), can I call the new numbers
"d$pseudo.alt*d$samples$norm.factors" the normalized counts? The calculation
of the pseudocounts is explained in the User's Guide, but this quantity is
not, except for one possibly related sentence in the man page of
calcNormFactors(): "For symmetry, normalization factors are adjusted to
multiply to 1". That sentence is still not clear to me.
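
(As a quick numerical check of that sentence, assuming calcNormFactors() has
been run on d, the product of the factors should indeed come out as 1:)

prod(d$samples$norm.factors)   # should be (essentially) exactly 1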
The reason I want to get this "normalized number" is to compare the relative
expression level of each gene across the conditions. I am somewhat confused
about the difference between the library size adjustment and the normalization
(this is mentioned on page 3 of the User's Guide). When I was trying to
convince my colleague that the pseudocounts of a gene should be used to
compare relative expression levels, I noticed that this formula combines both
the library size and the normalization factor, which seems somewhat redundant.
Am I right?
Another general question concerns the row names of the data. I found that
after the DGEList object was created, the row names were gone. Before the
analysis I assigned the row names to the IDs, but the results are labeled
tag.1, tag.2, tag.53 etc. when the analysis is done:
> head(de.com.ZO$table)
           logConc         logFC         p.value
tag.1 -17.48306247  2.1215798358 2.723289267e-05
tag.2 -17.06332866  0.6656604697 1.705717572e-01
tag.3 -11.99482641 -0.3162543327 4.915230723e-01
tag.4 -15.76854867 -0.5882280375 2.121440211e-01
tag.5 -15.22523652  1.9007994967 7.434382812e-05
How do I keep the rownames in place throughout? This may be a matter of R data
manipulation rather than edgeR, but I felt lost, especially after I filtered
out the genes with extremely low cpm, since that dataset differs from the
original one.
################This is what I used as copied from the guide###############
countsTable <- read.delim("Deep_seq_data.csv", header=T, stringsAsFactors=T)
# Remove the first column which is FeatureID as AGI number
d<-countsTable[, -1]
conditions<-rep(c("Zygote", "Octant", "Globular", "Heart", "Torpedo", "Bent",
"Mature"), each=2)
#set the row name for database, believed to be mySQL
rownames(d)<-countsTable[,1]
# make a DGE object
d<-DGEList(counts=countsTable[,-1], group=conditions)
#check the counts
dim(d)
head(d$counts)
head(d$samples)
# Filter genes with >=1 count per million in at least 2 samples
d<-d[rowSums(1e+06*d$counts/expandAsMatrix(d$samples$lib.size, dim(d))>=1)>=2,]
# Get the number of genes satisfying the above condition; also used for the
# display in topTags
dim(d)
d
> d
An object of class "DGEList"
$samples
group lib.size norm.factors
Zygote1 Zygote 21012147 1
Zygote2 Zygote 19924212 1
Octant1 Octant 26002900 1
Octant2 Octant 9660245 1
Globular1 Globular 17139388 1
9 more rows ...
$counts
     Zygote1 Zygote2 Octant1 Octant2 Globular1 Globular2 Heart1 Heart2
[1,]      40      42     322     158       616       402   1511   1897
[2,]     115      68     270     123        94        37    403    517
[3,]    4228    4340    4794    3678      1527      1146    960   1491
[4,]     242     441     369     223        72        40    628    374
[5,]     220     204    1427     699      1233       332   1108   1011
     Torpedo1 Torpedo2 Bent1 Bent2 Mature1 Mature2
[1,]     1122      640  1121   351       0      36
[2,]       83       24   166    66       2      22
[3,]      747      465  2831  2138   11880   24451
[4,]      223      111  1796   426     106     133
[5,]      474      284   499   129      96     196
20587 more rows ...
$all.zeros
[1] FALSE FALSE FALSE FALSE FALSE
20587 more elements ...
......
all other steps went well without any problem except the following one
......
detags.com.ZO <- rownames(topTags(de.com.ZO, n=20592)$table)
# n=20592 is the total row number in the analysis
# Always got a problem with this line even though I declare the total number
# of rows in full as above.
> d$counts[detags.com.ZO, ]
Error: subscript out of bounds
sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ALL_1.4.7 Biobase_2.12.1 limma_3.8.2 edgeR_2.2.5
loaded via a namespace (and not attached):
[1] tools_2.13.0
##############################################################
My questions are long, but I feel that once these two questions are cleared
up, I will be ready to go with edgeR.
Thank you!
Yifang
Yifang Tan
Subject: Re: [Bioc-sig-seq] edgeR common library size question.
From: [email protected]
Date: Wed, 22 Jun 2011 12:28:56 +1000
CC: [email protected]
To: [email protected]
Hi Yifang
I have discussed this issue with Gordon and Mark R---briefly, this is not
something to worry about. We do not expect the library sizes of the
pseudocounts necessarily to be approximately equal when TMM normalization (or
another type of normalization factor) is applied. That comment in the User's
Guide dates back to the pre-TMM normalization days and is no longer relevant. I
will update the User's Guide for the next release to remove that confusing
statement. Sorry for the angst that note in the Guide has caused.
In answer to your specific questions:
1) The best way to check that the normalization has performed well is to look
at MA-plots (plotSmear in edgeR).
2) It does not matter that the normalized library size of the conditions
differs from the common.lib.size.
3) Yes, the results are still meaningful even if the library sizes of the
pseudocounts are different.
4) I suspect that the differences in library sizes of the pseudocounts are
caused by the same thing that causes differences in normalization factors:
sets of genes that are highly expressed in a given sample but not in others.
For some reassurance, you could look at:

colSums(d$pseudo.alt)*d$samples$norm.factors

These values should be quite similar (do not worry too much if they differ),
which shows the effect of the normalization factors on the effective library
size.
5) We routinely remove genes with very low expression levels in our analysis
and I would recommend it generally. In your experiment you have a minimum
group size of two, so I would recommend keeping genes that are expressed at a
level of 0.5 counts per million or greater in at least two samples. This is
almost exactly what you did in your earlier email. Removing genes with <=40
counts in >=6 samples is probably overdoing it.
Example code (in the next release there will be a convenience function to
calculate counts per million easily from a DGEList object):

cpm <- 1e06*t( t(d$counts) / d$samples$lib.size )
d <- d[ rowSums( cpm > 0.5 ) >= 2, ]
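
(In more recent edgeR releases this convenience function exists as cpm(); a
sketch, assuming a recent edgeR rather than the 2.2.5 used in this thread:)

# cpm() computes counts per million directly from a DGEList
keep <- rowSums( cpm(d) > 0.5 ) >= 2
d <- d[keep, ]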
Mark provided me with a really nice reproducible example (thanks Mark!) that
illustrates how and why TMM normalization works and also why we do not expect
the library sizes of the pseudocounts to be equal after normalization factors
are applied. Try running the code below for yourself:
========================
library(edgeR)
f <- url("http://bioinf.wehi.edu.au/folders/tmm_rnaseq/LK_data.RData")
load(f)
close(f)
D <- as.matrix(MA.subsetA$M)
g <- as.character(MA.subsetA$genes$EnsemblGeneID)
source("http://bioinf.wehi.edu.au/folders/tmm_rnaseq/functions.R")
libSizes <- colSums(D)
foldDiff <- 3
pUp <- .8
pDifferential <- 0.1
xx <- generateDataset(commonTags=2e4, uniqueTags=c(3000,100),
                      foldDifference=foldDiff, pUp=pUp,
                      pDifferential=pDifferential, empiricalDist=D[,1],
                      libLimits=rep(1e6,2))
d <- DGEList(counts=xx$DATA, group=c(1,2))
d1 <- calcNormFactors(d)
par(mfrow=c(1,2))
plotSmear(d, ylim=c(-5,5))
plotSmear(d1, ylim=c(-5,5))
d <- estimateCommonDisp(d)
colSums(d$pseudo.alt)
d1 <- estimateCommonDisp(d1)
colSums(d1$pseudo.alt)
==========================
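
The point to notice in the output (as described above, not new results):
colSums(d$pseudo.alt), from the object without normalization factors, should
come out nearly equal across the two libraries, while colSums(d1$pseudo.alt),
with the TMM factors applied, should not.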
Hope that clarifies a few things for you.
Best wishes
Davis
P.S. In general it is good form not to re-post the same question again on the
list. We receive and read questions on the BioC mailing lists, but sometimes it
can be a few days before we can make time to respond.
On Jun 22, 2011, at 7:50 AM, yifang tan wrote:
Hi Davis/Gordon:
I am posting my question here again in the hope that you will see it.
When I tried edgeR, I ran into a problem with the pseudocount totals for each
library after normalization, which I expected to come out close to one
another. This is addressed in the edgeR documentation several times: "the
total counts in each library of the pseudocounts agrees well with the common
library size" (pages 27 & 44 of the User's Guide). However, my results are
quite different between treatments, although within a treatment the
pseudocount totals of the replicates are very similar. I cannot get to the
common.lib.size for each treatment, even after trying several methods (TMM,
RLE and quantile).
1) Did I miss anything during my run with edgeR? How can I make sure the
normalization went well?
2) Does it matter whether or not the normalized library sizes of the
conditions differ from the common.lib.size?
3) Are the results still meaningful even if the library sizes of the
pseudocounts are different?
4) What could be the reason(s) for the library sizes of the pseudocounts being
so different?
5) Should I remove the tags with smaller numbers of reads, as some other
people do? After I removed the genes with small counts (<=40 in >=6 out of 14
samples), the normalized library sizes became very similar.
I can feel my lack of mathematical background for these packages. I attach
part of my code and output here.
---------------------------------------------------------------------
d$samples$lib.size
# Zygote1   21012147
# Zygote2   19924212
# Octant1    9660245
# Octant2   26002900
# Globular1 17139388
# Globular2  7649319
# Heart1    16430105
# Heart2    20101956
# Torpedo1  12920266
# Torpedo2   6306742
# Bent1     44241095
# Bent2     20094409
# Mature1   15166090
# Mature2   23203758
d$common.lib.size
[1] 16554344.47
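# (If I read the edgeR code correctly, common.lib.size is the geometric mean
# of the library sizes, so the value above should also be given by:)
exp(mean(log(d$samples$lib.size)))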
colSums(d$pseudo.alt)
# Zygote1   21523774.62
# Zygote2   21638415.63
# Octant1   14533481.82
# Octant2   12046955.46
# Globular1 18920316.62
# Globular2 18439528.30
# Heart1    11754608.30
# Heart2    12759230.11
# Torpedo1  11248245.52
# Torpedo2  11410667.92
# Bent1     16101723.65
# Bent2     17980670.24
# Mature1   26785396.02
# Mature2   27067289.80
sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ALL_1.4.7 Biobase_2.12.1 limma_3.8.2 edgeR_2.2.5
loaded via a namespace (and not attached):
[1] tools_2.13.0
---------------------------------------------------------------------
Yifang
Yifang Tan
------------------------------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
[email protected]
http://www.wehi.edu.au
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing