On Mon, May 4, 2009 at 5:25 AM, Dan Bolser <[email protected]> wrote:
> 2009/5/4 Michael Lawrence <[email protected]>: > > > > > > On Mon, May 4, 2009 at 2:19 AM, Dan Bolser <[email protected]> wrote: > >> > >> 2009/4/23 Martin Morgan <[email protected]>: > >> > Dan Bolser <[email protected]> writes: > >> > > >> >> Sorry for the noob question, but is there a set of standard quality > >> >> checks in R that I can run over some 454 data? I have the fasta and > >> >> the fasta format quality files as well as an sff. I scanned the > manual > >> >> for the ShortReads package, but it seems focused on Illumina, I > >> >> couldn't pick out the general bits from the specifics. > >> > > >> > Hi Dan -- > >> > > >> > You might be on somewhat uncharted territory here; most of our > >> > experience (though we have some 454 data now) is with Solexa. > >> > > >> > I don't think the standard QA pipeline, along the lines of > >> > report(qa(<...>)), will work at the moment, but I'll try to add that > >> > today. > >> > > >> > You should be able to read the fasta and quality scores with > >> > read454(). This returns a 'ShortReadQ' object, srq, that bundles the > >> > reads, their quality scores, and their ids. > >> > >> Hi Martin, thanks for the info on this. I'm have only just got round > >> to looking at this but I'm a little bit confused about how to read in > >> the reads / qualities. > >> > >> I had expected to say something like: > >> > >> x <- read454(fasta = "my.fas", qual.fasta = "my.qual") > >> > >> > >> or > >> > >> x <- read454(srf = "my.srf") > >> > >> > >> or > >> > >> x <- read454(fastq = "my.fastq") > >> > >> > >> > >> However, this is clearly not correct. "?read454" brings up the > >> "RochePath-class" page that suggests I try something like: > >> > >> x <- RochePath(readPath="./") > >> y<- read454(x) > >> > >> > >> But that fails too... "no input files found / pattern: \.fna$". I > >> tried to set pattern somewhere (to match the ".fas" / and ".qual" > >> patterns of my files), but I couldn't seem to set them anywhere. > >> > >> Can you help with an example of what I am doing wrong? > > > > > > Thanks for this great feedback. One way to achieve what you want would be > to > > call readFasta and readQual separately for each file, and then combine > the > > results into a ShortReadQ. In the devel version of ShortRead (it might > take > > a day to propagate), I just renamed read454 to readFastaQual and added > > separate pattern arguments for the fasta and qual file, so you could do: > > srq <- readFastaQual(".", "my.fas", "my.qual") > > > > How does that sound? I think that this interface could be made more > > The above style of invocation sounds great, but I don't see why its not > simply, > > srq <- readFastaQual("./my.fas", "./my.qual") > > > > convenient (such as having separate pattern arguments for the extensions > and > > then a "base" pattern), but I'm just trying to stay consistent with what > is > > ShortRead already. > > > Its good to be consistent, but I don't see any other R function that > works like this... i.e. when I read.table I don't have to think about > 'patterns' of any kind. I realise that this can be convenient when you > have the experimental output directory, but I think its more often > (and more flexible) to just pass the file names directly like any > other R function. > Well, I'm not sure if it's more often. It seems people usually have a bunch of samples they want to read in sequentially, rather than explicitly picking one out by name. The API is flexible enough to handle that, but I agree that it's a bit inconvenient to have to worry about the directory. > > Anyway, as I said originally, I'm very new to ShortRead and BioC in > general, so it could be that I'm just not used to this style of > thinking. > > Thanks very much for your help developing these tools. > > Since you mentioned that currently things are not that finalised with > regard to 454 checks, I'll include the code that I have been using > below, in the hope that it could be of some small use to you. > Thanks a lot for these suggestions! > > > All the best, > Dan. > > > > source("~/BiO/Lang/R/readFASTA.R") > source("~/BiO/Lang/R/readQualityFASTA.R") > > > > if(!interactive()){ > fasta <- > readFASTA("Data/FUBJ29X06.fas") > } > > head(fasta,2) > length(fasta) > > > > ## Recreate the histogram provided by the sequencing center > > ## I just like this code ;-) > ##plot(table(sapply(fasta, function(l){nchar(l[[2]])}))) > > ## Do the above, but in a way that saves compute time later... > > read.lengths <- > sapply(fasta, function(l){nchar(l[[2]])}) > > length.table <- > table(read.lengths) > class(length.table) > > plot(as.integer(length.table), > main="Frequency plot of the read length", > xlab="Read length", > ylab="Frequency") > > > > ## How many reads does the read with the most number of reads have > (nx <- max(length.table )) > > ## OK, but which length is it? > (lx <- as.integer(names(length.table[length.table == nx]))) > > ## To be absolutely clear: > nx # number of reads > lx # of this length. > > > > > > > > ## Now do something with the quality scores > if(!interactive()){ > fasta.qual <- > readQualityFASTA("Data/FUBJ29X06.qual") > } > > head(fasta.qual,2) > length(fasta.qual) > > > > ## Plot one example 'quality profile'... > plot(fasta.qual[[6]][[2]], type="l", lwd=2, > main="Quality score for a read", > xlab="Base position", > ylab="Quality score" > ) > > > > ## Plot the mean quality score across all the nx reads of length lx > ## (see above). > > ## Get the indexes of the reads of this length > I <- > which(read.lengths == lx) > length(I) > > head(I) > tail(I) > > ## Just checking ;-) > stopifnot(length(I) == nx) > > > > ## Read the qualities of these reads into a matrix > > m <- matrix( 0, nrow=nx, ncol=lx ) > > m[1:6,1:6] > > for(i in 1:nx){ > ## Double check the index is giving us the reads we expect > stopifnot(length( fasta.qual[[ I[i] ]][[2]] ) == lx) > > m[i,] = fasta.qual[[ I[i] ]][[2]] > } > > m[1:6,1:6] > > > > ## Now we can easily calculate the mean quality over the set of reads > > plot(apply(m, 2, mean), type='l', lwd=2, > ylim=c(0,50), > main=paste("Mean quality score for the", nx, "reads of length", lx, > "bp"), > xlab="Base position", > ylab="Mean quality score", > ) > > lines(apply(m, 2, mean) + > apply(m, 2, sd), type='l', > ylim=c(0,40), col=2) > > lines(apply(m, 2, mean) - > apply(m, 2, sd), type='l', > ylim=c(0,40), col=2) > > abline(h=20, col=2, lty=2) > > legend(x="topright", inset=0.05, bg="white", > leg=expression("" %+-% 1 * "s.d."), col=2, lty=1) > > > > > > ## Plot some overall quality score statistics > > ## Reminder > class(fasta.qual) > class(fasta.qual[[1]]) > class(fasta.qual[[1]][[2]]) > > base.qualities <- > unlist(lapply(fasta.qual, function(l){l[[2]]})) > > length(base.qualities) > > ## Sanity check > stopifnot(sum(read.lengths) == length(base.qualities)) > > > > if(!interactive()){ > ecdf.base.qualities <- > ecdf(base.qualities) > } > > plot(ecdf.base.qualities, verticals = TRUE, > main="Empirical Cumulative Distribution Function", > xlab="Quality score", > ylab="Fraction of bases") > > abline(v=20, col=2, lty=2) > abline(h=ecdf.base.qualities(20), col=2, lty=2) > > abline(v=35, col=4, lty=4) > abline(h=ecdf.base.qualities(35), col=4, lty=4) > > > > read.qualities <- > sapply(fasta.qual, function(l){mean(l[[2]])}) > > hist(read.qualities, > main="Histogram of mean read quality score", > xlab="Mean quality score of a read", > ylab="Number of reads") > > ## Too dense > #rug(read.qualities) > > > > > > ## Plot mean quality against length > > plot(x=read.lengths, > y=read.qualities) > > cor(read.lengths, read.qualities) > cor.test(read.lengths, read.qualities) > cor.test(read.lengths, read.qualities, method="pearson") > > (my.lm <- lm(read.qualities~read.lengths)) > > summary(my.lm) > > abline(coef(my.lm), col=2) > > > ## Test that I understand the slope properly > > (tx1 <- 100) > (ty1 <- (2.683e+01 + tx1 * 8.974e-03)) > points(tx1, ty1, col=3, pch=4, cex=2, lwd=3) > > (tx2 <- 400) > (ty2 <- (2.683e+01 + tx2 * 8.974e-03)) > points(tx2, ty2, col=3, pch=4, cex=2, lwd=3) > > > > > > ## Lets filter that by read length... > > read.length.filter <- read.lengths > 200 > > > ## But look at this > (xx <- sum(read.lengths)) > (yy <- sum(read.lengths[read.length.filter])) > > round((xx - yy) / xx * 100, 1) > > > base.qualities.filtered <- > unlist(lapply(fasta.qual[read.length.filter], function(l){l[[2]]})) > > length(base.qualities) == xx > length(base.qualities.filtered) == yy > > my.breaks <- > c( 0, > 2, 4, 6, 8, 10, > 12, 14, 16, 18, 20, > 22, 24, 26, 28, 30, > 32, 34, 36, 38, 40) > > q.t <- table(base.qualities) > q.f.t <- table(base.qualities.filtered) > > length(q.t) > length(q.f.t) > > > > barplot(rbind(q.t, q.f.t), beside=TRUE, > main="Distribution of base qualities", > xlab="Quality score", > ylab="Number of bases") > > legend(x="topleft", inset=0.05, > fill=c(grey(0.4), grey(0.9)), > leg=c( > "full set", > "length filtered") > ) > > -------- > > --- /local/Scratch/dbolser/BiO/Lang/R/readFASTA.R 2008-08-26 > 12:42:39.000000000 +0100 > +++ /local/Scratch/dbolser/BiO/Lang/R/readQualityFASTA.R > 2008-08-26 13:11:34.000000000 +0100 > @@ -1,5 +1,5 @@ > > -readFASTA <- > +readQualityFASTA <- > function (file, checkComments = TRUE, strip.desc = TRUE) > { > if (missing(strip.desc)) > @@ -37,7 +37,8 @@ > desc <- s1[descriptions[i]] > if (strip.desc) > desc <- substr(desc, 2L, nchar(desc)) > - seq <- paste(s1[dp[i]:end[i]], collapse = "") > + seq <- paste(s1[dp[i]:end[i]], collapse = " ") > + seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]]) > list(desc = desc, seq = seq) > }) > } > > -------- > > readFASTA <- > function (file, checkComments = TRUE, strip.desc = TRUE) > { > if (missing(strip.desc)) > warning("use 'strip.desc=FALSE' for compatibility with old version\n", > " of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n", > " marking the beginning of the description lines and get\n", > " rid of this warning (see '?readFASTA' for more details)") > if (is.character(file)) { > file <- file(file, "r") > on.exit(close(file)) > } > else { > if (!inherits(file, "connection")) > stop("'file' must be a character string or connection") > if (!isOpen(file)) { > open(file, "r") > on.exit(close(file)) > } > } > s1 <- scan(file = file, what = "", sep = "\n", quote = "", > allowEscapes = FALSE) > if (checkComments) { > comments <- grep("^;", s1) > if (length(comments) > 0) > s1 <- s1[-comments] > } > descriptions <- which(substr(s1, 1L, 1L) == ">") > numF <- length(descriptions) > if (numF == 0) > stop("no FASTA sequences found") > dp <- descriptions + 1L > dm <- descriptions - 1L > end <- c(dm[-1], length(s1)) > lapply(seq_len(numF), function(i) { > desc <- s1[descriptions[i]] > if (strip.desc) > desc <- substr(desc, 2L, nchar(desc)) > seq <- paste(s1[dp[i]:end[i]], collapse = "") > list(desc = desc, seq = seq) > }) > } > > -------- > > readQualityFASTA <- > function (file, checkComments = TRUE, strip.desc = TRUE) > { > if (missing(strip.desc)) > warning("use 'strip.desc=FALSE' for compatibility with old version\n", > " of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n", > " marking the beginning of the description lines and get\n", > " rid of this warning (see '?readFASTA' for more details)") > if (is.character(file)) { > file <- file(file, "r") > on.exit(close(file)) > } > else { > if (!inherits(file, "connection")) > stop("'file' must be a character string or connection") > if (!isOpen(file)) { > open(file, "r") > on.exit(close(file)) > } > } > s1 <- scan(file = file, what = "", sep = "\n", quote = "", > allowEscapes = FALSE) > if (checkComments) { > comments <- grep("^;", s1) > if (length(comments) > 0) > s1 <- s1[-comments] > } > descriptions <- which(substr(s1, 1L, 1L) == ">") > numF <- length(descriptions) > if (numF == 0) > stop("no FASTA sequences found") > dp <- descriptions + 1L > dm <- descriptions - 1L > end <- c(dm[-1], length(s1)) > lapply(seq_len(numF), function(i) { > desc <- s1[descriptions[i]] > if (strip.desc) > desc <- substr(desc, 2L, nchar(desc)) > seq <- paste(s1[dp[i]:end[i]], collapse = " ") > seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]]) > list(desc = desc, seq = seq) > }) > } > > -------- > > > > > Michael > > > >> > >> > >> How come I need to specify a directory and a pattern when what I > >> really want to do is just to specify a file (because that is what I > >> have)? > >> > >> > >> Thanks for any help. > >> > >> All the best, > >> Dan. > >> > >> > >> > >> > The basic touch points of the qa report for read (i.e., not aligned) > >> > data are numbers of reads, nucleotide frequencies > >> > > >> > alphabetFrequency(sread(srq), baseOnly=TRUE, collapse=TRUE) > >> > > >> > and cycle-specific alphabet frequencies and average quality scores > >> > (use alphabetByCycle on sread(srq) and quality(srq)). For 454 it seems > >> > like a simple plot of average quality score, along the lines of > >> > alphabetScore(quality(srq)) / width(quality(srq)) against > >> > width(quality(srq)) can also be quite insightful. There might be > >> > issues where the functions expect / it makes sense to do analysis on > >> > uniform-width reads, or on groups of uniformly-widthed reads. > >> > > >> > Sorry for the only limited help. > >> > > >> > Martin > >> > > >> > > >> >> Thanks for any help, > >> >> Dan. > >> >> > >> >> _______________________________________________ > >> >> Bioc-sig-sequencing mailing list > >> >> [email protected] > >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > >> > > >> > -- > >> > Martin Morgan > >> > Computational Biology / Fred Hutchinson Cancer Research Center > >> > 1100 Fairview Ave. N. > >> > PO Box 19024 Seattle, WA 98109 > >> > > >> > Location: Arnold Building M1 B861 > >> > Phone: (206) 667-2793 > >> > > >> > >> _______________________________________________ > >> Bioc-sig-sequencing mailing list > >> [email protected] > >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > > > > [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
