Re: [Bioc-sig-seq] Standard 454 quality checks?

Michael Lawrence Mon, 04 May 2009 05:53:37 -0700

On Mon, May 4, 2009 at 5:25 AM, Dan Bolser <[email protected]> wrote:


> 2009/5/4 Michael Lawrence <[email protected]>:
> >
> >
> > On Mon, May 4, 2009 at 2:19 AM, Dan Bolser <[email protected]> wrote:
> >>
> >> 2009/4/23 Martin Morgan <[email protected]>:
> >> > Dan Bolser <[email protected]> writes:
> >> >
> >> >> Sorry for the noob question, but is there a set of standard quality
> >> >> checks in R that I can run over some 454 data? I have the fasta and
> >> >> the fasta format quality files as well as an sff. I scanned the
> manual
> >> >> for the ShortReads package, but it seems focused on Illumina, I
> >> >> couldn't pick out the general bits from the specifics.
> >> >
> >> > Hi Dan --
> >> >
> >> > You might be on somewhat uncharted territory here; most of our
> >> > experience (though we have some 454 data now) is with Solexa.
> >> >
> >> > I don't think the standard QA pipeline, along the lines of
> >> > report(qa(<...>)), will work at the moment, but I'll try to add that
> >> > today.
> >> >
> >> > You should be able to read the fasta and quality scores with
> >> > read454(). This returns a 'ShortReadQ' object, srq, that bundles the
> >> > reads, their quality scores, and their ids.
> >>
> >> Hi Martin, thanks for the info on this. I'm have only just got round
> >> to looking at this but I'm a little bit confused about how to read in
> >> the reads / qualities.
> >>
> >> I had expected to say something like:
> >>
> >> x <- read454(fasta = "my.fas", qual.fasta = "my.qual")
> >>
> >>
> >> or
> >>
> >> x <- read454(srf = "my.srf")
> >>
> >>
> >> or
> >>
> >> x <- read454(fastq = "my.fastq")
> >>
> >>
> >>
> >> However, this is clearly not correct. "?read454" brings up the
> >> "RochePath-class" page that suggests I try something like:
> >>
> >> x <- RochePath(readPath="./")
> >> y<- read454(x)
> >>
> >>
> >> But that fails too... "no input files found / pattern: \.fna$". I
> >> tried to set pattern somewhere (to match the ".fas" / and ".qual"
> >> patterns of my files), but I couldn't seem to set them anywhere.
> >>
> >> Can you help with an example of what I am doing wrong?
> >
> >
> > Thanks for this great feedback. One way to achieve what you want would be
> to
> > call readFasta and readQual separately for each file, and then combine
> the
> > results into a ShortReadQ. In the devel version of ShortRead (it might
> take
> > a day to propagate), I just renamed read454 to readFastaQual and added
> > separate pattern arguments for the fasta and qual file, so you could do:
> > srq <- readFastaQual(".", "my.fas", "my.qual")
> >
> > How does that sound? I think that this interface could be made more
>
> The above style of invocation sounds great, but I don't see why its not
> simply,
>
> srq <- readFastaQual("./my.fas", "./my.qual")
>
>
> > convenient (such as having separate pattern arguments for the extensions
> and
> > then a "base" pattern), but I'm just trying to stay consistent with what
> is
> > ShortRead already.
>
>
> Its good to be consistent, but I don't see any other R function that
> works like this... i.e. when I read.table I don't have to think about
> 'patterns' of any kind. I realise that this can be convenient when you
> have the experimental output directory, but I think its more often
> (and more flexible) to just pass the file names directly like any
> other R function.
>

Well, I'm not sure if it's more often. It seems people usually have a bunch
of samples they want to read in sequentially, rather than explicitly picking
one out by name. The API is flexible enough to handle that, but I agree that
it's a bit inconvenient to have to worry about the directory.


>
> Anyway, as I said originally, I'm very new to ShortRead and BioC in
> general, so it could be that I'm just not used to this style of
> thinking.
>
> Thanks very much for your help developing these tools.
>
> Since you mentioned that currently things are not that finalised with
> regard to 454 checks, I'll include the code that I have been using
> below, in the hope that it could be of some small use to you.
>

Thanks a lot for these suggestions!


>
>
> All the best,
> Dan.
>
>
>
> source("~/BiO/Lang/R/readFASTA.R")
> source("~/BiO/Lang/R/readQualityFASTA.R")
>
>
>
> if(!interactive()){
>  fasta <-
>    readFASTA("Data/FUBJ29X06.fas")
> }
>
> head(fasta,2)
> length(fasta)
>
>
>
> ## Recreate the histogram provided by the sequencing center
>
> ## I just like this code ;-)
> ##plot(table(sapply(fasta, function(l){nchar(l[[2]])})))
>
> ## Do the above, but in a way that saves compute time later...
>
> read.lengths <-
>  sapply(fasta, function(l){nchar(l[[2]])})
>
> length.table <-
>  table(read.lengths)
> class(length.table)
>
> plot(as.integer(length.table),
>     main="Frequency plot of the read length",
>     xlab="Read length",
>     ylab="Frequency")
>
>
>
> ## How many reads does the read with the most number of reads have
> (nx <- max(length.table ))
>
> ## OK, but which length is it?
> (lx <- as.integer(names(length.table[length.table == nx])))
>
> ## To be absolutely clear:
> nx # number of reads
> lx # of this length.
>
>
>
>
>
>
>
> ## Now do something with the quality scores
> if(!interactive()){
>  fasta.qual <-
>    readQualityFASTA("Data/FUBJ29X06.qual")
> }
>
> head(fasta.qual,2)
> length(fasta.qual)
>
>
>
> ## Plot one example 'quality profile'...
> plot(fasta.qual[[6]][[2]], type="l", lwd=2,
>     main="Quality score for a read",
>     xlab="Base position",
>     ylab="Quality score"
>     )
>
>
>
> ## Plot the mean quality score across all the nx reads of length lx
> ## (see above).
>
> ## Get the indexes of the reads of this length
> I <-
>  which(read.lengths == lx)
> length(I)
>
> head(I)
> tail(I)
>
> ## Just checking ;-)
> stopifnot(length(I) == nx)
>
>
>
> ## Read the qualities of these reads into a matrix
>
> m <- matrix( 0, nrow=nx, ncol=lx )
>
> m[1:6,1:6]
>
> for(i in 1:nx){
>  ## Double check the index is giving us the reads we expect
>  stopifnot(length( fasta.qual[[ I[i] ]][[2]] ) == lx)
>
>  m[i,] = fasta.qual[[ I[i] ]][[2]]
> }
>
> m[1:6,1:6]
>
>
>
> ## Now we can easily calculate the mean quality over the set of reads
>
> plot(apply(m, 2, mean), type='l', lwd=2,
>     ylim=c(0,50),
>     main=paste("Mean quality score for the", nx, "reads of length", lx,
> "bp"),
>     xlab="Base position",
>     ylab="Mean quality score",
>     )
>
> lines(apply(m, 2, mean) +
>      apply(m, 2,   sd), type='l',
>      ylim=c(0,40), col=2)
>
> lines(apply(m, 2, mean) -
>      apply(m, 2,   sd), type='l',
>      ylim=c(0,40), col=2)
>
> abline(h=20, col=2, lty=2)
>
> legend(x="topright", inset=0.05, bg="white",
>       leg=expression("" %+-% 1 * "s.d."), col=2, lty=1)
>
>
>
>
>
> ## Plot some overall quality score statistics
>
> ## Reminder
> class(fasta.qual)
> class(fasta.qual[[1]])
> class(fasta.qual[[1]][[2]])
>
> base.qualities <-
>  unlist(lapply(fasta.qual, function(l){l[[2]]}))
>
> length(base.qualities)
>
> ## Sanity check
> stopifnot(sum(read.lengths) == length(base.qualities))
>
>
>
> if(!interactive()){
>  ecdf.base.qualities <-
>    ecdf(base.qualities)
> }
>
> plot(ecdf.base.qualities, verticals = TRUE,
>     main="Empirical Cumulative Distribution Function",
>     xlab="Quality score",
>     ylab="Fraction of bases")
>
> abline(v=20, col=2, lty=2)
> abline(h=ecdf.base.qualities(20), col=2, lty=2)
>
> abline(v=35, col=4, lty=4)
> abline(h=ecdf.base.qualities(35), col=4, lty=4)
>
>
>
> read.qualities <-
>  sapply(fasta.qual, function(l){mean(l[[2]])})
>
> hist(read.qualities,
>     main="Histogram of mean read quality score",
>     xlab="Mean quality score of a read",
>     ylab="Number of reads")
>
> ## Too dense
> #rug(read.qualities)
>
>
>
>
>
> ## Plot mean quality against length
>
> plot(x=read.lengths,
>     y=read.qualities)
>
> cor(read.lengths, read.qualities)
> cor.test(read.lengths, read.qualities)
> cor.test(read.lengths, read.qualities, method="pearson")
>
> (my.lm <- lm(read.qualities~read.lengths))
>
> summary(my.lm)
>
> abline(coef(my.lm), col=2)
>
>
> ## Test that I understand the slope properly
>
> (tx1 <- 100)
> (ty1 <- (2.683e+01 + tx1 * 8.974e-03))
> points(tx1, ty1, col=3, pch=4, cex=2, lwd=3)
>
> (tx2 <- 400)
> (ty2 <- (2.683e+01 + tx2 * 8.974e-03))
> points(tx2, ty2, col=3, pch=4, cex=2, lwd=3)
>
>
>
>
>
> ## Lets filter that by read length...
>
> read.length.filter <- read.lengths > 200
>
>
> ## But look at this
> (xx <- sum(read.lengths))
> (yy <- sum(read.lengths[read.length.filter]))
>
> round((xx - yy) / xx * 100, 1)
>
>
> base.qualities.filtered <-
>  unlist(lapply(fasta.qual[read.length.filter], function(l){l[[2]]}))
>
> length(base.qualities) == xx
> length(base.qualities.filtered) == yy
>
> my.breaks <-
>  c( 0,
>    2,  4,  6,  8, 10,
>    12, 14, 16, 18, 20,
>    22, 24, 26, 28, 30,
>    32, 34, 36, 38, 40)
>
> q.t <- table(base.qualities)
> q.f.t <- table(base.qualities.filtered)
>
> length(q.t)
> length(q.f.t)
>
>
>
> barplot(rbind(q.t, q.f.t), beside=TRUE,
>        main="Distribution of base qualities",
>        xlab="Quality score",
>        ylab="Number of bases")
>
> legend(x="topleft", inset=0.05,
>       fill=c(grey(0.4), grey(0.9)),
>       leg=c(
>         "full set",
>         "length filtered")
>       )
>
> --------
>
> --- /local/Scratch/dbolser/BiO/Lang/R/readFASTA.R       2008-08-26
> 12:42:39.000000000 +0100
> +++ /local/Scratch/dbolser/BiO/Lang/R/readQualityFASTA.R
> 2008-08-26 13:11:34.000000000 +0100
> @@ -1,5 +1,5 @@
>
> -readFASTA <-
> +readQualityFASTA <-
>   function (file, checkComments = TRUE, strip.desc = TRUE)
>  {
>   if (missing(strip.desc))
> @@ -37,7 +37,8 @@
>     desc <- s1[descriptions[i]]
>     if (strip.desc)
>       desc <- substr(desc, 2L, nchar(desc))
> -    seq <- paste(s1[dp[i]:end[i]], collapse = "")
> +    seq <- paste(s1[dp[i]:end[i]], collapse = " ")
> +    seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]])
>     list(desc = desc, seq = seq)
>   })
>  }
>
> --------
>
> readFASTA <-
>  function (file, checkComments = TRUE, strip.desc = TRUE)
> {
>  if (missing(strip.desc))
>    warning("use 'strip.desc=FALSE' for compatibility with old version\n",
>            "  of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n",
>            "  marking the beginning of the description lines and get\n",
>            "  rid of this warning (see '?readFASTA' for more details)")
>  if (is.character(file)) {
>    file <- file(file, "r")
>    on.exit(close(file))
>  }
>  else {
>    if (!inherits(file, "connection"))
>      stop("'file' must be a character string or connection")
>    if (!isOpen(file)) {
>      open(file, "r")
>      on.exit(close(file))
>    }
>  }
>  s1 <- scan(file = file, what = "", sep = "\n", quote = "",
>             allowEscapes = FALSE)
>  if (checkComments) {
>    comments <- grep("^;", s1)
>    if (length(comments) > 0)
>      s1 <- s1[-comments]
>  }
>  descriptions <- which(substr(s1, 1L, 1L) == ">")
>  numF <- length(descriptions)
>  if (numF == 0)
>    stop("no FASTA sequences found")
>  dp <- descriptions + 1L
>  dm <- descriptions - 1L
>  end <- c(dm[-1], length(s1))
>  lapply(seq_len(numF), function(i) {
>    desc <- s1[descriptions[i]]
>    if (strip.desc)
>      desc <- substr(desc, 2L, nchar(desc))
>    seq <- paste(s1[dp[i]:end[i]], collapse = "")
>    list(desc = desc, seq = seq)
>  })
> }
>
> --------
>
> readQualityFASTA <-
>  function (file, checkComments = TRUE, strip.desc = TRUE)
> {
>  if (missing(strip.desc))
>    warning("use 'strip.desc=FALSE' for compatibility with old version\n",
>            "  of readFASTA() or 'strip.desc=TRUE' to remove the \">\"\n",
>            "  marking the beginning of the description lines and get\n",
>            "  rid of this warning (see '?readFASTA' for more details)")
>  if (is.character(file)) {
>    file <- file(file, "r")
>    on.exit(close(file))
>  }
>  else {
>    if (!inherits(file, "connection"))
>      stop("'file' must be a character string or connection")
>    if (!isOpen(file)) {
>      open(file, "r")
>      on.exit(close(file))
>    }
>  }
>  s1 <- scan(file = file, what = "", sep = "\n", quote = "",
>             allowEscapes = FALSE)
>  if (checkComments) {
>    comments <- grep("^;", s1)
>    if (length(comments) > 0)
>      s1 <- s1[-comments]
>  }
>  descriptions <- which(substr(s1, 1L, 1L) == ">")
>  numF <- length(descriptions)
>  if (numF == 0)
>    stop("no FASTA sequences found")
>  dp <- descriptions + 1L
>  dm <- descriptions - 1L
>  end <- c(dm[-1], length(s1))
>  lapply(seq_len(numF), function(i) {
>    desc <- s1[descriptions[i]]
>    if (strip.desc)
>      desc <- substr(desc, 2L, nchar(desc))
>    seq <- paste(s1[dp[i]:end[i]], collapse = " ")
>    seq <- as.numeric(strsplit(seq, "\\s+", perl=TRUE)[[1]])
>    list(desc = desc, seq = seq)
>  })
> }
>
> --------
>
>
>
> > Michael
> >
> >>
> >>
> >> How come I need to specify a directory and a pattern when what I
> >> really want to do is just to specify a file (because that is what I
> >> have)?
> >>
> >>
> >> Thanks for any help.
> >>
> >> All the best,
> >> Dan.
> >>
> >>
> >>
> >> > The basic touch points of the qa report for read (i.e., not aligned)
> >> > data are numbers of reads, nucleotide frequencies
> >> >
> >> >  alphabetFrequency(sread(srq), baseOnly=TRUE, collapse=TRUE)
> >> >
> >> > and cycle-specific alphabet frequencies and average quality scores
> >> > (use alphabetByCycle on sread(srq) and quality(srq)). For 454 it seems
> >> > like a simple plot of average quality score, along the lines of
> >> > alphabetScore(quality(srq)) / width(quality(srq)) against
> >> > width(quality(srq)) can also be quite insightful. There might be
> >> > issues where the functions expect / it makes sense to do analysis on
> >> > uniform-width reads, or on groups of uniformly-widthed reads.
> >> >
> >> > Sorry for the only limited help.
> >> >
> >> > Martin
> >> >
> >> >
> >> >> Thanks for any help,
> >> >> Dan.
> >> >>
> >> >> _______________________________________________
> >> >> Bioc-sig-sequencing mailing list
> >> >> [email protected]
> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >> >
> >> > --
> >> > Martin Morgan
> >> > Computational Biology / Fred Hutchinson Cancer Research Center
> >> > 1100 Fairview Ave. N.
> >> > PO Box 19024 Seattle, WA 98109
> >> >
> >> > Location: Arnold Building M1 B861
> >> > Phone: (206) 667-2793
> >> >
> >>
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> [email protected]
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >
> >
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Standard 454 quality checks?

Reply via email to