Author: tille
Date: 2015-07-17 21:14:54 +0000 (Fri, 17 Jul 2015)
New Revision: 19631

Added:
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series
Modified:
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog
Log:
New upstream version (+commit patches subdir)

Modified: trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog
===================================================================
--- trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog 2015-07-17 19:14:07 UTC (rev 19630)
+++ trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog 2015-07-17 21:14:54 UTC (rev 19631)
@@ -1,3 +1,9 @@
+r-bioc-bsgenome (1.36.2-1) unstable; urgency=medium
+
+  * New upstream version
+
+ -- Andreas Tille <[email protected]>  Fri, 17 Jul 2015 22:50:12 +0200
+
 r-bioc-bsgenome (1.36.1-1) unstable; urgency=medium
 
   * New upstream version

Added: trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch
===================================================================
--- trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch (rev 0)
+++ trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch 2015-07-17 21:14:54 UTC (rev 19631)
@@ -0,0 +1,511 @@
+Author: Andreas Tille <[email protected]>
+Last-Update: Tue, 21 Oct 2014 05:22:51 +0200
+Description: In the autopkgtest we are trying to reproduce the creation of
+ the documentation but some paragraphs are failing due to not yet packaged
+ data packages. These paragraphs are deleted for the moment to enable
+ successful testing.
+ .
+ The fact that also DNAStringSet() does not work should be furtherly
+ investigated by some BioConductor experts.
+
+
+--- a/vignettes/GenomeSearching.Rnw
++++ b/vignettes/GenomeSearching.Rnw
+@@ -107,135 +107,7 @@ The BSgenome data package for the ce2 ge
+ available in Bioconductor but they could be added if there is demand for
+ them.
+ 
+-See \Rfunction{?available.genomes} for how to install
+-\Rpackage{BSgenome.Celegans.UCSC.ce2}.
+-Then load the package and display the single object defined in it:
+-<<b1>>=
+-library(BSgenome.Celegans.UCSC.ce2)
+-ls("package:BSgenome.Celegans.UCSC.ce2")
+-genome <- BSgenome.Celegans.UCSC.ce2
+-genome
+-@
+-
+-\Robject{genome} is a \Rclass{BSgenome} object:
+-<<b2>>=
+-class(genome)
+-@
+-
+-When displayed, some basic information about the origin of the
+-genome is shown (organism, provider, provider version, etc...)
+-followed by the index of {\it single} sequences and eventually
+-an additional index of {\it multiple} sequences.
+-Methods (adequately called {\it accessor methods}) are defined
+-for individual access to this information:
+-<<b3>>=
+-organism(genome)
+-provider(genome)
+-providerVersion(genome)
+-seqnames(genome)
+-mseqnames(genome)
+-@
+-
+-See the man page for the \Rclass{BSgenome} class (\Rfunction{?BSgenome})
+-for a complete list of accessor methods and their descriptions.
+-
+-Now we are ready to display chromosome I:
+-<<b4>>=
+-genome$chrI
+-@
+-
+-Note that this chrI sequence corresponds to the {\it forward} strand
+-(aka {\it direct} or {\it sense} or {\it positive} or {\it plus} strand)
+-of chromosome I.
+-UCSC, and genome providers in general, don't provide files containing the
+-nucleotide sequence of the {\it reverse} strand (aka {\it indirect}
+-or {\it antisense} or {\it negative} or {\it minus} or {\it opposite} strand)
+-of the chromosomes because these sequences can be deduced from the {\it forward}
+-sequences by taking their reverse complements.
+-The BSgenome data packages are no exceptions: they only
+-provide the {\it forward} strand sequence of every chromosome.
+-See \Rfunction{?reverseComplement} for more details about the reverse
+-complement of a \Rclass{DNAString} object.
+-It is important to remember that, in practice, the {\it reverse} strand
+-sequence is almost never needed.
+-The reason is that, in fact, a {\it reverse} strand analysis
+-can (and should) always be transposed into a {\it forward} strand analysis.
+-Therefore trying to compute the {\it reverse} strand sequence of an entire
+-chromosome by applying \Rfunction{reverseComplement} to its {\it forward}
+-strand sequence is almost always a bad idea.
+-See the {\it Finding an arbitrary nucleotide pattern in an entire genome} section
+-of this document for how to find arbitrary patterns in the {\it reverse} strand
+-of a chromosome.
+-
+-% It seems like this page http://www.medterms.com/script/main/art.asp?articlekey=20468
+-% is lying about the noncoding (or coding, they are in fact contradicting themselves)
+-% nature of the sense and antisense strands.
+-
+-The number of bases in this sequence can be retrieved with:
+-<<b5>>=
+-chrI <- genome$chrI
+-length(chrI)
+-@
+-
+-Some basic stats:
+-<<b6>>=
+-afI <- alphabetFrequency(chrI)
+-afI
+-sum(afI) == length(chrI)
+-@
+-
+-Count all {\it exact} matches of pattern \Robject{"ACCCAGGGC"}:
+-<<b7>>=
+-p1 <- "ACCCAGGGC"
+-countPattern(p1, chrI)
+-@
+-
+-Like most pattern matching functions in \Rpackage{Biostrings},
+-the \Rfunction{countPattern} and \Rfunction{matchPattern} functions
+-support {\it inexact} matching. One form of inexact matching is to
+-allow a few mismatching letters per match. Here we allow at most one:
+-<<b8>>=
+-countPattern(p1, chrI, max.mismatch=1)
+-@
+-
+-With the \Rfunction{matchPattern} function, the locations of the matches are
+-stored in an \Rclass{XStringViews} object:
+-<<b9>>=
+-m1 <- matchPattern(p1, chrI, max.mismatch=1)
+-m1[4:6]
+-class(m1)
+-@
+-
+-The \Rfunction{mismatch} function (new in \Rpackage{Biostrings}~2)
+-returns the positions of the mismatching letters for each match:
+-<<b10>>=
+-mismatch(p1, m1[4:6])
+-@
+-
+-Note: The \Rfunction{mismatch} method is in fact a particular case
+-of a (vectorized) {\it alignment} function where only ``replacements''
+-are allowed. Current implementation is slow but this will be addressed.
+-
+-It may happen that a match is {\it out of limits} like in this example:
+-<<b11>>=
+-p2 <- DNAString("AAGCCTAAGCCTAAGCCTAA")
+-m2 <- matchPattern(p2, chrI, max.mismatch=2)
+-m2[1:4]
+-p2 == m2[1:4]
+-mismatch(p2, m2[1:4])
+-@
+-
+-The list of exact matches and the list of inexact matches
+-can both be obtained with:
+-<<b12,results=hide>>=
+-m2[p2 == m2]
+-m2[p2 != m2]
+-@
+-
+-Note that the length of \Robject{m2[p2 == m2]} should be
+-equal to \Robject{countPattern(p2, chrI, max.mismatch=0)}.
+-
+-
++% DELETED
+ 
+ % ---------------------------------------------------------------------------
+ 
+@@ -297,158 +169,7 @@ More precisely, here is the analysis we
+ 
+ \end{itemize}
+ 
+-Let's start by loading the input dictionary with:
+-<<c1>>=
+-ce2dict0_file <- system.file("extdata", "ce2dict0.fa", package="BSgenome")
+-ce2dict0 <- readDNAStringSet(ce2dict0_file, "fasta")
+-ce2dict0
+-@
+-
+-Here is how we can write the functions that will perform our analysis:
+-<<c2>>=
+-writeHits <- function(seqname, matches, strand, file="", append=FALSE)
+-{
+-    if (file.exists(file) && !append)
+-        warning("existing file ", file, " will be overwritten with 'append=FALSE'")
+-    if (!file.exists(file) && append)
+-        warning("new file ", file, " will have no header with 'append=TRUE'")
+-    hits <- data.frame(seqname=rep.int(seqname, length(matches)),
+-                       start=start(matches),
+-                       end=end(matches),
+-                       strand=rep.int(strand, length(matches)),
+-                       patternID=names(matches),
+-                       check.names=FALSE)
+-    write.table(hits, file=file, append=append, quote=FALSE, sep="\t",
+-                row.names=FALSE, col.names=!append)
+-}
+-
+-runAnalysis1 <- function(dict0, outfile="")
+-{
+-    library(BSgenome.Celegans.UCSC.ce2)
+-    genome <- BSgenome.Celegans.UCSC.ce2
+-    seqnames <- seqnames(genome)
+-    seqnames_in1string <- paste(seqnames, collapse=", ")
+-    cat("Target:", providerVersion(genome),
+-        "chromosomes", seqnames_in1string, "\n")
+-    append <- FALSE
+-    for (seqname in seqnames) {
+-        subject <- genome[[seqname]]
+-        cat(">>> Finding all hits in chromosome", seqname, "...\n")
+-        for (i in seq_len(length(dict0))) {
+-            patternID <- names(dict0)[i]
+-            pattern <- dict0[[i]]
+-            plus_matches <- matchPattern(pattern, subject)
+-            names(plus_matches) <- rep.int(patternID, length(plus_matches))
+-            writeHits(seqname, plus_matches, "+", file=outfile, append=append)
+-            append <- TRUE
+-            rcpattern <- reverseComplement(pattern)
+-            minus_matches <- matchPattern(rcpattern, subject)
+-            names(minus_matches) <- rep.int(patternID, length(minus_matches))
+-            writeHits(seqname, minus_matches, "-", file=outfile, append=append)
+-        }
+-        cat(">>> DONE\n")
+-    }
+-}
+-@
+-
+-Some important notes about the implementation of the \Rfunction{runAnalysis1}
+-function:
+-\begin{itemize}
+-  \item{}
+-    \Robject{subject <- genome[[seqname]]} is the code that actually loads a
+-    chromosome sequence into memory.
+-    Using only one sequence at a time is a good practice to avoid memory
+-    allocation problems on a machine with a limited amount of memory.
+-    For example, loading all the human chromosome sequences in memory would
+-    require more than 3GB of memory!
+-
+-  \item{}
+-    We have 2 nested \Robject{for} loops: the outer loop walks thru the
+-    target (7 chromosomes) and the inner loop walks thru the set of
+-    patterns. Doing the other way around would be very inefficient,
+-    especially with a bigger number of patterns because this would require
+-    to load each chromosome sequence into memory as many times as the
+-    number of patterns.
+-    \Rfunction{runAnalysis1} loads each sequence only once.
+-
+-  \item{}
+-    We find the matches in the minus strand (\Robject{minus_matches}) by
+-    first taking the reverse complement of the current pattern (with
+-    \Robject{rcpattern <- reverseComplement(pattern)}) and NOT by
+-    taking the reverse complement of the current subject.
+-\end{itemize}
+-
+-Now we are ready to run the analysis and put the results in the
+-\Robject{"ce2dict0_ana1.txt"} file:
+-<<c3>>=
+-runAnalysis1(ce2dict0, outfile="ce2dict0_ana1.txt")
+-@
+-
+-Here is some very simple example of post analysis:
+-\begin{itemize}
+-  \item{}
+-Get the total number of hits:
+-<<c4>>=
+-hits1 <- read.table("ce2dict0_ana1.txt", header=TRUE)
+-nrow(hits1)
+-@
+-  \item{}
+-Get the number of hits per chromosome:
+-<<c5>>=
+-table(hits1$seqname)
+-@
+-  \item{}
+-Get the number of hits per pattern:
+-<<c6>>=
+-hits1_table <- table(hits1$patternID)
+-hits1_table
+-@
+-  \item{}
+-Get the pattern(s) with the higher number of hits:
+-<<c7>>=
+-hits1_table[hits1_table == max(hits1_table)] # pattern(s) with more hits
+-@
+-  \item{}
+-Get the pattern(s) with no hits:
+-<<c8>>=
+-setdiff(names(ce2dict0), hits1$patternID) # pattern(s) with no hits
+-@
+-  \item{}
+-And finally a function that can be used to plot the hits:
+-<<c9>>=
+-plotGenomeHits <- function(bsgenome, seqnames, hits)
+-{
+-    chrlengths <- seqlengths(bsgenome)[seqnames]
+-    XMAX <- max(chrlengths)
+-    YMAX <- length(seqnames)
+-    plot.new()
+-    plot.window(c(1, XMAX), c(0, YMAX))
+-    axis(1)
+-    axis(2, at=seq_len(length(seqnames)), labels=rev(seqnames), tick=FALSE, las=1)
+-    ## Plot the chromosomes
+-    for (i in seq_len(length(seqnames)))
+-        lines(c(1, chrlengths[i]), c(YMAX + 1 - i, YMAX + 1 - i), type="l")
+-    ## Plot the hits
+-    for (i in seq_len(nrow(hits))) {
+-        seqname <- hits$seqname[i]
+-        y0 <- YMAX + 1 - match(seqname, seqnames)
+-        if (hits$strand[i] == "+") {
+-            y <- y0 + 0.05
+-            col <- "red"
+-        } else {
+-            y <- y0 - 0.05
+-            col <- "blue"
+-        }
+-        lines(c(hits$start[i], hits$end[i]), c(y, y), type="l", col=col, lwd=3)
+-    }
+-}
+-@
+-Plot the hits found by \Rfunction{runAnalysis1} with:
+-<<c10,eval=false>>=
+-plotGenomeHits(genome, seqnames(genome), hits1)
+-@
+-\end{itemize}
+-
++% DELETED
+ 
+ 
+ % ---------------------------------------------------------------------------
+@@ -466,25 +187,8 @@ that actually implements the fast search
+ So if you need to reuse the same pattern a high number of times,
+ it's a good idea to convert it {\it before} to pass it to the
+ \Rmethod{matchPattern} or \Rmethod{countPattern} method.
+-This way the conversion is done only once:
+-<<d1>>=
+-library(hgu95av2probe)
+-tmpseq <- DNAStringSet(hgu95av2probe$sequence)
+-someStats <- function(v)
+-{
+-    GC <- DNAString("GC")
+-    CG <- DNAString("CG")
+-    sapply(seq_len(length(v)),
+-           function(i) {
+-               y <- v[[i]]
+-               c(alphabetFrequency(y)[1:4],
+-                 GC=countPattern(GC, y),
+-                 CG=countPattern(CG, y))
+-           }
+-    )
+-}
+-someStats(tmpseq[1:10])
+-@
++
++% DELETED
+ 
+ % The above example is Raphael's use case discussed on BioC on Feb 2006.
+ % In Biostrings 1, the equivalent would be:
+@@ -528,14 +232,7 @@ repeats with period less than or equal t
+ For a given package, all the sequences will always have the same number of
+ masks.
+ 
+-<<f1>>=
+-library(BSgenome.Hsapiens.UCSC.hg38.masked)
+-genome <- BSgenome.Hsapiens.UCSC.hg38.masked
+-chrY <- genome$chrY
+-chrY
+-chrM <- genome$chrM
+-chrM
+-@
++% DELETED
+ 
+ The built-in masks are named consistenly across all the BSgenome data packages
+ available in Bioconductor:
+@@ -578,128 +275,7 @@ The {\it masked width} is the total numb
+ that are masked and the {\it masked ratio} is the {\it masked width}
+ divided by the length of the sequence.
+ 
+-To activate a mask, use the \Rmethod{active} replacement method
+-in conjonction with the \Rmethod{masks} method. For example, to
+-activate the RepeatMasker mask, do:
+-<<f2>>=
+-active(masks(chrY))["RM"] <- TRUE
+-chrY
+-@
+-
+-As you can see, the {\it masked width} for all the active masks
+-together (i.e. the total number of nucleotide positions that are
+-masked by at least one active mask) is now the same as for the
+-first mask. This represents a {\it masked ratio} of about 83\%.
+-
+-Now when we use a function that is {\it mask aware}, like
+-\Rfunction{alphabetFrequency}, the masked regions of the
+-input sequence are ignored:
+-<<f3>>=
+-active(masks(chrY)) <- FALSE
+-active(masks(chrY))["AGAPS"] <- TRUE
+-alphabetFrequency(unmasked(chrY))
+-alphabetFrequency(chrY)
+-@
+-
+-This output indicates that, for this chromosome, the assembly gaps
+-correspond exactly to the regions in the sequence that were filled
+-with the letter N. Note that this is not always the case: sometimes
+-Ns, and other IUPAC ambiguity letters, can be found inside the contigs.
+-
+-When coercing a \Rclass{MaskedXString} object to an \Rclass{XStringViews}
+-object, each non-masked region in the original sequence is converted into
+-a view on the sequence:
+-<<f4>>=
+-as(chrY, "XStringViews")
+-@
+-
+-This can be used in conjonction with the \Rmethod{gaps} method to
+-see the gaps between the views i.e. the masked regions themselves:
+-<<f5,results=hide>>=
+-gaps(as(chrY, "XStringViews"))
+-@
+-
+-To extract the sizes of the assembly gaps:
+-<<f6>>=
+-width(gaps(as(chrY, "XStringViews")))
+-@
+-
+-Note that, if applied directly to \Robject{chrY}, \Rmethod{gaps}
+-returns a \Rclass{MaskedDNAString} object with a single mask masking
+-the regions that are not masked in the original object:
+-<<f7>>=
+-gaps(chrY)
+-alphabetFrequency(gaps(chrY))
+-@
+-
+-In fact, for any \Rclass{MaskedDNAString} object, the following should
+-always be \Robject{TRUE}, whatever the masks are:
+-<<f8>>=
+-af0 <- alphabetFrequency(unmasked(chrY))
+-af1 <- alphabetFrequency(chrY)
+-af2 <- alphabetFrequency(gaps(chrY))
+-all(af0 == af1 + af2)
+-@
+-
+-With all chrY masks active:
+-<<f9>>=
+-active(masks(chrY)) <- TRUE
+-af1 <- alphabetFrequency(chrY)
+-af1
+-gaps(chrY)
+-af2 <- alphabetFrequency(gaps(chrY))
+-af2
+-all(af0 == af1 + af2)
+-@
+-
+-Now let's compare three different ways of finding all the occurences of
+-the \Robject{"CANNTG"} consensus sequence in chrY. The Ns in this
+-pattern need to be treated as wildcards i.e. they must match any
+-letter in the subject.
+-
+-Without the mask feature, the first way to do it would be to use
+-the \Robject{fixed=FALSE} option in the call to \Rfunction{matchPattern}
+-(or \Rfunction{countPattern}):
+-<<f10>>=
+-Ebox <- "CANNTG"
+-active(masks(chrY)) <- FALSE
+-countPattern(Ebox, chrY, fixed=FALSE)
+-@
+-
+-The problem with this method is that the Ns in the subject
+-are also treated as wildcards hence the abnormally high number of
+-matches.
+-A better method is to specify the {\it side} of the matching problem
+-(i.e. {\it pattern} or {\it subject}) where the Ns should be treated
+-as wildcards:
+-<<f11>>=
+-countPattern(Ebox, chrY, fixed=c(pattern=FALSE,subject=TRUE))
+-@
+-
+-Finally, \Rfunction{countPattern} being {\it mask aware}, this can be
+-achieved more efficiently by just masking the assembly gaps and
+-ambiguities:
+-<<f12>>=
+-active(masks(chrY))[c("AGAPS", "AMB")] <- TRUE
+-alphabetFrequency(chrY, baseOnly=TRUE) # no ambiguities
+-countPattern(Ebox, chrY, fixed=FALSE)
+-@
+-
+-Note that some chromosomes can have Ns outside the assembly gaps:
+-<<f13>>=
+-chr2 <- genome$chr2
+-active(masks(chr2))[-2] <- FALSE
+-alphabetFrequency(gaps(chr2))
+-@
+-so it is recommended to always keep the AMB mask active (in addition to
+-the AGAPS mask) whatever the sequence is.
+-
+-Note that not all functions that work with an \Rclass{XString}
+-input are {\it mask aware} but more will be added in the near future.
+-However, most of the times there is a alternate way to exclude some
+-arbitrary regions from an analysis without having to use {\it mask aware}
+-functions. This is described below in the {\it Hard masking} section.
+-
++% DELETED
+ 
+ 
+ % ---------------------------------------------------------------------------
+@@ -766,16 +342,13 @@ runAnalysis2 <- function(dict0, outfile=
+ Remember that \Rfunction{matchPDict} only works if all the patterns in the
+ input dictionary have the same length so for this 2nd analysis, we will truncate
+ the patterns in \Robject{ce2dict0} to 15 nucleotides:
+-<<e2>>=
+-ce2dict0cw15 <- DNAStringSet(ce2dict0, end=15)
+-@
+ 
+-Now we can run this 2nd analysis and put the results in the
+-\Robject{"ce2dict0cw15_ana2.txt"} file:
+-<<e3>>=
+-runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")
+-@
++% STRANGE THAT ALSO THIS DOES NOT WORK ... ???
++%<<e2>>=
++%ce2dict0cw15 <- DNAStringSet(ce2dict0, end=15)
++%@
+ 
++% DELETED
+ 
+ 
+ % ---------------------------------------------------------------------------
+@@ -788,3 +361,4 @@ sessionInfo()
+ 
+ \end{document}
+ 
++

Added: trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series
===================================================================
--- trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series (rev 0)
+++ trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series 2015-07-17 21:14:54 UTC (rev 19631)
@@ -0,0 +1 @@
+remove_paragraphs_bound_to_fail_from_vignette.patch


_______________________________________________
debian-med-commit mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-med-commit
