Hello Paul, First, thank you for sharing your code.
Right now, I am using a work flow of my own creation that is philosophically identical to your solution. However, I still have a question: How did you handle true and false positives? For instance, by lowering min.score you can get matchPWM to report thousands of matches. There is a point at which the naked eye can tell that some matches are false positives. Have you or anybody designed a statistically defensible strategy to call reliable genomic targets starting from a PWM? By defensible I mean publishable. Thank you, Ivan Ivan Gregoretti, PhD National Institute of Diabetes and Digestive and Kidney Diseases National Institutes of Health 5 Memorial Dr, Building 5, Room 205. Bethesda, MD 20892. USA. Phone: 1-301-496-1016 and 1-301-496-1592 Fax: 1-301-496-9878 On Wed, Jan 26, 2011 at 6:55 PM, Paul Leo <[email protected]> wrote: > > Hi Ivan, > Here is just a straight cut and paste for something I did some time back if > gets the location and the counts of alignments with a PWM that pass a > certain score for a set of regions . If you have a lot of regions using > XstringViews will speed things up. This is only rough code ... > > here "the.chrom" "start" and "end" are large vectors containing the regions > that you wnat to search: > the.chrom=c("chr1","chr1","chr3"...) > starts=c(1000,2000,2000...) > end=c(2000,3000,4000....) > > library("BSgenome.Mmusculus.UCSC.mm9") > Mmusculus > all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends) > length(all.genomic) > > > system.time(x<- XStringViews(all.genomic, "DNAString")) > > ## $transMat$order0 > ## A C G T > ## -- 0.2547153 0.2453616 0.2451361 0.2547870 > > > x.labels<-paste(the.chrom,starts,ends,sep=":") > names(x)<-x.labels > > > > ################################# > ## ## > --------------------------------------------------------------------- > ## ## C. SEARCHING THE MINUS STRAND OF A CHROMOSOME > ## ## > --------------------------------------------------------------------- > ## ## Applying reverseComplement() to the pattern before calling > ## ## matchPattern() is the recommended way of searching hits on the > ## ## minus strand of a chromosome. > > > > mime.ev<-c( 0.129955, 0.433398, 0.207277, 0.229370, > 0.196881, 0.480832, 0.009747, 0.312541, > 0.978558, 0.020143, 0.000650, 0.000650, > 0.988304, 0.000650, 0.011046 , 0.000000, > 0.000000, 1.000000, 0.000000, 0.000000, > 0.181936, 0.071475, 0.175439, 0.571150, > 0.000000, 0.004548, 0.990903, 0.004548, > 0.232619, 0.421702, 0.045484, 0.300195, > 0.250162, 0.576998, 0.016244, 0.156595, > 0.452242, 0.178687, 0.133853, 0.235218, > 0.198181, 0.338532, 0.190383, 0.272904) ### this was the one used > > a.pwm<-mime.ev > > > > min.score="85.0%" #85 % for mime 87.5 for crank > all.matches<-matchPWM(a.pwm,x,min.score=min.score) # needs a stringView to > vectorize > the.cov<-coverage(all.matches) > counts<-aggregate(the.cov,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2] > sum(counts!=0) > > all.matches.comp<-matchPWM(reverseComplement(a.pwm),x,min.score=min.score) > #verified using seqlogo as ok > the.cov.comp<-coverage(all.matches.comp) > counts.comp<-aggregate(the.cov.comp,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2] > sum(counts.comp!=0) > > ############################################### Check frequency at a position > ############################################### > > counts.t<-counts+counts.comp > ###################### forward posns ################# > loc.f=rep(NA,times=dim(all.data80)[1]) > loc.hit<-start(all.matches) > region.start<-start(x) > hits<-grep(TRUE,counts>0) # places where a hits is found > k<-1 > no.hits<-counts[counts>0] > for(i in 1:length(hits)){ > > loc.f[hits[i]]<-toString((loc.hit[k:(k+no.hits[i]-1)]-region.start[hits[i]])+1) > k<-k+no.hits[i] > } > ###################################################### > > ###################### revearse posns ################# > loc.c=rep(NA,times=dim(all.data80)[1]) > loc.hit<-start(all.matches.comp) > region.start<-start(x) > hits<-grep(TRUE,counts.comp>0) # places where a hits is found > k<-1 > no.hits<-counts.comp[counts.comp>0] > for(i in 1:length(hits)){ > > loc.c[hits[i]]<-toString((loc.hit[k:(k+no.hits[i]-1)]-region.start[hits[i]])+1) > k<-k+no.hits[i] > } > ###################################################### > > ######combine loc.f and loc.c all w.r.t forward strand posn see doc after > ?reverseComplement > has.c<-!is.na(loc.c) > has.f<-!is.na(loc.f) > loc.all<-loc.c > loc.all[has.c & has.f]<-paste(loc.all[has.c & has.f],loc.f[has.c & > has.f],sep=", ") # both present > loc.all[!has.c & has.f]<-loc.f[!has.c & has.f] # new in loc.f > > ## cbind(loc.c,loc.f,loc.all)[1:50,] # a check > > > you can replace some of the for loops above with a vmatch ... > > > > > > > -----Original Message----- > From: Ivan Gregoretti <[email protected]> > To: [email protected] > Subject: [Bioc-sig-seq] finding matches to a motif > Date: Tue, 25 Jan 2011 10:36:14 -0500 > > Hello everybody, > > Introduction: > I have a Position Weigh Matrix that characterises the binding DNA > sequence of a novel transcription factor. > > Goal: > I want to find the genomic positions that are good matches for my PWM. > > Question: > What are my options among the Bioconductor packages? > > Thank you, > > Ivan > > > Ivan Gregoretti, PhD > National Institute of Diabetes and Digestive and Kidney Diseases > National Institutes of Health > 5 Memorial Dr, Building 5, Room 205. > Bethesda, MD 20892. USA. > Phone: 1-301-496-1016 and 1-301-496-1592 > Fax: 1-301-496-9878 > > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
