Re: [Bioc-sig-seq] finding matches to a motif

Paul Leo Wed, 26 Jan 2011 15:56:49 -0800

Hi Ivan,
Here is just a straight cut and paste for something I did some time back
if gets the location and the counts of alignments with a PWM  that pass
a certain score for a set of regions . If you have a lot of regions
using XstringViews will speed things up. This is only rough code ...


here "the.chrom" "start" and "end" are large vectors containing the
regions that you wnat to search:
the.chrom=c("chr1","chr1","chr3"...)
starts=c(1000,2000,2000...)
end=c(2000,3000,4000....)

library("BSgenome.Mmusculus.UCSC.mm9")
Mmusculus
all.genomic<-getSeq(Mmusculus, the.chrom, starts, ends)
length(all.genomic)


system.time(x<- XStringViews(all.genomic, "DNAString"))

## $transMat$order0
##            A         C         G         T
## -- 0.2547153 0.2453616 0.2451361 0.2547870


x.labels<-paste(the.chrom,starts,ends,sep=":")
names(x)<-x.labels



#################################
       ## ##
---------------------------------------------------------------------
       ## ## C. SEARCHING THE MINUS STRAND OF A CHROMOSOME
       ## ##
---------------------------------------------------------------------
       ## ## Applying reverseComplement() to the pattern before calling
       ## ## matchPattern() is the recommended way of searching hits on
the
       ## ## minus strand of a chromosome.
     
  

mime.ev<-c( 0.129955,  0.433398,  0.207277,  0.229370,
 0.196881,  0.480832,  0.009747,  0.312541,
 0.978558,  0.020143,  0.000650,  0.000650,
 0.988304,  0.000650,  0.011046 , 0.000000,
 0.000000,  1.000000,  0.000000,  0.000000,
 0.181936,  0.071475,  0.175439,  0.571150,
 0.000000,  0.004548,  0.990903,  0.004548,
 0.232619,  0.421702,  0.045484,  0.300195,
 0.250162,  0.576998,  0.016244,  0.156595,
 0.452242,  0.178687,  0.133853,  0.235218,
 0.198181,  0.338532,  0.190383,  0.272904)  ### this was the one used 

a.pwm<-mime.ev



min.score="85.0%" #85 % for mime 87.5 for crank
all.matches<-matchPWM(a.pwm,x,min.score=min.score) # needs a stringView
to vectorize
the.cov<-coverage(all.matches)
counts<-aggregate(the.cov,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2]
sum(counts!=0)

all.matches.comp<-matchPWM(reverseComplement(a.pwm),x,min.score=min.score)  
#verified using seqlogo as ok
the.cov.comp<-coverage(all.matches.comp)
counts.comp<-aggregate(the.cov.comp,start=start(x),end=end(x),FUN=sum)/dim(a.pwm)[2]
sum(counts.comp!=0)

############################################### Check frequency at a
position 
############################################### 

counts.t<-counts+counts.comp
###################### forward posns #################
loc.f=rep(NA,times=dim(all.data80)[1])
loc.hit<-start(all.matches)
region.start<-start(x)
hits<-grep(TRUE,counts>0) # places where a hits is found
k<-1
no.hits<-counts[counts>0]
for(i in 1:length(hits)){
  loc.f[hits[i]]<-toString((loc.hit[k:(k
+no.hits[i]-1)]-region.start[hits[i]])+1)
  k<-k+no.hits[i]
}
######################################################

###################### revearse posns #################
loc.c=rep(NA,times=dim(all.data80)[1])
loc.hit<-start(all.matches.comp)
region.start<-start(x)
hits<-grep(TRUE,counts.comp>0) # places where a hits is found
k<-1
no.hits<-counts.comp[counts.comp>0]
for(i in 1:length(hits)){
  loc.c[hits[i]]<-toString((loc.hit[k:(k
+no.hits[i]-1)]-region.start[hits[i]])+1)
  k<-k+no.hits[i]
}
######################################################

######combine loc.f and loc.c all w.r.t forward strand posn see doc
after ?reverseComplement
has.c<-!is.na(loc.c)
has.f<-!is.na(loc.f)
loc.all<-loc.c
loc.all[has.c & has.f]<-paste(loc.all[has.c & has.f],loc.f[has.c &
has.f],sep=", ") # both present
loc.all[!has.c & has.f]<-loc.f[!has.c & has.f] # new in loc.f

## cbind(loc.c,loc.f,loc.all)[1:50,] # a check


you can replace some of the for loops above with a vmatch ...






-----Original Message-----
From: Ivan Gregoretti <[email protected]>
To: [email protected]
Subject: [Bioc-sig-seq] finding matches to a motif
Date: Tue, 25 Jan 2011 10:36:14 -0500


Hello everybody,

Introduction:
I have a Position Weigh Matrix that characterises the binding DNA
sequence of a novel transcription factor.

Goal:
I want to find the genomic positions that are good matches for my PWM.

Question:
What are my options among the Bioconductor packages?

Thank you,

Ivan


Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] finding matches to a motif

Reply via email to