Hi Sylvain,

This piece of code should do it. The nrCGRandom changes with the windowlength though (I calculated this value from EPD a long time ago). I use org.apache.oro.text.regex.*;

Bye,
Stein.

=============================================

public void annotateCpG(Sequence seq) throws Exception{
       double nrCGRandom = 14.1048d;
       int windowLength = 200;
       int step=100;
       String motif = "cg";
       //
       PatternMatcher matcher=new Perl5Matcher();
       PatternCompiler compiler = new Perl5Compiler();
       Pattern pattern = null;
       PatternMatcherInput input;
       MatchResult result;
       pattern = compiler.compile(motif);
       int nrOfOcc=0;
       int nrConsec=0;
       double gcContent = 0d;
       double cgValue=0d;
       int index = 1;
       String dnaStr;
       SymbolList window= null;
       int i=1;
       while (index<seq.length()-windowLength){
         window = seq.subList(index,index+windowLength);
         dnaStr = window.seqString();
         input   = new PatternMatcherInput(dnaStr);
         while(matcher.contains(input, pattern)) {
             nrOfOcc++;
         }
         gcContent = GCContent(window);
         cgValue = (nrCgRandom/nrOfOcc);
         if (cgValue<0.6d && gcContent>50.0d){
           nrConsec++;
         }
         else if (nrConsec>0){
           int start = (index-(nrConsec)*step);
           int stop = index;
           nrConsec=0;
           Feature.Template template = new Feature.Template();
           template.type = "misc_feature";
           template.location = new RangeLocation(start,stop);
           template.annotation = new SimpleAnnotation();
           template.annotation.setProperty("type","CpG Island");
           seq.createFeature(template);
         }
         nrOfOcc=0;
         index=index+step;
     }
 }

public double GCContent(SymbolList seq){
int gc = 0;
for (int pos = 1; pos <= seq.length(); ++pos) {
org.biojava.bio.symbol.Symbol sym = seq.symbolAt(pos);
if (sym == org.biojava.bio.seq.DNATools.g() || sym == org.biojava.bio.seq.DNATools.c())
++gc;
}
double content = (gc*100)/seq.length();
return content;
}


Sylvain Foisy wrote:

Hi,

Sorry for the hand holding question. I am trying to wrap the output of a
Perl script that finds CpG islands (cpgi130.pl) using Runtime.exec with
little success. Has anyone ever built a class that can find CpG islands in
Sequence objects and build features with RangeLocations?

Other solution: has anyone ever try to write a wrapper that would use the
output of any program of the EMBOSS suite? It has some CpG island finding
software.

Anyhelp would be much appreciated

P.S.: there has been a call for an algorithm repository on Usenet. Would it
be possible/interesting to create such a repository for classes built with
BioJava that could be of interest to other rookies like me.

Just an idea...

Sylvain


=================================================================== Sylvain Foisy, Ph. D. Directeur - operations / Project Manager BioneQ - Reseau quebecois de bio-informatique U. de Montreal / Genome-Quebec

Addresse postale:

Departement de biochimie
Pavillon principal
2900, boul. Édouard-Montpetit
Montréal (Québec) H3T 1J4

Tel: (514) 343-6111 x.4066
Courriel: [EMAIL PROTECTED]
===================================================================


_______________________________________________ Biojava-l mailing list - [EMAIL PROTECTED] http://biojava.org/mailman/listinfo/biojava-l




-- Stein Aerts [EMAIL PROTECTED] K.U.Leuven ESAT-SCD Belgium http://www.esat.kuleuven.ac.be/~dna/BioI



_______________________________________________
Biojava-l mailing list  -  [EMAIL PROTECTED]
http://biojava.org/mailman/listinfo/biojava-l

Reply via email to