Hi,

Perhaps you can try the "sub" function from R. I am not sure whether there is a more efficient way, but it should work.
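A minimal sketch of that idea, assuming the goal is to strip the motif from the start of each read (the thread does not say exactly how "sub" would be applied). It is written in Python rather than R, with re.sub standing in for R's sub; the function name strip_leading_motif is made up for illustration, and the two example reads are copied from the listing in the original question.

```python
import re

# Motif quoted in the thread; anchoring it at the read start is an assumption.
MOTIF = "GGCCACGCGTCGACTAGTAC"


def strip_leading_motif(reads, motif=MOTIF):
    """Remove one leading copy of `motif` from each read.

    Roughly what sub("^MOTIF", "", reads) would do in R.
    Reads that do not start with the motif are returned unchanged.
    """
    pattern = re.compile("^" + re.escape(motif))
    return [pattern.sub("", read, count=1) for read in reads]


reads = [
    "GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG",  # read [5] in the listing: starts with the motif
    "GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC",  # read [3] in the listing: no motif
]
trimmed = strip_leading_motif(reads)
for before, after in zip(reads, trimmed):
    print(len(before), "->", len(after), after)
```

Because the pattern is anchored with "^", a copy of the motif that appears later in a read is left alone; dropping the anchor would remove the first occurrence anywhere in the read instead.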
By the way, if you google the sequence (GGCCACGCGTCGACTAGTAC) you will find it in several papers. I have the impression that it is sometimes used as a primer for generating the first cDNA strand.

Dr. Jose M Muino
Plant Research International B.V.
P.O. Box 619, 6700 AP Wageningen, The Netherlands
Phone: +0317-481122
E-mail: [email protected]
http://www.pri.wur.nl

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Johannes Rainer
> Sent: Tuesday, 9 February 2010 13:37
> To: [email protected]
> Subject: [Bioc-sig-seq] identifying a common motif in a set of sequences
>
> dear all,
>
> I'm wondering if there is already a function implemented in any
> Bioconductor package that can identify a common sequence pattern
> in a set of sequences.
>
> I'm asking this because in my ChIPseq data, out of the 20 million reads
> only about 3 million can be aligned to the (human) genome (using bowtie),
> and, looking at the sequences that cannot be aligned (see below),
> there seem to be certain sequence patterns (like GGCCACGCGTCGACTAGTAC).
> Actually I have absolutely no idea where these sequences could come from.
> They are not adapter or primer sequences, since I've aligned all
> adapter/primer sequences I've got from the provider against these
> sequences.
>
> Is there any way to extract common sequence patterns (like the
> GGCCACGCGTCGACTAGTAC) in an automated manner from these sequences?
> Besides that, did anybody experience the same problem?
>
> bests, jo
>
>   A DNAStringSet instance of length 16196935
>            width seq
>        [1]    36 GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC
>        [2]    36 GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC
>        [3]    36 GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC
>        [4]    36 GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG
>        [5]    36 GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG
>        [6]    36 GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT
>        [7]    36 GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG
>        [8]    36 GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC
>        [9]    36 CAGTCACGGTCAAAAAATACATACTAAACACCTACT
>        ...   ... ...
> [16196927]    36 CAGTCACGGTCTGGCGGNATNNTTTTTGTACTAGTC
> [16196928]    36 TAGCCAGCCAAGCCAGCNAANNCAGCCATCCAGCCA
> [16196929]    36 GCGCCCCTGTCGCGGACNACNNGTAAGCAGCTCTCT
> [16196930]    36 ACTACACCCCTTAGCAANGANNATCTGAGCCTCCAT
> [16196931]    36 ACTACAAGCAAACAGTGNTCNNCTATGGTCCAGATC
> [16196932]    36 GCAGCCACGTCCCGATCNCCNNTTTGAGTGCGTGCG
> [16196933]    36 GGCCACGCGTCGACTAGNACNNCGAAAAATACGACC
> [16196934]    36 GGCCACGCGTCGACTAGTACNNAAAAAACAACGCCT
> [16196935]    36 AGTCACGGTCAAGTAACACANNAACAGAAAACCAAA
>
> --
> Johannes Rainer, PhD
> Bioinformatics Group,
> Division Molecular Pathophysiology,
> Biocenter, Medical University Innsbruck,
> Fritz-Pregl-Str 3/IV, 6020 Innsbruck, Austria
> and
> Tyrolean Cancer Research Institute
> Innrain 66, 6020 Innsbruck, Austria
>
> Tel.: +43 512 570485 13
> Email: [email protected]
>        [email protected]
> URL: http://bioinfo.i-med.ac.at
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
