On Tue, Feb 9, 2010 at 4:36 AM, Johannes Rainer <[email protected]> wrote:
> dear all,
>
> I'm wondering if there is already a function implemented in any
> Bioconductor package that allows one to identify a common sequence
> pattern in a set of sequences.
>
> I'm asking this because in my ChIPseq data out of the 20 mio reads only
> about 3 mio can be aligned to the (human) genome (using bowtie), and, by
> looking at the sequences that can not be aligned (see below), there seem
> to be certain sequence patterns (like GGCCACGCGTCGACTAGTAC). Actually I
> have absolutely no idea where these sequences could come from. They are
> not adapter or primer sequences, since I've aligned all adapter/primer
> sequences I've got from the provider against these sequences.
>
> Is there any way to extract common sequence patterns (like
> the GGCCACGCGTCGACTAGTAC) in an automated manner from these sequences?
>

Well, my first step would be to call table() and see how many sequences
are hugely over-represented. Then you could, say, pick the most frequent
sequence and align the rest against that fixed subject using
Biostrings::pairwiseAlignment(). See what you get out.

Michael

> besides that, did anybody experience the same problem?
>
> bests, jo
>
>
>   A DNAStringSet instance of length 16196935
>            width seq
>        [1]    36 GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC
>        [2]    36 GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC
>        [3]    36 GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC
>        [4]    36 GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG
>        [5]    36 GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG
>        [6]    36 GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT
>        [7]    36 GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG
>        [8]    36 GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC
>        [9]    36 CAGTCACGGTCAAAAAATACATACTAAACACCTACT
>        ...   ... ...
> [16196927]    36 CAGTCACGGTCTGGCGGNATNNTTTTTGTACTAGTC
> [16196928]    36 TAGCCAGCCAAGCCAGCNAANNCAGCCATCCAGCCA
> [16196929]    36 GCGCCCCTGTCGCGGACNACNNGTAAGCAGCTCTCT
> [16196930]    36 ACTACACCCCTTAGCAANGANNATCTGAGCCTCCAT
> [16196931]    36 ACTACAAGCAAACAGTGNTCNNCTATGGTCCAGATC
> [16196932]    36 GCAGCCACGTCCCGATCNCCNNTTTGAGTGCGTGCG
> [16196933]    36 GGCCACGCGTCGACTAGNACNNCGAAAAATACGACC
> [16196934]    36 GGCCACGCGTCGACTAGTACNNAAAAAACAACGCCT
> [16196935]    36 AGTCACGGTCAAGTAACACANNAACAGAAAACCAAA
>
> --
> Johannes Rainer, PhD
> Bioinformatics Group,
> Division Molecular Pathophysiology,
> Biocenter, Medical University Innsbruck,
> Fritz-Pregl-Str 3/IV, 6020 Innsbruck, Austria
> and
> Tyrolean Cancer Research Institute
> Innrain 66, 6020 Innsbruck, Austria
>
> Tel.: +43 512 570485 13
> Email: [email protected]
>        [email protected]
> URL: http://bioinfo.i-med.ac.at
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
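
For illustration only, here is a rough sketch of the two steps suggested in the reply above: tabulate the reads, then compare the rest against the most frequent one. It is written in plain Python rather than R, so it is an analogue of table() and not the Bioconductor code itself. Several things in it are assumptions and not part of the original thread: the pattern is assumed to sit in the first 20 bases of each read (PREFIX_LEN), a simple Hamming distance stands in for Biostrings::pairwiseAlignment(), the mismatch cutoff of 2 is arbitrary, and the helper names (prefix_counts, hamming, near_matches) are made up for this example. The input is just the first nine reads from the quoted listing.

```python
from collections import Counter

# First nine reads copied from the quoted DNAStringSet listing.
reads = [
    "GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC",
    "GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC",
    "GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC",
    "GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG",
    "GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG",
    "GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT",
    "GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG",
    "GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC",
    "CAGTCACGGTCAAAAAATACATACTAAACACCTACT",
]

# Assumption: the recurring pattern occupies the first 20 bases of a read.
PREFIX_LEN = 20


def prefix_counts(seqs, k=PREFIX_LEN):
    """Tabulate the first k bases of every read (the table() step)."""
    return Counter(s[:k] for s in seqs)


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def near_matches(seqs, pattern, max_mismatch=2):
    """Return reads whose prefix differs from pattern by at most
    max_mismatch substitutions. This is a crude stand-in for a real
    pairwise alignment: it ignores insertions and deletions."""
    k = len(pattern)
    return [s for s in seqs if hamming(s[:k], pattern) <= max_mismatch]


counts = prefix_counts(reads)
top_prefix, top_count = counts.most_common(1)[0]
hits = near_matches(reads, top_prefix)

print("most frequent prefix:", top_prefix)
print("exact occurrences:   ", top_count)
print("within 2 mismatches: ", len(hits))
```

On these nine reads the most frequent 20-base prefix is GGCCACGCGTCGACTAGTAC (4 exact copies), and read [1] joins it once 2 mismatches are allowed. On the full 16 million reads the same counting idea applies, but a real alignment would also catch shifted or indel-containing copies that this fixed-position comparison misses.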
