On Tue, Feb 9, 2010 at 4:36 AM, Johannes Rainer <[email protected]> wrote:

> Dear all,
>
> I'm wondering whether any Bioconductor package already implements a
> function that identifies a common sequence pattern in a set of sequences.
>
> I'm asking because, of the 20 million reads in my ChIP-seq data, only about
> 3 million can be aligned to the (human) genome (using bowtie). Looking at
> the sequences that cannot be aligned (see below), there seem to be certain
> recurring sequence patterns (like GGCCACGCGTCGACTAGTAC). I have absolutely
> no idea where these sequences could come from. They are not adapter or
> primer sequences, since I've aligned all the adapter/primer sequences I got
> from the provider against these reads.
>
> Is there any way to extract common sequence patterns (like
> GGCCACGCGTCGACTAGTAC) from these sequences in an automated manner?
>

Well, my first step would be to call table() on the reads and see how many
sequences are hugely over-represented. Then you could, say, pick the most
frequent sequence and align the rest against that fixed subject using
Biostrings::pairwiseAlignment(), and see what you get out.
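
To make those two steps concrete, here is a rough sketch under some loud
assumptions. It is written in Python rather than R only so that it is
self-contained; Counter stands in for table(), and a plain mismatch count
stands in for a real alignment (it is not what pairwiseAlignment() does,
which handles gaps and scoring). Counting only the first 20 bases instead of
whole reads is also my assumption, based on the pattern length quoted in the
original message. The names PREFIX_LEN, prefix_counts and mismatches are made
up for illustration.

```python
from collections import Counter

# First nine reads from the listing in the quoted message.
reads = [
    "GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC",
    "GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC",
    "GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC",
    "GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG",
    "GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG",
    "GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT",
    "GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG",
    "GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC",
    "CAGTCACGGTCAAAAAATACATACTAAACACCTACT",
]

# Assumption: look at the first 20 bases, the length of the quoted pattern.
PREFIX_LEN = 20


def prefix_counts(seqs, k=PREFIX_LEN):
    """Tabulate the first k bases of every read (a stand-in for table())."""
    return Counter(s[:k] for s in seqs)


def mismatches(a, b):
    """Count positions where two equal-length strings differ (no gaps)."""
    return sum(x != y for x, y in zip(a, b))


# Step 1: find the most over-represented prefix.
counts = prefix_counts(reads)
top_prefix, top_n = counts.most_common(1)[0]

# Step 2: crude substitute for aligning the rest against that fixed subject:
# keep reads whose prefix is within two mismatches of the top prefix.
near = [r for r in reads if mismatches(r[:PREFIX_LEN], top_prefix) <= 2]

print(top_prefix, top_n, len(near))
```

On the full 16 million reads the same idea applies, but in R you would stay
with table() and pairwiseAlignment() on the DNAStringSet itself.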

Michael

> Besides that, has anybody experienced the same problem?
>
> bests, jo
>
>
>  A DNAStringSet instance of length 16196935
>           width seq
>       [1]    36 GGCCCCGCGTCGCCTAGTACTACATAAACAATGACC
>       [2]    36 GGCGATGACCTTCTTGTGACCGTTGTGCATGCCGNC
>       [3]    36 GTTTCCCAGTCACGGTCATGCTTCCTGTTTCCCAGC
>       [4]    36 GTTTCCCAGTCACGGTCGTCCTTTTATTCTGACCTG
>       [5]    36 GGCCACGCGTCGACTAGTACTTAAAAATATCGCACG
>       [6]    36 GGCCACGCGTCGACTAGTACAGAAAAGACCGTGACT
>       [7]    36 GGCCACGCGTCGACTAGTACAAAGGACATCACGCCG
>       [8]    36 GGCCACGCGTCGACTAGTACAGAGTAAACAACGACC
>       [9]    36 CAGTCACGGTCAAAAAATACATACTAAACACCTACT
>       ...   ... ...
> [16196927]    36 CAGTCACGGTCTGGCGGNATNNTTTTTGTACTAGTC
> [16196928]    36 TAGCCAGCCAAGCCAGCNAANNCAGCCATCCAGCCA
> [16196929]    36 GCGCCCCTGTCGCGGACNACNNGTAAGCAGCTCTCT
> [16196930]    36 ACTACACCCCTTAGCAANGANNATCTGAGCCTCCAT
> [16196931]    36 ACTACAAGCAAACAGTGNTCNNCTATGGTCCAGATC
> [16196932]    36 GCAGCCACGTCCCGATCNCCNNTTTGAGTGCGTGCG
> [16196933]    36 GGCCACGCGTCGACTAGNACNNCGAAAAATACGACC
> [16196934]    36 GGCCACGCGTCGACTAGTACNNAAAAAACAACGCCT
> [16196935]    36 AGTCACGGTCAAGTAACACANNAACAGAAAACCAAA
>
> --
> Johannes Rainer, PhD
> Bioinformatics Group,
> Division Molecular Pathophysiology,
> Biocenter, Medical University Innsbruck,
> Fritz-Pregl-Str 3/IV, 6020 Innsbruck, Austria
> and
> Tyrolean Cancer Research Institute
> Innrain 66, 6020 Innsbruck, Austria
>
> Tel.:     +43 512 570485 13
> Email:  [email protected]
>           [email protected]
> URL:   http://bioinfo.i-med.ac.at
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
