I also do something like that in perl...

Regarding MAQ, back when I tried it it could _not_ deal with linker sequences in any useful way... not for small RNA runs at least. It might have improved though... and it is _really_ fast.

Cheers,

Cei

On Thu, 17 Jul 2008, Krys Kelly wrote:

Hi Harris

Thanks for this.  I started doing something like this in Perl.

Regards

Krys


-----Original Message-----
From: Harris A. Jaffee [mailto:[EMAIL PROTECTED]
Sent: 17 July 2008 16:00
To: Krys Kelly
Cc: [email protected]
Subject: Re: [Bioc-sig-seq] Adapter removal

On Jul 17, 2008, at 9:47 AM, Krys Kelly wrote:
I have inherited a pipeline for Solexa sequence data ...  This
seems pretty
crude

could anyone point me in the direction of some code?


[ This may now be silly, but I will send it anyway. ]

On (quasi) UNIX, you can use something like the script below, which
elides the
largest initial portion of our 3' adapter and then the largest
terminal portion
of our 5' adapter, minimal length 3 in both cases, from a data file
where the
read (possibly containing ".") was the 5th field.

To allow 1 error, (using awk, for example) you can blow each pattern
in there
up into a regular expression as in this sed edit command (the \'s
don't belong
but were necessary for some reason):

's,.CGTATGCCGTCTTCTGCTTG$\|T.GTATGCCGTCTTCTGCTTG$\|
TC.TATGCCGTCTTCTGCTTG$\|TCG.ATGCCGTCTTCTGCTTG$\|TCGT.TGCCGTCTTCTGCTTG$
\|TCGTA.GCCGTCTTCTGCTTG$\|TCGTAT.CCGTCTTCTGCTTG$\|
TCGTATG.CGTCTTCTGCTTG$\|TCGTATGC.GTCTTCTGCTTG$\|TCGTATGCC.TCTTCTGCTTG$
\|TCGTATGCCG.CTTCTGCTTG$\|TCGTATGCCGT.TTCTGCTTG$\|
TCGTATGCCGTC.TCTGCTTG$\|TCGTATGCCGTCT.CTGCTTG$\|TCGTATGCCGTCTT.TGCTTG$
\|TCGTATGCCGTCTTC.GCTTG$\|TCGTATGCCGTCTTCT.CTTG$\|
TCGTATGCCGTCTTCTG.TTG$\|TCGTATGCCGTCTTCTGC.TG$\|TCGTATGCCGTCTTCTGCT.G$
\|TCGTATGCCGTCTTCTGCTT.$,>,'

In our data, the whole 3' adapter may occur at the 5' end, making the
read
worthless, or in the interior!

All of this probably translates into R, but this was quicker for me.
-Harris

--------------
#!/bin/sh

DIR=Processed

fgrep -v . |
awk '{print $5}' |
sed \
-e 's,TCGTATGCCGTCTTCTGCTTG$,>,' \
-e 's,TCGTATGCCGTCTTCTGCTT$,>,' \
-e 's,TCGTATGCCGTCTTCTGCT$,>,' \
-e 's,TCGTATGCCGTCTTCTGC$,>,' \
-e 's,TCGTATGCCGTCTTCTG$,>,' \
-e 's,TCGTATGCCGTCTTCT$,>,' \
-e 's,TCGTATGCCGTCTTC$,>,' \
-e 's,TCGTATGCCGTCTT$,>,' \
-e 's,TCGTATGCCGTCT$,>,' \
-e 's,TCGTATGCCGTC$,>,' \
-e 's,TCGTATGCCGT$,>,' \
-e 's,TCGTATGCCG$,>,' \
-e 's,TCGTATGCC$,>,' \
-e 's,TCGTATGC$,>,' \
-e 's,TCGTATG$,>,' \
-e 's,TCGTAT$,>,' \
-e 's,TCGTA$,>,' \
-e 's,TCGT$,>,' \
-e 's,TCG$,>,' \
-e 's,^GTTCAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^TTCAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^TCAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^CAGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^AGAGTTCTACAGTCCGACGATC,<,' \
-e 's,^GAGTTCTACAGTCCGACGATC,<,' \
-e 's,^AGTTCTACAGTCCGACGATC,<,' \
-e 's,^GTTCTACAGTCCGACGATC,<,' \
-e 's,^TTCTACAGTCCGACGATC,<,' \
-e 's,^TCTACAGTCCGACGATC,<,' \
-e 's,^CTACAGTCCGACGATC,<,' \
-e 's,^TACAGTCCGACGATC,<,' \
-e 's,^ACAGTCCGACGATC,<,' \
-e 's,^CAGTCCGACGATC,<,' \
-e 's,^AGTCCGACGATC,<,' \
-e 's,^GTCCGACGATC,<,' \
-e 's,^TCCGACGATC,<,' \
-e 's,^CCGACGATC,<,' \
-e 's,^CGACGATC,<,' \
-e 's,^GACGATC,<,' \
-e 's,^ACGATC,<,' \
-e 's,^CGATC,<,' \
-e 's,^GATC,<,' \
-e 's,^ATC,<,' | tr -d '<>' |
        awk 'length>15' | sort | uniq -c |
        awk '{print >DIR "/" length($2)}' DIR=$DIR

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to