Hi,

I'm trying to use vectorstrip on FASTQ files (as a simple way to remove adaptor or primer sequences). However, it seems that on output the FASTQ qualities are lost: they are all set to the double quote (ASCII 34), which in Sanger FASTQ means PHRED quality 1, i.e. essentially random. Is this a known bug (or rather, a missing feature)?
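To spell out the arithmetic behind that reading of the quality line: Sanger FASTQ stores each score as a single character with an ASCII offset of 33. The snippet below is only a small sketch of that decoding; the helper name `sanger_phred` is my own and not part of EMBOSS.

```python
def sanger_phred(char):
    """Decode one Sanger FASTQ quality character (ASCII offset 33)."""
    return ord(char) - 33


def error_probability(phred):
    """Convert a PHRED score to the probability that the base call is wrong."""
    return 10 ** (-phred / 10)


# The double quote that fills the vectorstrip output is ASCII 34, so PHRED 1.
print(ord('"'), sanger_phred('"'))
# For comparison, '!' is ASCII 33, so PHRED 0.
print(ord('!'), sanger_phred('!'))
# PHRED 1 corresponds to a very high error probability.
print(round(error_probability(sanger_phred('"')), 3))
```

So a line made up entirely of double quotes carries essentially no quality information.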
For illustration I am using a Sanger style FASTQ file from the NCBI SRA (short reads originally from Solexa/Illumina), SRR014849.fastq, which you can download from:

ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz

I am pretending "GTTGGAACCG" is a 5' adaptor sequence, and want to find any matches in some FASTQ reads and trim it off, taking only the sequence to the right. For simplicity I'm allowing no mismatches.

Here is the start of the file:

$ head -n 12 SRR014849.fastq
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7...@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9e...@2ea5':>5?:%A;A8A;?9B;D@/=<?7=9<2A8==
@SRR014849.3 EIXKN4201D4ZBL length=119
GGGGGGGGGCTGTTGGCCGAGGTTGGAGTAGCCAGGGGGAAGGCATGGCCAGCCGTTGAGAAATGCTTGTTGAAGTTTTCGATAATAATGGATTTATCGGTGGTGACCGTGTTACCTAG
+SRR014849.3 EIXKN4201D4ZBL length=119
;3.*(&$"";<=...@8a9;<B;B;B;8=<==B;<FB8/'@8B:==<B;A9<<A8=B;==;A=)=<<B;=A9<@7<FB5(<<=<B;<B;:A9=EA0;<;B:<A8=<<@8<<<B;<A99=<
@SRR014849.9 EIXKN4201AL42E length=84
AACATAAAGAGCAATAGACAGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA
+SRR014849.9 EIXKN4201AL42E length=84
B:=8<EA087<;@8<<<8<:8A9=3>5B;4B>+C?,EA09B;@;9E@/EA/E@/B:;1B:B:;A9<5<B;;8EA0<<B;FB6)7

Notice the "adaptor" in the third sequence, SRR014849.9:

AACATAAAGAGCAATAGACAGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA

This should be trimmed to just:

AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA

Using FASTA as output looks fine:

$ vectorstrip -sequence SRR014849.fastq -sformat fastq-sanger -readfile N -alinker "GTTGGAACCG" -blinker "" -osformat fasta -outseq SRR014849_5trimmed.fasta -mismatch 0 -besthits Y -outfile SRR014849_5trimmed.txt
Removes vectors from the ends of nucleotide sequence(s)

$ head -n 2 SRR014849_5trimmed.fasta
>SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA

Using Sanger FASTQ as output, the command runs:

$ vectorstrip -sequence SRR014849.fastq -sformat fastq-sanger -readfile N -alinker "GTTGGAACCG" -blinker "" -osformat fastq-sanger -outseq SRR014849_5trimmed.fastq -mismatch 0 -besthits Y -outfile SRR014849_5trimmed.txt
Removes vectors from the ends of nucleotide sequence(s)

But the output is missing the quality scores:

$ head -n 4 SRR014849_5trimmed.fastq
@SRR014849.9_from_31_to_84 EIXKN4201AL42E length=84
AAAGGGTTTGAATTCAAACCCTTTGGTTCCAACTTGTCTTGCTTTAGCCTTTTA
+
""""""""""""""""""""""""""""""""""""""""""""""""""""""

Is this something simple to add to vectorstrip? What about other annotation (e.g. running vectorstrip on annotated GenBank or EMBL files)?

Thanks,

Peter C.

P.S. This is with EMBOSS 6.1.0 with a patch from Peter Rice, running on Mac OS X.
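P.P.S. To make the behaviour I was hoping for concrete, here is a minimal Python sketch that trims an exact 5' adaptor match from read SRR014849.9 and slices the quality string at the same position. It says nothing about how vectorstrip works internally; the function name `trim_5prime` is my own, and it handles only the exact-match, first-hit case.

```python
def trim_5prime(seq, qual, adaptor):
    """Return (seq, qual) to the right of the first exact adaptor match.

    Returns None when the adaptor is not found. The quality string is
    sliced at the same offset as the sequence so the two stay in step.
    """
    pos = seq.find(adaptor)
    if pos == -1:
        return None
    cut = pos + len(adaptor)
    return seq[cut:], qual[cut:]


# Read SRR014849.9 from the example file above.
SEQ = ("AACATAAAGAGCAATAGACAGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTGG"
       "TTCCAACTTGTCTTGCTTTAGCCTTTTA")
QUAL = ("B:=8<EA087<;@8<<<8<:8A9=3>5B;4B>+C?,EA09B;@;9E@/EA/E@/B:"
        ";1B:B:;A9<5<B;;8EA0<<B;FB6)7")

trimmed_seq, trimmed_qual = trim_5prime(SEQ, QUAL, "GTTGGAACCG")
print(trimmed_seq)
print(trimmed_qual)
```

The trimmed sequence matches what vectorstrip reports for positions 31 to 84; the point is that the quality line keeps the original scores for those bases rather than being replaced with double quotes.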
