Hi

 

I have done my first amplicon sequencing run recently and am trying to
analyse the data at the same time as I learn how to use Linux, shell
commands and bioinformatics software. I have amplified 5 genes from
different varieties of a plant (100 individuals from one variety per DNA
template pool) and then pooled those 5 PCRs together for indexing and
sequencing. My sequencing provider has demultiplexed by sample (231 files)
and I've managed to merge forward and reverse reads and to demultiplex my
samples by forward primer. 

I now need to align the reads to a reference sequence and then analyse the
alignment for SNPs and indels and calculate the allele frequencies of these.
I am hoping that different allele frequencies within a population allow us
to distinguish different varieties. 

This week I have used samtools to align the sequences, and today I found
"samtools stats", which reports the frequency of each base per position. That
is a basic version of what I want. Unfortunately, it seems to use the base
position within each original read, not the position in the alignment, so
once I reach an indel, all subsequent statistics are no longer useful because
the deletion is not taken into account. In my case, I get useful base
frequency data at the start: most positions have just one base 99-100% of the
time, and positions with a SNP have 2 bases at about 50% frequency each. At
base 110, half of my sequences have a short deletion, so from 110 onwards all
positions show 2 bases at about 50% frequency each. It is like a double
sequence, where half of the sequences have shifted position because of the
deletion. Here is what I see:



 

Is anyone able to suggest a solution to this? Using samtools or any other
software. Since I'm new to bioinformatics, I don't know what other tools
might be able to do this. 

 

Can samtools stats calculate base frequencies based on the position in an
alignment rather than the original sequence position? Or can I output a
fastq file of an alignment where the deletions are spaces or N's or
something like that? 
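
To make the question concrete, here is a minimal Python sketch of the kind of
counting I am after. It is only an illustration under assumptions: the
function names (pileup_counts, frequencies) and the input tuples are
hypothetical and not part of samtools, and I assume each aligned read is
available as its 1-based leftmost position, CIGAR string and sequence (the
POS, CIGAR and SEQ fields of a SAM record). Bases are tallied by reference
position, and a deleted base is counted as "-".

```python
import re
from collections import Counter, defaultdict

# One CIGAR element: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def pileup_counts(records):
    """Tally bases per reference position.

    records: iterable of (pos, cigar, seq), with pos the 1-based leftmost
    reference position as in a SAM record (hypothetical input layout).
    Returns {ref_pos: Counter({base: count})}; '-' marks a deleted base.
    """
    counts = defaultdict(Counter)
    for pos, cigar, seq in records:
        ref = pos  # current reference coordinate (1-based)
        qry = 0    # current offset into the read sequence (0-based)
        for length, op in CIGAR_RE.findall(cigar):
            n = int(length)
            if op in "M=X":        # consumes both reference and read
                for i in range(n):
                    counts[ref + i][seq[qry + i]] += 1
                ref += n
                qry += n
            elif op == "D":        # deletion: consumes reference only
                for i in range(n):
                    counts[ref + i]["-"] += 1
                ref += n
            elif op == "N":        # skipped region: consumes reference only
                ref += n
            elif op in "IS":       # insertion / soft clip: consumes read only
                qry += n
            # H and P consume neither the reference nor the read
    return counts


def frequencies(counts):
    """Convert per-position counts into per-position fractions."""
    result = {}
    for pos in sorted(counts):
        total = sum(counts[pos].values())
        result[pos] = {b: c / total for b, c in counts[pos].items()}
    return result


# Toy example: the second read has a 2-base deletion at positions 3-4.
toy_reads = [(1, "6M", "ACGTAC"), (1, "2M2D2M", "ACAC")]
for pos, freq in frequencies(pileup_counts(toy_reads)).items():
    print(pos, freq)
```

With counting like this, positions after the deletion line up again, so only
the deleted positions themselves show a mix of a base and "-". Inserted bases
are simply skipped in this sketch, so insertions would need separate handling.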

 

I could use these aligned base-level statistics to calculate allele
frequencies (which is a bit simplistic but allows me to generally ignore
random sequencing errors). Alternatively, I could map the reads to a set of
reference files and calculate the frequency of those whole sequences (which
is probably better, but I don't know how to deal with sequencing errors
unless the software can group sequences that differ by only a single SNP). 
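
As a rough illustration of the second idea, here is a small Python sketch of
grouping whole sequences. Everything in it is an assumption for the sake of
the example: the function names, the 5% abundance cut-off and the rule that a
rare sequence is folded into an abundant one when they differ at exactly one
position. It only compares sequences of equal length, so it does not handle
indels.

```python
from collections import Counter


def mismatches(a, b):
    """Number of differing positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def haplotype_frequencies(seqs, min_fraction=0.05):
    """Frequency of each abundant sequence after absorbing likely errors.

    min_fraction is an arbitrary, hypothetical threshold: sequences at or
    above it are treated as real variants, rarer ones as possible errors.
    """
    raw = Counter(seqs)
    total = sum(raw.values())
    # Abundant sequences, ordered from most to least common.
    abundant = [s for s, n in raw.most_common() if n / total >= min_fraction]
    merged = Counter()
    for seq, n in raw.items():
        if seq in abundant:
            merged[seq] += n
            continue
        # Fold a rare sequence into the most common abundant sequence that
        # is exactly one mismatch away; ties go to the more common one.
        for ref in abundant:
            if mismatches(seq, ref) == 1:
                merged[ref] += n
                break
        else:
            merged["unassigned"] += n
    return {s: n / total for s, n in merged.items()}


# Toy example with two real variants, one likely error and one outlier.
toy = ["ACGT"] * 50 + ["ACGA"] * 45 + ["ACGG"] * 3 + ["TTTT"] * 2
print(haplotype_frequencies(toy))
```

In the toy data, "ACGG" is one mismatch away from both abundant sequences, so
which one it joins depends on the tie-breaking rule. That ambiguity is exactly
the part I am unsure how to handle properly.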

 

Thank you very much!

Benjamin

 

Benjamin Franzmayr PhD

Scientist

 


M: +64 (0)21 297 5714

E:  <mailto:benja...@slipstream-automation.co.nz>
benja...@slipstream-automation.co.nz

 <http://www.Slipstream-Automation.co.nz> www.Slipstream-Automation.co.nz

 

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help