Hi
I recently did my first amplicon sequencing run and am trying to analyse the data while also learning Linux, shell commands and bioinformatics software.

I amplified 5 genes from different varieties of a plant (100 individuals from one variety per DNA template pool) and then pooled those 5 PCRs together for indexing and sequencing. My sequencing provider has demultiplexed by sample (231 files), and I have managed to merge forward and reverse reads and to demultiplex my samples by forward primer.

I now need to align the reads to a reference sequence, analyse the alignment for SNPs and indels, and calculate the allele frequencies of these. I am hoping that different allele frequencies within a population will allow us to distinguish different varieties.

This week I have used samtools to align the sequences, and today I found "samtools stats", which can give me the frequency of each base per position. That is a basic version of what I want. Unfortunately, it seems to use the base position in the original sequence, not the position in the alignment, so once I come across an indel, all subsequent statistics are no longer useful because the deletion is not taken into account.

In my case, I get useful base frequency data at the start: most positions have just one base 99-100% of the time, and positions with a SNP have 2 bases at about 50% frequency each. At base 110, half of my sequences have a short deletion, so from 110 onwards every position has 2 bases at about 50% frequency each. It looks like a double sequence, because half of the reads are shifted in position by the deletion. Here is what I see: [image not preserved]

Is anyone able to suggest a solution to this, using samtools or any other software? Since I'm new to bioinformatics, I don't know what other tools might be able to do this. Can samtools stats calculate base frequencies based on the position in an alignment rather than the original sequence position?
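To make "position in the alignment" concrete, here is a minimal Python sketch of the kind of counting described above. It assumes each aligned read is available as a (POS, CIGAR, SEQ) tuple, i.e. columns 4, 6 and 10 of a SAM record. The function names and the toy reads are hypothetical and only for illustration; this is not how samtools stats works internally.

```python
import re
from collections import Counter, defaultdict

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def project_read(pos, cigar, seq):
    """Yield (reference_position, base) pairs for one aligned read.

    pos is the 1-based leftmost reference position (SAM column 4).
    Reference positions deleted in the read are reported as '-'.
    Inserted and soft-clipped bases are skipped, because they have
    no reference position of their own.
    """
    ref = pos   # current reference position
    i = 0       # current index into the read sequence
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":        # consumes read and reference
            for k in range(n):
                yield ref + k, seq[i + k]
            ref += n
            i += n
        elif op == "D":        # deletion: consumes reference only
            for k in range(n):
                yield ref + k, "-"
            ref += n
        elif op == "N":        # skipped region: reference only, not counted
            ref += n
        elif op in "IS":       # insertion / soft clip: consumes read only
            i += n
        # H and P consume neither read nor reference


def column_frequencies(reads):
    """Return {reference_position: {base: frequency}} over all reads."""
    counts = defaultdict(Counter)
    for pos, cigar, seq in reads:
        for ref_pos, base in project_read(pos, cigar, seq):
            counts[ref_pos][base] += 1
    return {
        p: {b: c / sum(col.values()) for b, c in col.items()}
        for p, col in sorted(counts.items())
    }


# Toy data (hypothetical): the second read has a 2-base deletion
# at reference positions 4-5.
reads = [
    (1, "8M", "ACGTACGT"),
    (1, "3M2D3M", "ACGCGT"),
]
freqs = column_frequencies(reads)
for position, column in freqs.items():
    print(position, column)
```

Because the deletion is counted as its own symbol at positions 4 and 5, the columns after it stay in register with the reference instead of appearing as a shifted double sequence. The pileup output of samtools mpileup is organised per reference position in a similar way, with '*' standing in for a deleted base.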
Alternatively, can I output a FASTQ file of an alignment in which the deletions are shown as spaces, Ns or something similar?

I could use these aligned base-level statistics to calculate allele frequencies (a bit simplistic, but it lets me largely ignore random sequencing errors). Or perhaps I could map the reads to a set of reference files and calculate the frequency of those whole sequences (probably better, but I don't know how to deal with sequencing errors unless the tool can group sequences that differ by only one SNP).

Thank you very much!

Benjamin

Benjamin Franzmayr PhD
Scientist
M: +64 (0)21 297 5714
E: benja...@slipstream-automation.co.nz
www.Slipstream-Automation.co.nz
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help