Re: [galaxy-user] Assemble a consensus genome from NGS data

Benjamin Dickins Fri, 08 Apr 2011 07:44:22 -0700

Hi David,
I'm sorry for the slow response. I recently solved a similar problem and would 
be happy to share more information with you. If your genome is small, I think 
it makes sense to map to a reference and identify variant sites. (In my 
opinion de novo assembly isn't needed; see below.)

A basic approach is: groom FASTQ file -> map with BWA -> filter SAM (uniquely 
mapped reads only) -> SAM-to-BAM -> Generate pileup -> Filter pileup
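
For anyone working outside Galaxy, the "filter SAM" step could be sketched 
roughly as below. This is only an illustration, not the Galaxy tool itself: 
the function name filter_unique and the min_mapq cutoff of 20 are my own 
assumptions, and it approximates "uniquely mapped" by dropping unmapped reads 
(FLAG bit 0x4) and reads with low mapping quality.

```python
def filter_unique(sam_lines, min_mapq=20):
    """Yield SAM header lines plus mapped records with MAPQ >= min_mapq.

    Assumption: mapping quality is used as a proxy for unique placement;
    min_mapq=20 is an arbitrary illustrative cutoff.
    """
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:
            # Read is unmapped.
            continue
        if mapq < min_mapq:
            # Placement is ambiguous or poorly supported.
            continue
        yield line
```

It would be used along the lines of 
"for rec in filter_unique(open('mapped.sam')): ...", where mapped.sam is a 
placeholder file name.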

This gives you a position-by-position summary relative to the reference. The 
last step is important and needs the most care: you can have it print out the 
differences and the total number of non-reference bases at each position. I 
can share some information about thresholding how many of these constitute 
significant evidence that a non-reference base is actually present at that 
position (basically I use a binomial distribution and ask whether the observed 
split of ref/non-ref bases would occur by chance). Given that coverage of 
small genomes tends to be high, your first question about determining the 
actual genome sequence (or the quasispecies consensus if you prefer!) can be 
answered by majority rule: i.e., use a small script (or the tools under the 
"Text Manipulation" heading) to read off the base with the most support at 
each position and then test whether that base matches the base in the 
reference nucleotide column.
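
A minimal sketch of the majority-rule call and the binomial check might look 
like the following. It is not my exact script: the function names, the 
per-position counts dictionary, and the null model (non-reference bases 
arising only from sequencing error at an assumed rate of 0.01) are all 
assumptions made for illustration. It also reports the non-reference fraction 
at the position, which speaks to the "percent variance" part of the question.

```python
import math

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space.

    Log space avoids overflow in the binomial coefficient at high coverage.
    Requires 0 < p < 1.
    """
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

def call_site(ref_base, counts, error_rate=0.01):
    """Return (consensus base, non-reference fraction, p-value) for one site.

    counts maps base -> number of reads supporting it at this position.
    error_rate is an assumed per-base sequencing error rate (hypothetical).
    """
    depth = sum(counts.values())
    if depth == 0:
        # No coverage: no call can be made.
        return "N", 0.0, 1.0
    best = max(counts.values())
    top = sorted(b for b, c in counts.items() if c == best)
    # Majority rule; on a tie, keep the reference base if it is among the top.
    consensus = ref_base if ref_base in top else top[0]
    non_ref = depth - counts.get(ref_base, 0)
    # Chance of seeing at least this many non-ref bases from error alone.
    p_value = binom_tail(non_ref, depth, error_rate)
    return consensus, non_ref / depth, p_value

# Example: a site where the reference says A but most reads say G.
consensus, frac, p = call_site("A", {"A": 2, "G": 98})
print(consensus, frac, consensus != "A")
```

A small p-value suggests that the non-reference bases are unlikely to be 
sequencing error alone; the significance threshold and the error rate should 
be chosen for your own data.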

It's probably also worth thinking about PCR duplicates (from library prep), as 
these could be a significant source of error, but they are tricky to identify 
when many reads in the input will be identical anyway.
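
One rough way to get a feel for how much stacking there is would be to count 
reads that share the same reference, start position, and strand. The sketch 
below is an assumption-laden illustration (the name start_site_counts is 
hypothetical), and it cannot tell PCR duplicates apart from genuinely 
identical molecules, which is exactly the difficulty with deep coverage of a 
small genome. It also uses the leftmost mapped position for both strands, so 
it only approximates the true 5' end of reverse-strand reads.

```python
from collections import Counter

def start_site_counts(sam_lines):
    """Count mapped reads per (reference, leftmost position, strand)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            # Skip blank lines and header lines.
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            # Skip unmapped reads.
            continue
        strand = "-" if flag & 0x10 else "+"
        counts[(fields[2], int(fields[3]), strand)] += 1
    return counts
```

Positions with far more reads than their neighbours are worth a closer look 
before trusting the variant counts there.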

Feel free to get in touch with me if you need a bit more clarity and/or some 
more specifics...

cheers,
Ben

On Apr 4, 2011, at 9:55 PM, Anton Nekrutenko wrote:

>> From: David Matthews <[email protected]>
>> Date: April 4, 2011 6:02:03 PM EDT
>> To: [email protected]
>> Subject: [galaxy-user] Assemble a consensus genome from NGS data
>> 
>> Hi,
>> 
>> Does anyone know how to get a consensus genome from NGS data indicating the 
>> percent variance at each nucleotide? I have a small virus genome with 
>> manyfold coverage from my transcriptomic run. I'd like to know what the 
>> transcriptome indicates is the actual genome plus get a feel for any 
>> hotspots where there appears to be significant variance from the reference 
>> sequence (i.e. because the reference is wrong or perhaps because of frequent 
>> errors in that region due to RNA pol II having a problem accurately 
>> transcribing the sequence).
>> 
>> Many thanks!
>> 
>> David
>> 
>> 
>> ___________________________________________________________
>> The Galaxy User list should be used for the discussion of
>> Galaxy analysis and other features on the public server
>> at usegalaxy.org.  Please keep all replies on the list by
>> using "reply all" in your mail client.  For discussion of
>> local Galaxy instances and the Galaxy source code, please
>> use the Galaxy Development list:
>> 
>>  http://lists.bx.psu.edu/listinfo/galaxy-dev
>> 
>> To manage your subscriptions to this and other Galaxy lists,
>> please use the interface at:
>> 
>>  http://lists.bx.psu.edu/
> 

Benjamin Dickins
Postdoctoral Researcher
Center for Comparative Genomics and Bioinformatics
The Pennsylvania State University
------------------------------------------------------------
302 Wartik Laboratory
University Park, PA 16802, USA
Cell/mobile: +1 814 777 1852
Office tel: +1 814 863 2185
Office fax: +1 814 865 9131
Website: http://www.bendickins.net/
Weblog: http://www.open.ac.uk/blogs/ideasblog/
