Re: [Denovoassembler-users] Version-dependent performance of Ray

Egon Ozer Wed, 20 Aug 2014 19:12:11 -0700

Thanks for the reply, see responses below.


On Aug 20, 2014, at 4:15 PM, Boisvert, Sebastien <boisv...@anl.gov> wrote:

> 
>> From: Egon Ozer [e-o...@fsm.northwestern.edu]
>> Sent: Monday, July 21, 2014 9:30 AM
>> To: denovoassembler-us...@lists.sf.net
>> Cc: Boisvert, Sebastien
>> Subject: Version-dependent performance of Ray
>> 
>> 
>> 
>> Hi Sebastien,
>> 
>> Apologies in advance for the long email.
>> 
>> I've been struggling with some inconsistent assembly results by Ray and 
>> thought I'd bring them to your attention and see whether you had any 
>> thoughts.
>> 
>> 
>> I have been doing bacterial whole-genome sequencing for some 
>> time, primarily working with Pseudomonas aeruginosa.  Ray has consistently 
>> produced great assemblies using 101 bp Illumina paired-end HiSeq reads of a 
>> variety of P. aeruginosa isolates.
>> Recently I have started sequencing isolates of another bacterium, 
>> Acinetobacter baumannii, again with Illumina HiSeq.  I became suspicious 
>> when I started comparing assemblies of very closely related strains of A. 
>> baumannii that nevertheless seemed to have
>> big differences in their horizontally transferred gene carriage.  What I 
>> found when I performed read alignments against these strains with bwa 
>> instead of de novo assembly with Ray was that several large genomic regions 
>> (some up to 70kb) were present in all
>> of the sequenced strains by alignment, but were simply not assembled by Ray 
>> in half of the sequenced strains.  When I went back to check my Pseudomonas 
>> assemblies, I did not have nearly the same amount of missing assembly.
> 
> This sounds like a bug/regression.
> 
>> 
>> So what was different?  Although performed at different times, both sets of 
>> isolates were library prepped and sequenced by the same facility on the same 
>> sequencer.  The first difference that jumped out at me was that the 
>> Pseudomonas isolates were assembled
>> a while ago using Ray v2.0.0-rc8 whereas the Acinetobacter strains were 
>> assembled using Ray-2.3.0. Perhaps herein lay the problem?  So I got 
>> systematic about it.  I had reads from one strain of A. baumannii and one 
>> strain of P. aeruginosa for which there exist
>> published, finished genomic sequences.  I decided to assemble reads from 
>> each using several recent versions of Ray.  I picked versions 1.7, 
>> 2.0.0-rc8, 2.0.0, 2.1.0, 2.2.0, 2.3.0, and 2.3.1.  I performed three 
>> assemblies of each genome using reads randomly
>> downsampled from the original read sets to about 80x genomic coverage.  To 
>> put it a different way, here is the order of assembly:
>> 1) Downsample AB reads to ~ 80x coverage -> "AB read set 1"
>> 2) Assemble AB read set 1 with Ray v1.7 through 2.3.1
>> 3) Downsample AB reads to ~ 80x coverage -> "AB read set 2"
>> 4) Assemble AB read set 2 with Ray v1.7 through 2.3.1
>> 5) Downsample AB reads to ~ 80x coverage -> "AB read set 3"
>> 6) Assemble AB read set 3 with Ray v1.7 through 2.3.1
>> Repeat for PA reads
>> 
>> I then used Quast to align the AB and PA contigs >= 200 bp to their 
>> respective reference genomic sequences.  The columns are averages of 3 runs 
>> and error bars show standard errors of the three runs.
>> 
>> 
>> 
>> 
>> 
>> As you can (hopefully) see from the graph, if it shows up, the performance 
>> of Ray is relatively consistent across versions for the P. aeruginosa reads, 
>> but the versions after the transition to version 2 all seem to perform worse 
>> in assembling the A. baumannii
>> genome.  Incidentally, alignment of the raw reads to their respective 
>> reference genomes using bwa yields 99.99% coverage of the genomes by the 
>> reads.
>> 
>> 
>> In each of these cases, the Ray versions were run with a kmer size of 31 
>> using downsampled, but otherwise unmodified reads.
> 
> When you say downsampled, do you mean digital normalization ?

No, I randomly selected a smaller set of read pairs from the original full set 
of reads.  For example, to achieve roughly 80x coverage of a 4 Mbp 
Acinetobacter genome, I randomly selected 1,584,158 reads from each of the read 
files (containing 11,191,815 reads each), maintaining read pairing.  So I 
downsampled from about 565 x genome coverage to about 80 x genome coverage, 
randomly.
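
A minimal Python sketch of this kind of paired random downsampling, for illustration only (it is not the script actually used for these runs). It assumes 101 bp reads, which is the length that reproduces the numbers above, uncompressed FASTQ files with exactly four lines per record, and mates stored in the same order in both files. All function names are hypothetical.

```python
import random


def coverage(n_pairs, read_len_bp, genome_bp):
    """Approximate fold coverage given by n_pairs read pairs (2 reads per pair)."""
    return 2 * n_pairs * read_len_bp / genome_bp


def pairs_for_coverage(genome_bp, read_len_bp, target_cov):
    """Number of read pairs needed for roughly target_cov-fold coverage."""
    return round(target_cov * genome_bp / (2 * read_len_bp))


def sample_pair_indices(total_pairs, n_pairs, seed=None):
    """Pick n_pairs distinct 0-based record indices uniformly at random."""
    rng = random.Random(seed)
    return set(rng.sample(range(total_pairs), n_pairs))


def subsample_fastq(in_path, out_path, keep):
    """Copy the 4-line FASTQ records whose 0-based index is in `keep`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line_no, line in enumerate(src):
            if line_no // 4 in keep:
                dst.write(line)


# Figures quoted above: 4 Mbp genome, 11,191,815 reads per file.
n_keep = pairs_for_coverage(4_000_000, 101, 80)   # 1584158 pairs
full_cov = coverage(11_191_815, 101, 4_000_000)   # about 565x
```

Pairing is maintained by drawing one index set with `sample_pair_indices` and passing that same set to `subsample_fastq` for both the forward and the reverse read file.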

> 
>> I have experimented I think fairly extensively with read quality trimming 
>> and error correction and have consistently ended up with worse
>> assembly results (N50, contigs vs. reference alignment, etc) than when 
>> simply using raw reads as they come off the sequencer.  Still haven't 
>> figured out why that is, but not really the issue at hand today...
> 
> Ray always had this behavior as far as I know: it performs better with raw 
> reads without normalization and without trimming (at least when k < 51).

That has consistently been my experience.  I’m starting to give myself 
permission not to keep trying different read trimmers or normalizers every 
few months, only to get worse assemblies and go back to just feeding the 
assembler raw reads.

Now that I’m moving into more MiSeq sequencing using Nextera library preps, I’m 
wondering if I’m going to have to pay attention to filtering out transposon 
sequence from my reads.  We’ll see.

> 
>> 
>> Ultimately, I decided to try a few different kmer settings on the 
>> Acinetobacter strain using Ray v2.3.1, again in triplicate with 80x 
>> downsampled reads.  As you can see below, no kmer setting I tried reached 
>> the coverage levels produced by Ray v1.7.
>> 
>> 
>> 
>> 
>> 
>> What else is different between these two bacteria?  Pseudomonas aeruginosa 
>> is a longer genome at about 6.6 Mb compared to 4.0 Mb for A. baumannii.  P. 
>> aeruginosa's GC content is about 66.6% whereas A. baumannii has a GC content 
>> of about 38.8%.  A. baumannii
>> usually has 1 - 4 plasmids and P. aeruginosa often does not have any plasmid 
>> DNA.  The A. baumannii and P. aeruginosa libraries studied here were prepped 
>> and sequenced at different times, however the fact that Ray v1.7 can 
>> assemble much more of the genome
>> than the other versions suggests to me that the library prep is not the 
>> issue.  In addition, although I don't have reference strains of all of the 
>> 150 A. baumannii strains I have sequenced, cross comparisons of their 
>> assemblies suggest that large amounts
>> of genomic sequence are missing from several of the assemblies.
> 
>> 
>> I'm not really sure why there seems to be so much variability between the 
>> assembler versions for this particular bacterium.  Can you think of what may 
>> have changed during the transition from the version 1.x series of Ray to the 
>> 2.x series that could account
>> for the differences I'm seeing?
> 
> The 2.x.x series put the emphasis on metagenomes, so yes some things were 
> changed indeed.
> 
> See "From Ray to Ray Meta" in the discussion in 
> http://genomebiology.com/2012/13/12/R122#sec3
> 
>> Or anything you can think of that I could be doing wrong on my end?
> 
> No, your work looks quite good.
> 
>> Within the limitations of de novo assembly, I really need to maximize genome 
>> representation in my contigs for the analyses I am performing to be valid.
> 
> Yes, obviously.
> 
> We compared several assemblers (Spades, Meta-IDBA, Ray 2.x, Meta-Velvet) on 
> metagenome samples and the total assembly does vary.
> 
> 
> Unfortunately for Ray, I am no longer actively working on it. I am now 
> working on Spate.
> 
> I can tell you that Spate will have fewer bugs than Ray (Ray has a few bugs, 
> like any software).
> 
> We are running tests daily for the biosal / thorium / spate project.
> 
> => http://biosal.s3.amazonaws.com/index.html
> 
> The good thing for Ray users is that Spate will be faster than Ray, and use 
> the same command line options.
> And it will have fewer bugs too (at least that's what we want!).
> 

OK. I’ll keep an eye on Spate's progress.  I found in my early tests with Spades 
that it does a somewhat better job of de novo assembly of Acinetobacter genomes 
than the newer versions of Ray, but I’m looking forward to trying Spate.

>> 
>> Thanks,
>> 
>> Egon Ozer, MD PhD
>> Northwestern University, Chicago IL
>> 
>> 
>> 
>> 
> 
> ------------------------------------------------------------------------------
> Slashdot TV.  
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> _______________________________________________
> Denovoassembler-users mailing list
> Denovoassembler-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users

