> From: Egon Ozer [e-o...@fsm.northwestern.edu]
> Sent: Monday, July 21, 2014 9:30 AM
> To: denovoassembler-us...@lists.sf.net
> Cc: Boisvert, Sebastien
> Subject: Version-dependent performance of Ray
>
> Hi Sebastien,
>
> Apologies in advance for the long email.
>
> I've been struggling with some inconsistent assembly results from Ray and
> thought I'd bring them to your attention and see whether you had any
> thoughts.
>
> I have been doing bacterial whole-genome sequencing for some time,
> primarily working with Pseudomonas aeruginosa. Ray has consistently
> produced great assemblies using 101 bp Illumina paired-end HiSeq reads of a
> variety of P. aeruginosa isolates.
>
> Recently I have started sequencing isolates of another bacterium,
> Acinetobacter baumannii, again with Illumina HiSeq. I became suspicious
> when I started comparing assemblies of very closely related strains of
> A. baumannii that nevertheless seemed to have big differences in their
> horizontally transferred gene carriage. What I found when I performed read
> alignments against these strains with bwa instead of de novo assembly with
> Ray was that several large genomic regions (some up to 70 kb) were present
> in all of the sequenced strains by alignment, but were simply not assembled
> by Ray in half of the sequenced strains. When I went back to check my
> Pseudomonas assemblies, I did not have nearly the same amount of missing
> assembly.
This sounds like a bug/regression.

> So what was different? Although performed at different times, both sets of
> isolates were library prepped and sequenced by the same facility on the
> same sequencer. The first difference that jumped out at me was that the
> Pseudomonas isolates were assembled a while ago using Ray v2.0.0-rc8
> whereas the Acinetobacter strains were assembled using Ray-2.3.0. Perhaps
> herein lay the problem? So I got systematic about it. I had reads from one
> strain of A. baumannii and one strain of P. aeruginosa for which there
> exist published, finished genomic sequences. I decided to assemble reads
> from each using several recent versions of Ray. I picked versions 1.7,
> 2.0.0-rc8, 2.0.0, 2.1.0, 2.2.0, 2.3.0, and 2.3.1. I performed three
> assemblies of each genome using reads randomly downsampled from the
> original read sets to about 80x genomic coverage. To put it a different
> way, here is the order of assembly:
>
> 1) Downsample AB reads to ~ 80x coverage -> "AB read set 1"
> 2) Assemble AB read set 1 with Ray v1.7 through 2.3.1
> 3) Downsample AB reads to ~ 80x coverage -> "AB read set 2"
> 4) Assemble AB read set 2 with Ray v1.7 through 2.3.1
> 5) Downsample AB reads to ~ 80x coverage -> "AB read set 3"
> 6) Assemble AB read set 3 with Ray v1.7 through 2.3.1
>
> Repeat for PA reads
>
> I then used Quast to align the AB and PA contigs >= 200 bp to their
> respective reference genomic sequences. The columns are averages of 3 runs
> and error bars show standard errors of the three runs.
>
> As you can (hopefully) see from the graph, if it shows up, the performance
> of Ray is relatively consistent across versions for the P. aeruginosa
> reads, but the versions after the transition to version 2 all seem to
> perform worse in assembling the A. baumannii genome. Incidentally,
> alignment of the raw reads to their respective reference genomes using bwa
> yields 99.99% coverage of the genomes by the reads.
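For reference, the random downsampling in steps 1) to 6) above could be sketched roughly as follows. This is only an illustration and not the script that was actually used: the helper names (pairs_for_coverage, downsample_pairs), the seeds and the toy read names are made up, and it assumes paired reads already loaded in memory, with the approximate genome size (4.0 Mb) and read length (101 bp) mentioned in this thread. It does plain random subsampling of read pairs, keeping mates together.

```python
import random


def pairs_for_coverage(genome_size_bp, read_length_bp, target_coverage):
    """Number of read pairs needed to reach the target mean coverage."""
    # Each pair contributes two reads of read_length_bp bases.
    return round(target_coverage * genome_size_bp / (2 * read_length_bp))


def downsample_pairs(pairs, n_keep, seed):
    """Randomly keep n_keep read pairs, leaving mates together."""
    rng = random.Random(seed)  # a fixed seed makes each read set reproducible
    if n_keep >= len(pairs):
        return list(pairs)
    keep = sorted(rng.sample(range(len(pairs)), n_keep))
    return [pairs[i] for i in keep]


# Toy data standing in for parsed FASTQ records (hypothetical names).
toy_pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]

# Three independent read sets, as in steps 1), 3) and 5); seeds are arbitrary.
read_sets = [downsample_pairs(toy_pairs, 100, seed) for seed in (1, 2, 3)]

# Approximate A. baumannii case: 4.0 Mb genome, 101 bp reads, 80x coverage.
n_ab = pairs_for_coverage(4_000_000, 101, 80)
print(n_ab)  # 1584158 pairs
```

Using a different seed per replicate gives three distinct read sets, while reusing a seed reproduces the same set, so each Ray version can be run on identical input.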
> In each of these cases, the Ray versions were run with a kmer size of 31
> using downsampled, but otherwise unmodified reads.

When you say downsampled, do you mean digital normalization?

> I have experimented, I think fairly extensively, with read quality trimming
> and error correction and have consistently ended up with worse assembly
> results (N50, contigs vs. reference alignment, etc.) than when simply using
> raw reads as they come off the sequencer. Still haven't figured out why
> that is, but not really the issue at hand today...

Ray has always had this behavior as far as I know: it performs better with
raw reads, without normalization and without trimming (at least when k < 51).

> Ultimately, I decided to try a few different kmer settings on the
> Acinetobacter strain using Ray v2.3.1, again in triplicate with 80x
> downsampled reads. As you can see below, no kmer setting I tried reached
> the coverage levels produced by Ray v1.7.
>
> What else is different between these two bacteria? Pseudomonas aeruginosa
> has a longer genome at about 6.6 Mb compared to 4.0 Mb for A. baumannii.
> P. aeruginosa's GC content is about 66.6% whereas A. baumannii has a GC
> content of about 38.8%. A. baumannii usually has 1 - 4 plasmids and
> P. aeruginosa often does not have any plasmid DNA. The A. baumannii and
> P. aeruginosa libraries studied here were prepped and sequenced at
> different times; however, the fact that Ray v1.7 can assemble much more of
> the genome than the other versions suggests to me that the library prep is
> not the issue. In addition, although I don't have reference strains for all
> of the 150 A. baumannii strains I have sequenced, cross comparisons of
> their assemblies suggest that large amounts of genomic sequence are missing
> from several of the assemblies.
>
> I'm not really sure why there seems to be so much variability between the
> assembler versions for this particular bacterium.
> Can you think of what may have changed during the transition from the
> version 1.x series of Ray to the 2.x series that could account for the
> differences I'm seeing?

The 2.x.x series put the emphasis on metagenomes, so yes, some things were
indeed changed. See "From Ray to Ray Meta" in the discussion in
http://genomebiology.com/2012/13/12/R122#sec3

> Or anything you can think of that I could be doing wrong on my end?

No, your work looks quite good.

> Within the limitations of de novo assembly, I really need to maximize
> genome representation in my contigs for the analyses I am performing to be
> valid.

Yes, obviously.

We compared several assemblers (Spades, Meta-IDBA, Ray 2.x, Meta-Velvet) on
metagenome samples and the total assembly does vary.

Unfortunately for Ray, I am no longer actively working on it. I am now
working on Spate. I can tell you that Spate will have fewer bugs than Ray
(Ray has a few bugs, like any software).

We are running tests daily for the biosal / thorium / spate project.

=> http://biosal.s3.amazonaws.com/index.html

The good thing for Ray users is that Spate will be faster than Ray, and will
use the same command-line options. And it will have fewer bugs too (at least
that's what we want!).

> Thanks,
>
> Egon Ozer, MD PhD
> Northwestern University, Chicago IL

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users