From: [email protected]
I have been running Ray for over 700 hours while trying to assemble
an ~800Mbp eukaryotic fish genome from paired end and mate pair
libraries. It is currently on the “Bidirectional extension of
seeds” step and has been on this step for over a week. Based on
this information, how long do you think this assembly will take?
Please let me know.

I managed to shoehorn a ~600Mbp eukaryotic worm genome into Ray (2.3.2-devel), running on a computer with 12 processing threads and 64GB of memory, plus 48GB of compressed swap and 20GB of SSD swap. It took a couple of days, but not more than a week [well, a bit over a week if you include the 5-or-so other attempts that made the computer blow up from excess memory usage]. If you're using less than 150GB of memory for your assembly, the computer is probably hitting swap and grinding to a halt. If you're using multiple computers, I'd guess the message passing slows things down further.

That attempt (using a 5kb mate-pair library) didn't scaffold successfully. After attending a de novo assembly workshop last week, I discovered that mate-pair libraries might need to be reverse-complemented before many assembly programs will treat them like normal paired-end reads, which probably explains the bad scaffolding.

Ray probably expects a particular orientation for paired-end reads (assuming it does the matching in a similar fashion to almost all other assemblers), so you'll need to reverse-complement your mate-pair reads for it to work properly.
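
In case it's useful, here is a minimal Python sketch of the reverse-complement step. It assumes plain, uncompressed FASTQ with strict 4-line records (no wrapped sequences), and the function names are my own rather than anything from Ray. The quality string is reversed along with the bases so that the two stay aligned. Check what orientation your library prep produces before applying it.

```python
import io

# Translation table for DNA bases, including N and lowercase.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fastq(in_handle, out_handle):
    """Copy 4-line FASTQ records, reverse-complementing each read.

    Returns the number of records written. Assumes strict 4-line
    records; wrapped sequences are not handled.
    """
    count = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of input
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline().rstrip("\n")
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header.rstrip("\n") + "\n")
        out_handle.write(reverse_complement(seq) + "\n")
        out_handle.write(plus + "\n")
        # Reverse the qualities so they still match the reversed bases.
        out_handle.write(qual[::-1] + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demonstration with a made-up read.
    src = io.StringIO("@read1\nAACG\n+\nIIHG\n")
    dst = io.StringIO()
    reverse_complement_fastq(src, dst)
    print(dst.getvalue(), end="")
```

For real data you would pass open file handles for each mate-pair file instead of the in-memory buffers, and run it on both the left and right files.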

 - David

p.s. I've attached the timings and output numbers, which may be of use to you. For example, the "bidirectional extension of seeds" took about the same amount of time as the "detection of assembly seeds" and "merging of redundant paths" (roughly 4 hours each). The useless scaffolding step took the longest, at about 11 hours, or about 2.5 times the length of seed extension.
#Step                                           Date                 Elapsed time                     Since Beginning
Network testing                                 2014-03-26T18:03:27  0 seconds                        0 seconds
Counting sequences to assemble                  2014-03-26T18:08:48  5 minutes, 21 seconds            5 minutes, 21 seconds
Sequence loading                                2014-03-26T18:15:29  6 minutes, 41 seconds            12 minutes, 2 seconds
K-mer counting                                  2014-03-26T21:59:25  3 hours, 43 minutes, 56 seconds  3 hours, 55 minutes, 58 seconds
Coverage distribution analysis                  2014-03-26T22:00:03  38 seconds                       3 hours, 56 minutes, 36 seconds
Graph construction                              2014-03-27T00:07:03  2 hours, 7 minutes, 0 seconds    6 hours, 3 minutes, 36 seconds
Null edge purging                               2014-03-27T01:18:08  1 hours, 11 minutes, 5 seconds   7 hours, 14 minutes, 41 seconds
Selection of optimal read markers               2014-03-27T03:54:01  2 hours, 35 minutes, 53 seconds  9 hours, 50 minutes, 34 seconds
Detection of assembly seeds                     2014-03-27T07:49:50  3 hours, 55 minutes, 49 seconds  13 hours, 46 minutes, 23 seconds
Estimation of outer distances for paired reads  2014-03-27T08:02:43  12 minutes, 53 seconds           13 hours, 59 minutes, 16 seconds
Bidirectional extension of seeds                2014-03-27T12:26:34  4 hours, 23 minutes, 51 seconds  18 hours, 23 minutes, 7 seconds
Merging of redundant paths                      2014-03-27T16:25:44  3 hours, 59 minutes, 10 seconds  22 hours, 22 minutes, 17 seconds
Generation of contigs                           2014-03-27T16:34:55  9 minutes, 11 seconds            22 hours, 31 minutes, 28 seconds
Scaffolding of contigs                          2014-03-28T03:40:06  11 hours, 5 minutes, 11 seconds  1 days, 9 hours, 36 minutes, 39 seconds
Counting sequences to search                    2014-03-28T03:40:06  0 seconds                        1 days, 9 hours, 36 minutes, 39 seconds
Graph coloring                                  2014-03-28T04:03:58  23 minutes, 52 seconds           1 days, 10 hours, 31 seconds
Counting contig biological abundances           2014-03-28T04:14:42  10 minutes, 44 seconds           1 days, 10 hours, 11 minutes, 15 seconds
Counting sequence biological abundances         2014-03-28T04:14:42  0 seconds                        1 days, 10 hours, 11 minutes, 15 seconds
Loading taxons                                  2014-03-28T04:15:13  31 seconds                       1 days, 10 hours, 11 minutes, 46 seconds
Loading tree                                    2014-03-28T04:16:37  1 minutes, 24 seconds            1 days, 10 hours, 13 minutes, 10 seconds
Processing gene ontologies                      2014-03-28T04:17:42  1 minutes, 5 seconds             1 days, 10 hours, 14 minutes, 15 seconds
Computing neighbourhoods                        2014-03-28T04:17:48  6 seconds                        1 days, 10 hours, 14 minutes, 21 seconds
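
If you want to compare step durations from your own run against these, a rough Python sketch like the following turns the "Elapsed time" strings into seconds. The helper name `parse_elapsed` is mine, not part of Ray, and it assumes the strings keep the "N hours, N minutes, N seconds" wording shown in the table above.

```python
import re

# Seconds per unit, keyed by the singular form used in the regex below.
_UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def parse_elapsed(text):
    """Convert e.g. '4 hours, 23 minutes, 51 seconds' to total seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s+(day|hour|minute|second)s?", text):
        total += int(value) * _UNIT_SECONDS[unit]
    return total


# Elapsed times copied from the table above.
extension = parse_elapsed("4 hours, 23 minutes, 51 seconds")
scaffolding = parse_elapsed("11 hours, 5 minutes, 11 seconds")

# Ratio of scaffolding time to seed-extension time.
print(round(scaffolding / extension, 2))  # prints 2.52
```
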
NumberOfPairedLibraries: 1

LibraryNumber: 0
 InputFormat: TwoFiles,Paired
 DetectionType: Manual
 File: normalised/left.norm.fq
  NumberOfSequences: 51961644
 File: normalised/right.norm.fq
  NumberOfSequences: 51961644
 Distribution: RayOutput/LibraryData.xml
 Peak 0
  AverageOuterDistance: 5000
  StandardDeviation: 1500

Contigs >= 100 nt
 Number: 608679
 Total length: 294389695
 Average: 483
 N50: 754
 Median: 279
 Largest: 18639
Contigs >= 500 nt
 Number: 173367
 Total length: 188894190
 Average: 1089
 N50: 1200
 Median: 865
 Largest: 18639
Scaffolds >= 100 nt
 Number: 608678
 Total length: 294390997
 Average: 483
 N50: 754
 Median: 279
 Largest: 23132
Scaffolds >= 500 nt
 Number: 173366
 Total length: 188895492
 Average: 1089
 N50: 1200
 Median: 865
 Largest: 23132
_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
