From: [email protected]
I have been running Ray for over 700 hours while trying to assemble an ~800Mbp eukaryotic fish genome from paired end and mate pair libraries. It is currently on the “Bidirectional extension of seeds” step and has been on this step for over a week. Based on this information, how long do you think this assembly will take? Please let me know.
I managed to shoehorn a ~600Mbp eukaryotic worm genome into Ray (2.3.2-devel), running on a computer with 12 processing threads and 64 GB memory, using 48 GB compressed swap and 20 GB SSD swap. It took a couple of days, but not more than a week [well, a bit over a week if you include the 5-or-so other attempts that caused the computer to blow up from excess memory usage]. If you have less than 150 GB memory available for your assembly, the computer is probably hitting swap and grinding to a halt. If you're using multiple computers, I guess the message passing will be slower.
That attempt (using a 5kb MP library) didn't scaffold successfully. After attending a de-novo workshop last week, I discovered that mate-pair libraries might need to be reverse-complemented to work as normal in many assembly programs, which probably explains the bad scaffolding.
Ray probably expects a particular orientation for paired-end reads (assuming it does the matching in a similar fashion to almost all other assemblers), so you'll need to reverse-complement your mate-pair reads for it to work properly.
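In case it's useful, here is a minimal Python sketch of reverse-complementing a FASTQ file. It assumes plain four-line FASTQ records (no wrapped sequence lines) and uncompressed input; the function names and the output path in the usage note are made up for illustration and are not part of Ray. Whether Ray actually needs this is still my guess, so compare the scaffolding with and without it.

```python
from itertools import islice

# Complement table for DNA bases; N stays N. Other IUPAC codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def revcomp_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples with each read
    reverse-complemented. Assumes four lines per record."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        # The quality string is reversed so each score stays with its base.
        yield header, reverse_complement(seq), plus, qual[::-1]


def revcomp_fastq_file(in_path, out_path):
    """Write a reverse-complemented copy of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for record in revcomp_fastq(src):
            dst.write("\n".join(record) + "\n")
```

For example, revcomp_fastq_file("normalised/left.norm.fq", "normalised/left.norm.rc.fq") would write a flipped copy of the left reads (the output name is hypothetical); the same would be done for the right reads, leaving the read order untouched so the pairs still line up.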
- David

p.s. I've attached the timings and output numbers, which may be of use to you. For example, the "bidirectional extension of seeds" took about the same amount of time as the "detection of assembly seeds" and "merging of redundant paths". The useless scaffolding step took the longest, about 2.5 times the length of seed extension.
#Step                                           Date                 Elapsed time                     Since Beginning
Network testing                                 2014-03-26T18:03:27  0 seconds                        0 seconds
Counting sequences to assemble                  2014-03-26T18:08:48  5 minutes, 21 seconds            5 minutes, 21 seconds
Sequence loading                                2014-03-26T18:15:29  6 minutes, 41 seconds            12 minutes, 2 seconds
K-mer counting                                  2014-03-26T21:59:25  3 hours, 43 minutes, 56 seconds  3 hours, 55 minutes, 58 seconds
Coverage distribution analysis                  2014-03-26T22:00:03  38 seconds                       3 hours, 56 minutes, 36 seconds
Graph construction                              2014-03-27T00:07:03  2 hours, 7 minutes, 0 seconds    6 hours, 3 minutes, 36 seconds
Null edge purging                               2014-03-27T01:18:08  1 hours, 11 minutes, 5 seconds   7 hours, 14 minutes, 41 seconds
Selection of optimal read markers               2014-03-27T03:54:01  2 hours, 35 minutes, 53 seconds  9 hours, 50 minutes, 34 seconds
Detection of assembly seeds                     2014-03-27T07:49:50  3 hours, 55 minutes, 49 seconds  13 hours, 46 minutes, 23 seconds
Estimation of outer distances for paired reads  2014-03-27T08:02:43  12 minutes, 53 seconds           13 hours, 59 minutes, 16 seconds
Bidirectional extension of seeds                2014-03-27T12:26:34  4 hours, 23 minutes, 51 seconds  18 hours, 23 minutes, 7 seconds
Merging of redundant paths                      2014-03-27T16:25:44  3 hours, 59 minutes, 10 seconds  22 hours, 22 minutes, 17 seconds
Generation of contigs                           2014-03-27T16:34:55  9 minutes, 11 seconds            22 hours, 31 minutes, 28 seconds
Scaffolding of contigs                          2014-03-28T03:40:06  11 hours, 5 minutes, 11 seconds  1 days, 9 hours, 36 minutes, 39 seconds
Counting sequences to search                    2014-03-28T03:40:06  0 seconds                        1 days, 9 hours, 36 minutes, 39 seconds
Graph coloring                                  2014-03-28T04:03:58  23 minutes, 52 seconds           1 days, 10 hours, 31 seconds
Counting contig biological abundances           2014-03-28T04:14:42  10 minutes, 44 seconds           1 days, 10 hours, 11 minutes, 15 seconds
Counting sequence biological abundances         2014-03-28T04:14:42  0 seconds                        1 days, 10 hours, 11 minutes, 15 seconds
Loading taxons                                  2014-03-28T04:15:13  31 seconds                       1 days, 10 hours, 11 minutes, 46 seconds
Loading tree                                    2014-03-28T04:16:37  1 minutes, 24 seconds            1 days, 10 hours, 13 minutes, 10 seconds
Processing gene ontologies                      2014-03-28T04:17:42  1 minutes, 5 seconds             1 days, 10 hours, 14 minutes, 15 seconds
Computing neighbourhoods                        2014-03-28T04:17:48  6 seconds                        1 days, 10 hours, 14 minutes, 21 seconds

NumberOfPairedLibraries: 1

LibraryNumber: 0
 InputFormat: TwoFiles,Paired
 DetectionType: Manual
 File: normalised/left.norm.fq
  NumberOfSequences: 51961644
 File: normalised/right.norm.fq
  NumberOfSequences: 51961644
 Distribution: RayOutput/LibraryData.xml
 Peak 0
  AverageOuterDistance: 5000
  StandardDeviation: 1500

Contigs >= 100 nt
 Number: 608679
 Total length: 294389695
 Average: 483
 N50: 754
 Median: 279
 Largest: 18639
Contigs >= 500 nt
 Number: 173367
 Total length: 188894190
 Average: 1089
 N50: 1200
 Median: 865
 Largest: 18639
Scaffolds >= 100 nt
 Number: 608678
 Total length: 294390997
 Average: 483
 N50: 754
 Median: 279
 Largest: 23132
Scaffolds >= 500 nt
 Number: 173366
 Total length: 188895492
 Average: 1089
 N50: 1200
 Median: 865
 Largest: 23132
_______________________________________________ Denovoassembler-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
