Re: [Denovoassembler-users] RAY and memory requirements for metagenomes

Sébastien Boisvert Thu, 21 Jun 2012 13:17:57 -0700

Hello Dr. Forster,


I redacted some information and CC'ed the mailing list
as Ray is a sizable open source effort.


See my answers below.

/I think this email will be useful for anyone wanting to try Ray for
large de novo metagenome assemblies./

Forster, Robert wrote:

Sebastion:
I saw some of your posts on the seqanswers forum. I have been trying to find a way to get a decent assembly of a INFORMATION_REDACTED metagenome that I want to analyze.
I have four combined treatment samples from INFORMATION_REDACTED. Two different INFORMATION_REDACTED and two different INFORMATION_REDACTED. Each sample has a full INFORMATION_REDACTED plate (not paired end) and an INFORMATION_REDACTED INFORMATION_REDACTED. run (paired-end).

In total there are over 660 million reads.


So I assume you have 660 M/4 = 165 M reads per sample.

Trying to assemble all four together has not been fruitful,

It sounds to me like you have 4 samples and should run 4 separate jobs.

however I was thinking of keeping each treatment separate, doing QC on the sequences, and then trying a hybrid assembly.

If you go for separate jobs, I think you can first try a single sample as a test bed, without any quality control.

 That would give me four assemblies.

I think this would be better too, because later you will want to compare profiles for markers between those samples.

You may be interested to know that Ray can also compute taxonomic profiles and gene ontology profiles, although you need additional files derived from a few databases to achieve that.

We have a paper under review about that.

I have a small Cray cluster with 7 nodes, however 5 of the nodes only have 24GB of memory.



Ray can run on this kind of distributed-memory cluster.


How many processor cores do you have on each node?

Assuming 8 processor cores per node, you have 56 processor cores and 168 GB of memory (assuming that each node has 24 GB).
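
The arithmetic above can be written out as a short sketch. The 8 cores and 24 GB per node are the assumptions stated in this message, not measured values, and the function name is only for illustration.

```python
def cluster_totals(nodes, cores_per_node, gb_per_node):
    """Return (total cores, total memory in GB, memory per core in GB)."""
    total_cores = nodes * cores_per_node
    total_gb = nodes * gb_per_node
    return total_cores, total_gb, total_gb / total_cores

# Assumed values from the message: 7 nodes, 8 cores and 24 GB per node.
cores, mem_gb, gb_per_core = cluster_totals(nodes=7, cores_per_node=8, gb_per_node=24)
print(f"{cores} cores, {mem_gb} GB in total, {gb_per_core:.1f} GB per core")
```

The total core count is the value to pass to mpiexec -n when one process is started per core.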

Also, how are your nodes connected? In my experience, you will need at least 1 Gigabit or 10 Gigabit Ethernet. InfiniBand is even better.

 Would it be possible to even try Ray?


I really think it would be possible.

Does your cluster have a job scheduler installed already?

One of the first things you want to measure is the network latency of your cluster.


To do that with Ray:

 mpiexec -n 56 Ray -o NetworkTest -test-network-only

This will generate a report in *NetworkTest/NetworkTest.txt*.


I am presently doing large-scale testing on the data from the
NIH Human Microbiome Project. This project has 764 samples across various
body sites.

Let me give you a complete example for a microbiome (human stool).

One of the samples is SRS011084, which has 283595168 reads.
This is about 1.7 times your per-sample amount.

The command that I used to assemble and profile this sample with Ray is the following. It instructs Ray to do the de novo assembly as well as the profiling of who is in the sample (the taxons). The pink part of the command (the profiling options: -search, -gene-ontology and -with-taxonomy) is not necessary if you just want a de novo assembly without any profiling. The sample has 454(R) and Illumina(R) data.



mpiexec -n 64 Ray \
 -o Assembly \
 -k 31 \
 -p Sample/SRR061903_1.fastq.gz Sample/SRR061903_2.fastq.gz \
 -p Sample/SRR061904_1.fastq.gz Sample/SRR061904_2.fastq.gz \
 -p Sample/SRR062102_1.fastq.gz Sample/SRR062102_2.fastq.gz \
 -p Sample/SRR062103_1.fastq.gz Sample/SRR062103_2.fastq.gz \
 -s Sample/SRR055711.fastq.gz \
 -s Sample/SRR055794.fastq.gz \
 -s Sample/SRR056932.fastq.gz \
 -s Sample/SRR057013.fastq.gz \
 -search /rap/nne-790-ab/genomes/EMBL_CDS+GO/EMBL_CDS_Sequences \
 -gene-ontology \
 /rap/nne-790-ab/genomes/EMBL_CDS+GO/000-Ontologies.txt \
 /rap/nne-790-ab/genomes/EMBL_CDS+GO/000-Annotations.txt \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/ARDB \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/Bacteria-Genomes \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/HumanChromosomes \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/NCBI-Bacteria_DRAFT \
 -search /rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/Viruses-Genomes \
 -with-taxonomy \
 /rap/nne-790-ab/genomes/taxonomy/last-build/Genome-to-Taxon.tsv \
 /rap/nne-790-ab/genomes/taxonomy/last-build/TreeOfLife-Edges.tsv \
 /rap/nne-790-ab/genomes/taxonomy/last-build/Taxon-Names.tsv
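
For running four separate per-sample jobs, it may be convenient to generate the assembly-only part of this command from file lists. The sketch below is one possible way to do that in Python; the helper name build_ray_command is hypothetical, and only the options that appear in the command above (-o, -k, -p, -s) are used.

```python
import shlex

def build_ray_command(ranks, output_dir, kmer, paired, single):
    """Build an mpiexec/Ray argument list for a de novo assembly without profiling."""
    cmd = ["mpiexec", "-n", str(ranks), "Ray", "-o", output_dir, "-k", str(kmer)]
    for left, right in paired:
        cmd += ["-p", left, right]   # one -p per pair of paired-end files
    for reads in single:
        cmd += ["-s", reads]         # one -s per single-end file
    return cmd

# File names taken from the example command in this message.
paired = [(f"Sample/{run}_1.fastq.gz", f"Sample/{run}_2.fastq.gz")
          for run in ("SRR061903", "SRR061904", "SRR062102", "SRR062103")]
single = [f"Sample/{run}.fastq.gz"
          for run in ("SRR055711", "SRR055794", "SRR056932", "SRR057013")]

cmd = build_ray_command(64, "Assembly", 31, paired, single)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True), or the printed line can be pasted into a job script for the scheduler.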


The duration of the job was the following.

 Network testing: 12 seconds
 Counting sequences to assemble: 8 minutes, 50 seconds
 Sequence loading: 1 hours, 32 minutes, 15 seconds
 K-mer counting: 37 minutes, 42 seconds
 Coverage distribution analysis: 3 seconds
 Graph construction: 1 hours, 51 minutes, 51 seconds
 Null edge purging: 11 minutes, 11 seconds
 Selection of optimal read markers: 40 minutes, 48 seconds
 Detection of assembly seeds: 23 minutes, 23 seconds
 Estimation of outer distances for paired reads: 5 minutes, 14 seconds
 Bidirectional extension of seeds: 2 hours, 16 seconds
 Merging of redundant paths: 53 minutes, 41 seconds
 Generation of contigs: 18 seconds
 Scaffolding of contigs: 46 minutes, 14 seconds
 Counting sequences to search: 8 seconds
 Graph coloring: 32 minutes, 59 seconds
 Counting contig biological abundances: 4 minutes, 3 seconds
 Counting sequence biological abundances: 1 hours, 2 minutes, 53 seconds
 Loading taxons: 5 seconds
 Loading tree: 7 seconds
 Processing gene ontologies: 1 minutes, 3 seconds
 Computing neighbourhoods: 0 seconds
 Total: 10 hours, 53 minutes, 17 seconds

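
To see where the time goes, the step durations above can be converted to seconds and ranked. This is a minimal sketch that parses the text exactly as it is printed in this message; it is not a parser for any particular Ray output file, and the names used are only for illustration.

```python
import re

# Step durations copied from the report above (the "Total" line is checked separately).
REPORT = """\
Network testing: 12 seconds
Counting sequences to assemble: 8 minutes, 50 seconds
Sequence loading: 1 hours, 32 minutes, 15 seconds
K-mer counting: 37 minutes, 42 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 1 hours, 51 minutes, 51 seconds
Null edge purging: 11 minutes, 11 seconds
Selection of optimal read markers: 40 minutes, 48 seconds
Detection of assembly seeds: 23 minutes, 23 seconds
Estimation of outer distances for paired reads: 5 minutes, 14 seconds
Bidirectional extension of seeds: 2 hours, 16 seconds
Merging of redundant paths: 53 minutes, 41 seconds
Generation of contigs: 18 seconds
Scaffolding of contigs: 46 minutes, 14 seconds
Counting sequences to search: 8 seconds
Graph coloring: 32 minutes, 59 seconds
Counting contig biological abundances: 4 minutes, 3 seconds
Counting sequence biological abundances: 1 hours, 2 minutes, 53 seconds
Loading taxons: 5 seconds
Loading tree: 7 seconds
Processing gene ontologies: 1 minutes, 3 seconds
Computing neighbourhoods: 0 seconds
"""

UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}

def to_seconds(text):
    """Convert text such as '1 hours, 32 minutes, 15 seconds' to seconds."""
    pairs = re.findall(r"(\d+) (hour|minute|second)", text)
    return sum(int(count) * UNIT_SECONDS[unit] for count, unit in pairs)

steps = {}
for line in REPORT.splitlines():
    name, duration = line.split(": ", 1)
    steps[name] = to_seconds(duration)

total = sum(steps.values())
reported = to_seconds("10 hours, 53 minutes, 17 seconds")
print(f"sum of steps: {total} s, reported total: {reported} s")

# Show the five longest steps with their share of the summed time.
ranked = sorted(steps.items(), key=lambda item: item[1], reverse=True)
for name, seconds in ranked[:5]:
    print(f"{100 * seconds / total:5.1f} %  {name}")
```

The listed steps add up to within a couple of seconds of the reported total. The three longest are the bidirectional extension of seeds, graph construction and sequence loading.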
This was with an InfiniBand network, and the measured software latency was
63.0781 +/- 3.9568 microseconds (from Assembly/NetworkTest.txt).

_With Gigabit ethernet, the latency will be around 150-350 microseconds._

There were 64 processor cores on 8 nodes with 24 GB each.
You have 7 nodes, but only a little more than half the number of reads per sample.

So this is really something possible I would say, unless your metagenome has
a really high complexity and diversity in comparison to human stool.
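
As a rough sanity check of this comparison, the number of reads per GB of aggregate memory can be computed for both setups. This is only a back-of-the-envelope sketch: it assumes 24 GB on every node and 660 M reads split evenly over 4 samples. Memory use in practice depends on the number of distinct k-mers, which is why sample complexity matters, so read counts are a crude proxy.

```python
# Reference run described in this message: sample SRS011084 on 8 nodes.
reference = {"reads": 283_595_168, "nodes": 8, "gb_per_node": 24}

# Assumed figures for one of the four samples on the 7-node cluster.
yours = {"reads": 660_000_000 // 4, "nodes": 7, "gb_per_node": 24}

def reads_per_gb(run):
    """Reads per GB of aggregate memory across all nodes."""
    return run["reads"] / (run["nodes"] * run["gb_per_node"])

ratio = reads_per_gb(yours) / reads_per_gb(reference)
print(f"reference: {reads_per_gb(reference):,.0f} reads/GB")
print(f"yours:     {reads_per_gb(yours):,.0f} reads/GB")
print(f"relative load: {ratio:.2f}")
```

A relative load below 1 means fewer reads per GB than in the reference run, which is consistent with the estimate that a per-sample job should fit.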


The resulting assembly was like this:

*./SRS011084-Ray-HMP.64/Assembly/OutputNumbers.txt*
Contigs >= 100 nt
 Number: 489482
 Total length: 251953682
 Average: 514
 N50: 5887
 Median: 156
 Largest: 584260
Contigs >= 500 nt

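
For readers unfamiliar with the N50 reported above: it is commonly defined as the contig length L such that contigs of length L or longer contain at least half of the total assembled length. The sketch below implements that common definition on made-up lengths; whether Ray computes it in exactly this way (for example in how ties are handled) is an assumption.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

# Made-up contig lengths, for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Because the N50 weights contigs by their length, it can be far above the median when an assembly contains many short contigs, as in the numbers above.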

And Ray dumped this profile (genus level):
(from *Assembly/BiologicalAbundances/0.Profile.TaxonomyRank=genus.tsv*)

0.484571 Bacteroides
0.247987 Alistipes
0.0925128 Bacteroides
0.0435123 Faecalibacterium
0.0339986 Roseburia
0.0338521 Parabacteroides
0.0271901 Odoribacter
0.00526973 Clostridium
0.00453321 Eubacterium
0.00390861 Ruminococcus
0.00336484 Collinsella
0.00331663 Akkermansia
0.00278411 Blautia
0.00239709 Coprococcus
0.00221393 Subdoligranulum
0.000926717 Dorea
0.000861385 Tannerella
0.000825667 Anaerotruncus
0.000782299 Bacteroides
0.000671917 Clostridium

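
The rows above can be summarised with a few lines of Python. This sketch parses the two columns exactly as pasted in this message (proportion, then name); the actual .tsv file may have a different layout, so the parsing is an assumption. Some names occur on more than one row; the sketch adds them up per name, but whether such rows should really be merged depends on what they represent, which is not stated here.

```python
from collections import defaultdict

# Rows copied from the profile above: proportion, then genus name.
PROFILE = """\
0.484571 Bacteroides
0.247987 Alistipes
0.0925128 Bacteroides
0.0435123 Faecalibacterium
0.0339986 Roseburia
0.0338521 Parabacteroides
0.0271901 Odoribacter
0.00526973 Clostridium
0.00453321 Eubacterium
0.00390861 Ruminococcus
0.00336484 Collinsella
0.00331663 Akkermansia
0.00278411 Blautia
0.00239709 Coprococcus
0.00221393 Subdoligranulum
0.000926717 Dorea
0.000861385 Tannerella
0.000825667 Anaerotruncus
0.000782299 Bacteroides
0.000671917 Clostridium
"""

def parse_profile(text):
    """Return a list of (proportion, name) tuples from whitespace-separated rows."""
    rows = []
    for line in text.splitlines():
        proportion, name = line.split(maxsplit=1)
        rows.append((float(proportion), name))
    return rows

rows = parse_profile(PROFILE)

# Sum the proportions of rows that share the same name.
by_name = defaultdict(float)
for proportion, name in rows:
    by_name[name] += proportion

covered = sum(proportion for proportion, _ in rows)
print(f"{len(rows)} rows cover {covered:.4f} of the total")
for name, proportion in sorted(by_name.items(), key=lambda item: item[1], reverse=True)[:5]:
    print(f"{proportion:.4f}  {name}")
```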

There was no human intervention in the whole process, so assembly and profiling are fully automated using a single executable called Ray (Ray.exe on Microsoft Windows).


I think Ray enables easy assembly and profiling of these microbiome samples,
assuming you have access to compute infrastructure.

Any info appreciated.


Another option for you is, obviously, to collaborate with a faculty member
at a Canadian university. That would give you access to the Compute Canada
infrastructure at no cost.

Robert J. Forster, Ph.D.
Rumen Microbial Ecology/Genomics
Lethbridge Research Centre
Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
5403 1st Ave. S.
Lethbridge, AB T1J 4B1
robert.fors...@agr.gc.ca <mailto:robert.fors...@agr.gc.ca>
Telephone | Téléphone 403-317-2292
Facsimile | Télécopieur 403-382-3156

Government of Canada | Gouvernement du Canada



                            Sébastien

