Hello Dr. Forster,
I redacted some information and CC'ed the mailing list
as Ray is a sizable open source effort.
See my answers below.
/I think this email will be useful for anyone wanting to try Ray for
large de novo metagenome assemblies./
Forster, Robert wrote:
Sebastion:
I saw some of your posts on the seqanswers forum. I have been trying
to find a way to get a decent assembly of a INFORMATION_REDACTED
metagenome that I want to analyze.
I have four combined treatment samples from INFORMATION_REDACTED. Two
different INFORMATION_REDACTED and two different INFORMATION_REDACTED.
Each sample has a full INFORMATION_REDACTED plate (not paired end) and
an INFORMATION_REDACTED INFORMATION_REDACTED. run (paired-end).
In total there are over 660 million reads.
So I assume you have 660 M/4 = 165 M reads per sample.
Trying to assemble all four together has not been fruitful,
It sounds to me that you have 4 samples and that you should go for 4
separate jobs.
however I was thinking of keeping each treatment separate, doing QC on
the sequences, and then trying a hybrid assembly.
If you go for separate jobs, I think you can try without any quality
control first for a testbed sample.
That would give me four assemblies.
I think this would be better too because later you will want to compare
profiles for markers between those samples.
You may be interested to know that Ray can also compute taxonomic
profiles and gene ontology profiles, although
you need additional files derived from a few databases to achieve that.
We have a paper under review about that.
I have a small Cray cluster with 7 nodes, however 5 of the nodes only
have 24GB of memory.
Ray can run on a distributed-memory cluster like this one.
How many processor cores does each node have?
Assuming 8 processor cores per node, you have 56 processor cores and 168
GB of memory
(assuming that each node has 24 GB).
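The estimate above can be written out as a quick calculation. This is only a sketch: the 8 cores per node and 24 GB per node are the assumptions stated above, not measured values, and the variable names are illustrative.

```python
# Back-of-the-envelope cluster resources, using the assumptions from the text.
nodes = 7
cores_per_node = 8        # assumption: not confirmed for this cluster
memory_per_node_gb = 24   # assumption: 5 of the 7 nodes are known to have 24 GB

total_cores = nodes * cores_per_node
total_memory_gb = nodes * memory_per_node_gb
memory_per_core_gb = total_memory_gb / total_cores

print(total_cores, total_memory_gb, memory_per_core_gb)
```

The total core count is the value passed to mpiexec with -n, as in the commands further down.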
Also, how are your nodes connected? In my experience, you will need at
least 1 Gigabit or 10 Gigabit Ethernet. InfiniBand is even better.
Would it be possible to even try Ray?
I really think it would be possible.
Does your cluster have a job scheduler installed already?
One of the first things you want to measure is the network latency of
your cluster.
To do that with Ray:
*mpiexec -n 56 Ray -o NetworkTest -test-network-only*
This will generate a report in *NetworkTest/NetworkTest.txt*
I am presently doing large-scale testing on the data from the
NIH Human Microbiome Project. This project has 764 samples across various
body sites.
Let me give you a complete example for a microbiome (human stool).
One of the samples is SRS011084, which has 283595168 reads.
This is about 1.7 times your per-sample amount.
The command that I used to assemble and profile this with Ray was the
following.
This command instructs Ray to do de novo assembly as well as profiling
of who is in the sample
(the taxa). The profiling options (-search, -gene-ontology and
-with-taxonomy, shown in pink in the original message) are not necessary
if you just want to go for a de novo assembly without any profiling.
The sample has 454(R) and Illumina(R) data.
*mpiexec -n 64 Ray \
-o \
Assembly \
-k \
31 \
-p \
Sample/SRR061903_1.fastq.gz \
Sample/SRR061903_2.fastq.gz \
-p \
Sample/SRR061904_1.fastq.gz \
Sample/SRR061904_2.fastq.gz \
-p \
Sample/SRR062102_1.fastq.gz \
Sample/SRR062102_2.fastq.gz \
-p \
Sample/SRR062103_1.fastq.gz \
Sample/SRR062103_2.fastq.gz \
-s \
Sample/SRR055711.fastq.gz \
-s \
Sample/SRR055794.fastq.gz \
-s \
Sample/SRR056932.fastq.gz \
-s \
Sample/SRR057013.fastq.gz \
-search \
/rap/nne-790-ab/genomes/EMBL_CDS+GO/EMBL_CDS_Sequences \
-gene-ontology \
/rap/nne-790-ab/genomes/EMBL_CDS+GO/000-Ontologies.txt \
/rap/nne-790-ab/genomes/EMBL_CDS+GO/000-Annotations.txt \
-search \
/rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/ARDB \
-search \
/rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/Bacteria-Genomes \
-search \
/rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/HumanChromosomes \
-search \
/rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/NCBI-Bacteria_DRAFT \
-search \
/rap/nne-790-ab/genomes/RayKmerSearchStuff/last-build/Viruses-Genomes \
-with-taxonomy \
/rap/nne-790-ab/genomes/taxonomy/last-build/Genome-to-Taxon.tsv \
/rap/nne-790-ab/genomes/taxonomy/last-build/TreeOfLife-Edges.tsv \
/rap/nne-790-ab/genomes/taxonomy/last-build/Taxon-Names.tsv*
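For scripting several samples, the command above can be assembled programmatically. The following is a minimal sketch, not part of Ray: the function build_ray_command and its parameter names are hypothetical, and it uses only the options that appear in the command above. It groups the profiling options after the read files, which is a simplification of the original ordering.

```python
import shlex

def build_ray_command(ranks, output_dir, kmer, paired, single,
                      search_dirs=(), ontology_files=None, taxonomy_files=None):
    """Return the mpiexec/Ray argument list; the profiling options are optional."""
    cmd = ["mpiexec", "-n", str(ranks), "Ray", "-o", output_dir, "-k", str(kmer)]
    for left, right in paired:          # one -p per pair of files
        cmd += ["-p", left, right]
    for reads in single:                # one -s per single-end file
        cmd += ["-s", reads]
    for directory in search_dirs:       # profiling: sequences to search
        cmd += ["-search", directory]
    if ontology_files:                  # profiling: gene ontology files
        cmd += ["-gene-ontology", *ontology_files]
    if taxonomy_files:                  # profiling: taxonomy files
        cmd += ["-with-taxonomy", *taxonomy_files]
    return cmd

paired = [(f"Sample/{run}_1.fastq.gz", f"Sample/{run}_2.fastq.gz")
          for run in ["SRR061903", "SRR061904", "SRR062102", "SRR062103"]]
single = [f"Sample/{run}.fastq.gz"
          for run in ["SRR055711", "SRR055794", "SRR056932", "SRR057013"]]

# Assembly only: no -search, -gene-ontology or -with-taxonomy options.
assembly_only = build_ray_command(64, "Assembly", 31, paired, single)
print(shlex.join(assembly_only))
```

Passing the search directories and the ontology and taxonomy files to the same function would add the profiling options back.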
The duration of the job was the following.
* Network testing: 12 seconds
Counting sequences to assemble: 8 minutes, 50 seconds
Sequence loading: 1 hours, 32 minutes, 15 seconds
K-mer counting: 37 minutes, 42 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 1 hours, 51 minutes, 51 seconds
Null edge purging: 11 minutes, 11 seconds
Selection of optimal read markers: 40 minutes, 48 seconds
Detection of assembly seeds: 23 minutes, 23 seconds
Estimation of outer distances for paired reads: 5 minutes, 14 seconds
Bidirectional extension of seeds: 2 hours, 16 seconds
Merging of redundant paths: 53 minutes, 41 seconds
Generation of contigs: 18 seconds
Scaffolding of contigs: 46 minutes, 14 seconds
Counting sequences to search: 8 seconds
Graph coloring: 32 minutes, 59 seconds
Counting contig biological abundances: 4 minutes, 3 seconds
Counting sequence biological abundances: 1 hours, 2 minutes, 53 seconds
Loading taxons: 5 seconds
Loading tree: 7 seconds
Processing gene ontologies: 1 minutes, 3 seconds
Computing neighbourhoods: 0 seconds
Total: 10 hours, 53 minutes, 17 seconds
*
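To see where the time goes, the step durations above can be converted to seconds and ranked. This is a small sketch that parses the text exactly as quoted in this email; the helper name to_seconds is illustrative and the layout of the actual Ray output files is not assumed.

```python
import re

REPORT = """\
Network testing: 12 seconds
Counting sequences to assemble: 8 minutes, 50 seconds
Sequence loading: 1 hours, 32 minutes, 15 seconds
K-mer counting: 37 minutes, 42 seconds
Coverage distribution analysis: 3 seconds
Graph construction: 1 hours, 51 minutes, 51 seconds
Null edge purging: 11 minutes, 11 seconds
Selection of optimal read markers: 40 minutes, 48 seconds
Detection of assembly seeds: 23 minutes, 23 seconds
Estimation of outer distances for paired reads: 5 minutes, 14 seconds
Bidirectional extension of seeds: 2 hours, 16 seconds
Merging of redundant paths: 53 minutes, 41 seconds
Generation of contigs: 18 seconds
Scaffolding of contigs: 46 minutes, 14 seconds
Counting sequences to search: 8 seconds
Graph coloring: 32 minutes, 59 seconds
Counting contig biological abundances: 4 minutes, 3 seconds
Counting sequence biological abundances: 1 hours, 2 minutes, 53 seconds
Loading taxons: 5 seconds
Loading tree: 7 seconds
Processing gene ontologies: 1 minutes, 3 seconds
Computing neighbourhoods: 0 seconds
"""

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}

def to_seconds(duration):
    """Convert text such as '1 hours, 32 minutes, 15 seconds' to seconds."""
    parts = re.findall(r"(\d+) (hours|minutes|seconds)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)

steps = {}
for line in REPORT.splitlines():
    name, duration = line.split(": ", 1)
    steps[name] = to_seconds(duration)

total = sum(steps.values())

# Print the five longest steps with their share of the summed time.
for name, seconds in sorted(steps.items(), key=lambda item: -item[1])[:5]:
    print(f"{seconds / total:6.1%}  {name}")
```

The steps sum to one second less than the reported total, which is consistent with rounding of the individual steps. Seed extension, graph construction and sequence loading together account for about half of the run.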
This was with an InfiniBand network, and the measured software latency was
63.0781 +/- 3.9568 microseconds (from Assembly/NetworkTest.txt).
_With Gigabit Ethernet, the latency will be around 150-350 microseconds._
There were 64 processor cores on 8 nodes with 24 GB each.
You have 7 nodes, but only a bit more than half the number of reads per sample.
So I would say this is feasible, unless your metagenome has
a much higher complexity and diversity than human stool.
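The comparison can be made a little more concrete by looking at reads per processor core. This is a rough sketch under the assumptions already made in this email (reads split evenly over 4 samples, 8 cores per node); it says nothing about memory use, which also depends on the diversity of the sample.

```python
# Reads per MPI rank: the HMP run described above versus the planned run.
hmp_reads = 283_595_168          # sample SRS011084
hmp_cores = 64                   # 8 nodes

planned_reads = 660_000_000 / 4  # assumption: reads split evenly over 4 samples
planned_cores = 7 * 8            # assumption: 8 cores on each of the 7 nodes

hmp_load = hmp_reads / hmp_cores
planned_load = planned_reads / planned_cores

print(f"HMP run:     {hmp_load / 1e6:.2f} M reads per core")
print(f"Planned run: {planned_load / 1e6:.2f} M reads per core")
print(f"Ratio:       {planned_load / hmp_load:.2f}")
```

Under these assumptions, each core in the planned run would handle fewer reads than each core did in the HMP run.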
The resulting assembly was like this:
*./SRS011084-Ray-HMP.64/Assembly/OutputNumbers.txt*
*Contigs >= 100 nt
Number: 489482
Total length: 251953682
Average: 514
N50: 5887
Median: 156
Largest: 584260
Contigs >= 500 nt
*
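For readers unfamiliar with N50, here is a minimal sketch of the usual definition: the contig length at which the contigs of that length or longer contain at least half of the assembled bases. This is an illustration with made-up lengths; Ray's own implementation is not shown here and may differ in details such as tie handling.

```python
def n50(lengths):
    """Return L such that contigs of length >= L hold at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up lengths, not from the assembly
print(n50(toy_contigs))

# The reported average agrees with total length divided by contig count,
# rounded down.
print(251953682 // 489482)
```

An N50 of 5887 next to a median of 156 indicates that most contigs are short, while most of the assembled bases sit in long contigs.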
And Ray dumped this profile (genus level):
(from *Assembly/BiologicalAbundances/0.Profile.TaxonomyRank=genus.tsv*)
*0.484571 Bacteroides
0.247987 Alistipes
0.0925128 Bacteroides
0.0435123 Faecalibacterium
0.0339986 Roseburia
0.0338521 Parabacteroides
0.0271901 Odoribacter
0.00526973 Clostridium
0.00453321 Eubacterium
0.00390861 Ruminococcus
0.00336484 Collinsella
0.00331663 Akkermansia
0.00278411 Blautia
0.00239709 Coprococcus
0.00221393 Subdoligranulum
0.000926717 Dorea
0.000861385 Tannerella
0.000825667 Anaerotruncus
0.000782299 Bacteroides
0.000671917 Clostridium
*
There was no human intervention in the whole process, so
assembly and profiling are fully automated
using a single executable called Ray (Ray.exe on Microsoft
Windows).
I think Ray enables easy assembly and profiling of these microbiome samples,
assuming you have access to compute infrastructure.
Any info appreciated.
Another option for you is obviously to collaborate with a faculty member
from a Canadian university. That would give you access to Compute Canada
compute infrastructure
at no cost.
Robert J. Forster, Ph.D.
Rumen Microbial Ecology/Genomics
Lethbridge Research Centre
Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
5403 1st Ave. S.
Lethbridge, AB T1J 4B1
robert.fors...@agr.gc.ca
Telephone | Téléphone 403-317-2292
Facsimile | Télécopieur 403-382-3156
Government of Canada | Gouvernement du Canada
Sébastien
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users