Routing graphs are not part of the message-passing interface.
It is just something that we added in RayPlatform to run large jobs.
Louis Letourneau wrote:
Thank you very much for the complete response.
I already started to play with -route-messages; it really makes a difference!
I thought the default was -connection-type debruijn already, so I didn't
set it.
By default, there is no message routing done. Thus, the communication
graph is complete.
As for the -one-color-per-file option, I was just trying stuff out.
You are right about Guillimin and it isn't fixed yet; I try once in a while.
I did exchange some emails with people from Intel, Inc. The problem is
with QLogic Performance Scaled Messaging (PSM), which corrupts messages.
I will send you an off-list fix for that.
The only part I don't quite get (I'll need to read up on this) is
-routing-graph-degree.
The degree of a vertex is just its number of direct connections.
I'm not an MPI developer yet, but I do program OpenMP jobs and the like,
so I hope the learning curve isn't too steep :-)
The paradigm is different. It is like programming a videogame: you have
a main loop on each MPI rank in which you receive and process incoming
messages, compute stuff, and send messages.
Routing messages is not part of MPI; it is just something I added so
that I could run Assemblathon jobs.
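The main-loop paradigm described above can be sketched roughly like this. This is only an illustration of the idea (Ray/RayPlatform is written in C++ and its real loop is more elaborate); the names `run_rank`, `inbox`, and `handlers` are placeholders invented for this sketch:

```python
# Minimal sketch of a message-driven main loop, in the videogame style
# described above: each iteration receives one message, processes it,
# and queues outgoing messages. Illustrative only, not Ray's code.
from collections import deque

def run_rank(inbox, handlers):
    """Drain the inbox, dispatching each message to its handler."""
    outbox = []
    while inbox:                         # the "main loop" of one rank
        tag, payload = inbox.popleft()   # receive one incoming message
        reply = handlers[tag](payload)   # process it (compute stuff)
        if reply is not None:
            outbox.append(reply)         # send messages to other ranks
    return outbox

# Usage: a single handler that doubles the payload of "WORK" messages.
inbox = deque([("WORK", 1), ("WORK", 2)])
out = run_rank(inbox, {"WORK": lambda x: ("RESULT", x * 2)})
```

In a real MPI rank, the receive and send steps would be non-blocking MPI calls instead of in-memory queues, but the control flow is the same.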
Louis
On 12-06-18 10:42 AM, Sébastien Boisvert wrote:
Hi Louis,
I can see that you are running large jobs.
I think the problem is about message transit latency.
Some definitions:
*latency:* the time required to receive a response from an MPI rank to
which a message was sent. Usually measured in microseconds.
*route:* the path used to transit a message from MPI rank A to MPI rank B.
*degree:* the number of direct connections of an MPI rank.
*diameter:* the maximum number of arcs over all shortest routes in a
communication graph.
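The degree and diameter definitions above can be made concrete with a small sketch that computes both for a toy graph using breadth-first search (an illustration of the definitions, nothing Ray-specific):

```python
# Degree and diameter, per the definitions above, computed on a small
# undirected graph given as an adjacency dict.
from collections import deque

def degree(adj, v):
    return len(adj[v])  # number of direct connections of vertex v

def diameter(adj):
    """Maximum, over all vertex pairs, of the shortest-route length."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:                        # breadth-first search from src
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

# A 4-vertex cycle: every vertex has degree 2, and the diameter is 2
# (opposite vertices are two arcs apart).
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

A complete graph on n vertices has degree n-1 and diameter 1, which is why an unrouted communication graph needs so many links.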
Possible solution
#PBS -N LargeJob3
#PBS -l walltime=120:00:00
#PBS -l nodes=100
-route-messages -connection-type debruijn -routing-graph-degree 8
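To see why -routing-graph-degree 8 helps, here is a back-of-envelope sketch. The exact graph Ray builds for -connection-type debruijn is described in Routing.txt; the shift-and-append neighbor rule below is a generic de Bruijn-style illustration, not Ray's implementation:

```python
# With n ranks and a degree-k routing graph, each rank keeps only k
# links instead of n-1, at the cost of routes of roughly log_k(n) hops.
import math

def debruijn_neighbors(rank, n, k):
    # generic de Bruijn-style shift-and-append rule (illustrative only)
    return sorted({(rank * k + d) % n for d in range(k)})

n, k = 1000, 8
links_complete = n * (n - 1)      # every rank talks to every other rank
links_routed = n * k              # degree-8 routing graph
hops = math.ceil(math.log(n, k))  # rough worst-case route length
```

For 1000 ranks this trades roughly a million direct links for 8000 links and routes of about 4 hops, which is why latency becomes manageable again.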
*Large job latency and its solution*
There is already a solution to that, as running large jobs can be
tricky if the hardware is not designed for that in the first place.
The take-home message is that you need to turn on software message routing
implemented in RayPlatform by just adding an option (see below).
See the Ray documentation
https://github.com/sebhtml/ray/blob/master/Documentation/Very-Large-Jobs.txt
See this document for acceptable communication graphs for routing your
messages:
https://github.com/sebhtml/ray/blob/master/Documentation/Routing.txt
Below, I first describe what I think is happening with your large jobs and
then I provide a possible solution.
Louis Letourneau a écrit :
~300MB genome
2 libraries of pairs (HiSeq 2000) (2nd library on 2 lanes)
96499384 x 2
47407253 x 2
13776832 x 2
2 libraries of 5 kb mates (done with GAIIx)
26300454 x 2
25851346 x 2
Total 419670538 reads, 41967053800 bases, 139x (I know it's a bit high)
On Mammouth, if you know the cluster,
*Comparing super computers*
I tried Mammouth in the past, but our group only has a special
allocation on Colosse.
I did some tests and there were no differences between Opteron(R)
processors on Mammouth and Xeon(R) Nehalem processors on Colosse,
because the Ray code does not perform any floating-point operations.
Both Mammouth and Colosse have the InfiniBand(R) ConnectX(R) from
Mellanox, Inc. for the network interconnect.
We also tried Guillimin, but there was (and still is) a bug in the
middleware provided by Intel, Inc. that caused (and still causes) some
messages to be corrupted. This makes Open-MPI crash, and thus Ray jobs
were unable to complete on Guillimin the last time I checked.
We never got any update about it.
In this case, the network latency on Mammouth will be awful because
the hardware is Mellanox ConnectX. This hardware is good for
transferring large messages (in the megabyte range); otherwise, the
latency is not very good. Colosse has the same hardware. And I think
that there are other issues, such as not using PCI Express at the
appropriate rate.
using 75 nodes, 10 of 24 cores
(because of RAM, I cannot use all 24 cores).
*Large jobs and latency*
My personal record is higher.
I simulated a metagenome with 3 000 000 000 reads and 1000 bacterial
genomes. And the Assemblathon bird dataset was 3 072 136 293 reads.
For large jobs, the problem is about latency, not about anything else,
unless your data contain a lot of sequencing errors.
The Illumina(R) HiSeq(R) 2000 from Illumina, Inc. has an error rate of
0.2-0.5% in my experience.
The Illumina(R) Genome Analyzer(R) IIx from Illumina, Inc. has an error
rate of 0.5-1.5% in my experience.
*Why latency matters*
The latency increases for the following reasons: in Ray, each MPI rank
sends about 5000-8000 messages per second to other ranks; message
reception is done with a round-robin policy; and every MPI rank is
clocked to run at the highest frequency allowed.
For example, on Colosse with the fancy Xeon(R) Nehalem from Intel,
Inc., the granularity is around 500 nanoseconds. In other words, there
are about 1.7 million cycles completed each second for each MPI rank.
On Colosse, the latency is between 15 and 120 microseconds, depending
on the number of processor cores.
You can obtain runtime information about scheduling in the Scheduling/
directory.
The network latency is reported in NetworkTest.txt
When the latency increases, the number of messages sent per second
decreases and the
wall-clock time increases too.
So basically your processor cores are just waiting for interrupts from the
Infiniband cards.
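The latency and message-rate figures above are consistent with a simple bound: if a rank keeps one request in flight at a time, its message rate cannot exceed 1/latency. A quick sketch of that arithmetic:

```python
# Back-of-envelope bound on message rate from round-trip latency,
# assuming one in-flight request per rank (a simplification).
def max_messages_per_second(latency_us):
    return 1_000_000 / latency_us  # microseconds in one second

low = max_messages_per_second(120)   # worst Colosse latency quoted above
high = max_messages_per_second(15)   # best Colosse latency quoted above
```

At 120 microseconds this gives roughly 8300 messages per second, which matches the observed 5000-8000 messages per second per rank; at 15 microseconds the ceiling is far higher, which is why latency dominates the wall-clock time.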
*Large jobs and memory usage*
Also, the memory usage with this number of processor cores will be
ridiculous because of the number of connections implied. With 75 nodes
and 10 processes per node, that is 750 MPI ranks and 562500
communication links.
So you are using 75 nodes with 10 Ray processes per node, for a total
of 750 Ray processes. With a large number of cores like this, you will
either need to run your job on hardware that supports transparent
message routing, or you can just enable the software message routing
already available in Ray. This capability is implemented in the
parallel engine on which Ray runs: RayPlatform.
120 hours of walltime is not enough
command (I removed file names and some PBS settings for brevity):
#PBS -l nodes=100
#PBS -l walltime=120:00:00
export ppn=10
With 100 nodes and 10 processes per node, you have 1000 MPI ranks and
1000000 communication links.
mpiexec -n $[PBS_NUM_NODES*ppn] -npernode $ppn Ray-v2.0.0-rc8/Ray -k 23
-p pairs -p pairs -p pairs -p mates -p mates -o combined_23
-one-color-per-file -write-checkpoints > combined_23.out 2>&1
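The link counts quoted above follow directly from the rank counts; here is the arithmetic, counting ordered (sender, receiver) pairs as in the figures given in this thread:

```python
# Communication links in a complete (unrouted) graph, counting ordered
# pairs of ranks as the figures in this thread do (~ranks squared).
def links(nodes, processes_per_node):
    ranks = nodes * processes_per_node
    return ranks, ranks * ranks

ranks_75, links_75 = links(75, 10)     # 750 ranks -> 562500 links
ranks_100, links_100 = links(100, 10)  # 1000 ranks -> 1000000 links
```

The quadratic growth is the whole problem: going from 750 to 1000 ranks nearly doubles the number of links, while a degree-8 routing graph keeps the link count linear in the number of ranks.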
In the development version, the checkpoints are written to a directory
instead of the current working directory. It is more user-friendly
that way.
First, -one-color-per-file is for colored de Bruijn graphs, which I
think you are not using. These are helpful for profiling, for example,
genera in a microbiome. Our paper about that is under review.
It seems that you are using a lot of processor cores.
Louis
On 6/12/2012 2:47 PM, Sébastien Boisvert wrote:
Yes, it does that.
There will be binary files with the ".ray" extension in the
directory where you launched Ray.
You cannot change the k-mer length when starting from old checkpoints.
The command needs to have the same number of arguments in the same order.
On what kind of dataset are you exceeding time limits ?
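The resume-after-walltime behaviour described above can be sketched generically. This illustrates only the idea of write-then-resume checkpointing; Ray's actual ".ray" checkpoint files are binary and internal, and the JSON file and `run` function here are invented for the sketch:

```python
# Generic checkpoint/resume pattern: on restart, read the last saved
# position and continue from there instead of from step zero.
import json
import os

def run(total_steps, path="checkpoint.json"):
    start = 0
    if os.path.exists(path):               # like -read-checkpoints
        with open(path) as f:
            start = json.load(f)["next_step"]
    for step in range(start, total_steps):
        # ... do one unit of work here ...
        with open(path, "w") as f:         # like -write-checkpoints
            json.dump({"next_step": step + 1}, f)
    return start  # where this invocation resumed from
```

A first run starts at step 0; if it is killed by the walltime limit and relaunched with the same arguments, the next run resumes from the last checkpoint rather than redoing completed work.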
Louis Letourneau wrote:
I saw these options on Ray
Checkpointing
-write-checkpoints
Write checkpoint files
-read-checkpoints
Read checkpoint files
-read-write-checkpoints
Read and write checkpoint files
I'm hitting walltimes on the cluster I'm using and I'm wondering if by
setting:
-read-write-checkpoints
I can resume where Ray got killed because of walltime?
If that's the purpose, what a great feature! :-)
Louis
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users