On 25/09/13 10:30 PM, Ranjeev wrote:
>
> I'm trying your suggestion, using -bloom-filter-bits 42576147183,
> calculated per rank (72 ranks in total; almost 5 GB per rank). My total
> distributed memory is 384 GB. Is this correct?
>
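A few lines of Python can sanity-check these numbers. This is a minimal sketch, not part of Ray. It assumes that -bloom-filter-bits counts bits, so each rank's filter needs bits / 8 bytes, and the helper names are hypothetical. With 42576147183 bits, each rank needs about 4.96 GiB, and 72 ranks need about 356.9 GiB (about 383 GB in decimal units). That is nearly all of the 384 GB of distributed memory, with little headroom for anything else.

```python
# Sketch: relate a -bloom-filter-bits value to memory per rank and per job.
# Assumption: one filter bit costs one bit of memory, so bytes = bits / 8.
# Helper names are hypothetical, not Ray functions.
GIB = 1024 ** 3


def bits_to_gib(bits: int) -> float:
    """Memory used by a Bloom filter of `bits` bits, in GiB."""
    return bits / 8 / GIB


def job_total_gib(bits_per_rank: int, ranks: int) -> float:
    """Total Bloom filter memory across all MPI ranks, in GiB."""
    return bits_to_gib(bits_per_rank) * ranks


def bits_for_budget(mib_per_rank: int) -> int:
    """Number of filter bits that fit in a per-rank budget given in MiB."""
    return mib_per_rank * 1024 * 1024 * 8


if __name__ == "__main__":
    per_rank = bits_to_gib(42576147183)      # value from the question
    total = job_total_gib(42576147183, 72)   # 72 MPI ranks
    # prints: per rank: 4.96 GiB, all ranks: 356.9 GiB
    print(f"per rank: {per_rank:.2f} GiB, all ranks: {total:.1f} GiB")
    # prints: 512 MiB per rank -> 4294967296 bits
    print(f"512 MiB per rank -> {bits_for_budget(512)} bits")
```

Going the other way, bits_for_budget(512) gives the bit count for a 512 MiB per-rank budget: 4294967296, which is 2^32.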
72 * 5 GiB is already 360 GiB. I think that is too much.

I fixed the bug in Ray regarding the Bloom filter. The fix is here:
https://github.com/sebhtml/ray/commit/885e3010ccdb587e84b3d43f7a5e598b8f187c6f

So if you are using the git version, you need to pull.

> However, I cannot proceed. I get the following error intermittently. Is this
> an Open MPI issue or a Ray issue?
>
> [[64004,1],3][btl_openib_component.c:3496:handle_wc] from compute-0-0.local
> to: compute-0-2 error polling HP CQ with status LOCAL PROTOCOL ERROR status
> number 4 for wr_id 16f30800 opcode 128 vendor error 52 qp_idx 0
> [[64004,1],56][btl_openib_component.c:3496:handle_wc] from compute-0-2.local
> to: compute-0-1 error polling LP CQ with status REMOTE OPERATION ERROR status
> number 11 for wr_id 13e06b00 opcode 128 vendor error 137 qp_idx 0

openib is the Open MPI component for InfiniBand. This looks like an
InfiniBand issue.

> On Tue, Sep 24, 2013 at 12:20 AM, Sébastien Boisvert
> <sebastien.boisver...@ulaval.ca> wrote:
>
> > On 23/09/13 11:48 AM, VJR VJR wrote:
> > > I think I am using the latest version, 2.2, as shown in the attached log.
> >
> > Hi Ranjeev,
> >
> > "-80285528 bytes" is an integer overflow.
> >
> > The problem was in part fixed in the git repository.
> >
> > But basically, 145745448 reads is a lot of reads for a single MPI rank.
> >
> > Here are 2 possible workarounds that I can offer:
> >
> > 1. Set the number of bits manually with -bloom-filter-bits. For example,
> >    to use 512 MiB of memory for the Bloom filter on each MPI rank, use
> >    -bloom-filter-bits 4294967296.
> >
> > 2. Use more MPI ranks (more processor cores).
> >    If you go from 20 to 200 MPI ranks, everything will be faster, etc.
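A negative size such as "-80285528 bytes" is what you would see if a byte count larger than 2^31 - 1 ended up in a signed 32-bit integer. The sketch below emulates that two's-complement wrap-around in Python. It illustrates the assumption only; it is not Ray's code, and the helper names are hypothetical. Only the low 32 bits of the real request survive, so the log cannot reveal the exact size Ray wanted. 4214681768 bytes (about 3.9 GiB) is the smallest non-negative value that prints as -80285528, and any value differing from it by a multiple of 2^32 would print the same.

```python
# Sketch: emulate storing a byte count in a signed 32-bit integer.
# Assumption: this mimics the two's-complement truncation that C/C++ perform
# on common platforms; it is not taken from Ray's source code.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def to_int32(value: int) -> int:
    """Wrap an arbitrary integer into the signed 32-bit range."""
    return (value - INT32_MIN) % 2 ** 32 + INT32_MIN


def smallest_positive_source(wrapped: int) -> int:
    """Smallest non-negative integer that wraps to `wrapped` (hypothetical helper)."""
    return wrapped % 2 ** 32


if __name__ == "__main__":
    logged = -80285528                         # size printed in the Ray log
    candidate = smallest_positive_source(logged)
    print(candidate, to_int32(candidate))      # prints: 4214681768 -80285528
    print(to_int32(INT32_MAX + 1))             # prints: -2147483648
```

Storing sizes in a 64-bit type (for example uint64_t in C++) avoids this wrap-around for any realistic allocation.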
> > >
> > > Ranjeev
> > >
> > > Sent from my Windows 8 phone
> > > ------------------------------
> > > From: Sébastien Boisvert <sebastien.boisvert.3@ulaval.ca>
> > > Sent: 23/9/2013 9:41 PM
> > > To: Ranjeev <ranj...@um.edu.my>
> > > Cc: s...@boisvert.info; denovoassembler-users@lists.sourceforge.net;
> > > choo siew woh <csw1...@hotmail.com>; Shi Yang Tan <shiyang...@gmail.com>
> > > Subject: Re: Ray network test takes forever
> > >
> > > On 23/09/13 12:37 AM, Ranjeev wrote:
> > > > I managed to get Ray running; however, I get a memory problem that I
> > > > can't really understand.
> > > > I ran the PBS job on three nodes without multiple processes, and the
> > > > error still shows up.
> > >
> > > Hello,
> > >
> > > It seems to me that 145745448 reads is a lot of reads for a single
> > > MPI rank.
> > >
> > > However, "-80285528 bytes" is definitely a bug in Ray.
> > >
> > > Which version of Ray are you using?
> > >
> > > > qsub -l nodes=compute-0-0+compute-0-1+compute-0-2 ray_p.pbs
> > > >
> > > > Attached is the log. Below is the summary:
> > > >
> > > > ***
> > > > Step: Sequence loading
> > > > Date: Mon Sep 23 10:50:19 2013
> > > > Elapsed time: 15 minutes, 57 seconds
> > > > Since beginning: 27 minutes, 2 seconds
> > > > ***
> > > >
> > > > Rank 2 has 145745448 sequence reads (completed)
> > > > Critical exception: The system is out of memory, returned NULL.
> > > > Requested -80285528 bytes of type RAY_MALLOC_TYPE_BLOOM_FILTER
> > > > Critical exception: The system is out of memory, returned NULL.
> > > > Requested -80285528 bytes of type RAY_MALLOC_TYPE_BLOOM_FILTER
> > > > Critical exception: The system is out of memory, returned NULL.
> > > > Requested -80285528 bytes of type RAY_MALLOC_TYPE_BLOOM_FILTER
> > > > --------------------------------------------------------------------------
> > > > mpiexec has exited due to process rank 0 with PID 8831 on
> > > > node compute-0-0.local exiting improperly. There are two reasons this could occur:
> > > >
> > > > 1. this process did not call "init" before exiting, but others in
> > > > the job did. This can cause a job to hang indefinitely while it waits
> > > > for all processes to call "init". By rule, if one process calls "init",
> > > > then ALL processes must call "init" prior to termination.
> > > >
> > > > 2. this process called "init", but exited without calling "finalize".
> > > > By rule, all processes that call "init" MUST call "finalize" prior to
> > > > exiting or it will be considered an "abnormal termination"
> > > >
> > > > This may have caused other processes in the application to be
> > > > terminated by signals sent by mpiexec (as reported here).
> > > > --------------------------------------------------------------------------
> > > >
> > > > *Hope you can help*
> > > >
> > > > Thanks.
> > > >
> > > > On Sun, Sep 22, 2013 at 2:28 PM, Ranjeev <ranj...@um.edu.my> wrote:
> > > >
> > > > > By the way, my assembly of the panda genome terminated with this error:
> > > > > PBS: job killed: node 2 (compute-0-4) requested job terminate, 'EOF'
> > > > > (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> > > > >
> > > > > On Sun, Sep 22, 2013 at 12:21 PM, Ranjeev <ranj...@um.edu.my> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Thanks. I was using Torque and specified only the nodes with
> > > > > > InfiniBand, and it seemed to have added those -n parameters by
> > > > > > itself. By the way, if I have a node with 192 GB of RAM and 24
> > > > > > processors, how many processes per node do you recommend? I notice
> > > > > > that each process consumes considerable RAM, and the node starts to
> > > > > > use swap when I run 24 processes per node.
> > > > > >
> > > > > > On Fri, Sep 20, 2013 at 7:08 PM, Sébastien Boisvert
> > > > > > <sebastien.boisvert.3@ulaval.ca> wrote:
> > > > > >
> > > > > > > On 19/09/13 11:26 PM, Ranjeev wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm using Open MPI and Torque. I managed to test Ray once and it
> > > > > > > > worked fine (on a single node). The next time I ran it, the
> > > > > > > > network test took forever. It doesn't go beyond the last line.
> > > > > > > >
> > > > > > > > Ray command:
> > > > > > > > mpiexec -n 2 Ray \
> > > > > > > > -n \
> > > > > > > > 40 \
> > > > > > >
> > > > > > > Do you want 2 ranks or 40 ranks?
> > > > > > >
> > > > > > > > -p \
> > > > > > > > /home/ranjeev/NTM/L007_R1.fq \
> > > > > > > > /home/ranjeev/NTM/L007_R2.fq \
> > > > > > > > -o \
> > > > > > > > /home/ranjeev/temp/Ray/test3
> > > > > > > >
> > > > > > > > Rank 0 wrote /home/ranjeev/temp/Ray/test3/RayCommand.txt
> > > > > > > >
> > > > > > > > k-mer length: 21
> > > > > > > > Rank 1: assembler memory usage: 24972 KiB
> > > > > > > > Rank 0: assembler memory usage: 24976 KiB
> > > > > > > > Rank 1: assembler memory usage: 90796 KiB
> > > > > > > > Rank 1: Rank= 1 Size= 2 ProcessIdentifier= 12290
> > > > > > > > Rank 0: assembler memory usage: 90804 KiB
> > > > > > > > Rank 0: Rank= 0 Size= 2 ProcessIdentifier= 18150
> > > > > > > > Rank 0: testing the network, please wait...
> > > > > > > >
> > > > > > > > Rank 0 is testing the network [0/1000]
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Ranjeev
> > > > > > > >
> > > > > > > > --
> > > > > > > > "DISCLAIMER: This e-mail and any files transmitted with it
> > > > > > > > ("Message") is intended only for the use of the recipient(s) named
> > > > > > > > above and may contain confidential information.
> > > > > > > > You are hereby notified that the taking of any action in reliance
> > > > > > > > upon, or any review, retransmission, dissemination, distribution,
> > > > > > > > printing or copying of this Message or any part thereof by anyone
> > > > > > > > other than the intended recipient(s) is strictly prohibited. If you
> > > > > > > > have received this Message in error, you should delete this Message
> > > > > > > > immediately and advise the sender by return e-mail. Opinions,
> > > > > > > > conclusions and other information in this Message that do not
> > > > > > > > relate to the official business of University of Malaya shall be
> > > > > > > > understood as neither given nor endorsed by any of the
> > > > > > > > forementioned."
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users