Max,

The recursive call should not be an issue: MPI_Allreduce is a blocking operation, so you cannot recurse before the previous call completes.
What is the size of the data exchanged in the MPI_Alltoall?

  George.

On Sep 30, 2013, at 17:09, Max Staufer <max.stau...@gmx.net> wrote:

> Well, I haven't tried 1.7.2 yet, but to elaborate on the problem a little
> more: the growth happens when we use an MPI_ALLREDUCE in a recursive
> subroutine call, that is, in Fortran 90 terms, a subroutine that calls
> itself again and is explicitly marked RECURSIVE so that this works
> properly. Apart from that, nothing is special about this routine. Is it
> possible that the F77 interface in Open MPI cannot cope with recursion?
>
> Max
>
> On 13.09.13 17:18, Rolf vandeVaart wrote:
>> Yes, it appears the send_requests list is the one that is growing. This
>> list holds the send request structures that are in use. After a send
>> completes, the send request is supposed to be returned to this list and
>> then re-used.
>>
>> With 7 processes, it had reached a size of 16,324 send requests in use.
>> With 8 processes, it had reached a size of 16,708. Each send request is
>> 720 bytes (872 in a debug build), so doing the math we have consumed
>> about 12 Mbytes.
>>
>> Setting some type of bound will not fix this issue. Something else is
>> going on here that is causing this problem. I know you described the
>> problem earlier on, but maybe you can explain it again? How many
>> processes? What type of cluster? One other thought is to try Open MPI
>> 1.7.2 and see if you still see the problem. Maybe someone else has
>> suggestions too.
>>
>> Rolf
>>
>> PS: For those who missed a private email, I had Max add some
>> instrumentation so we could see which list was growing. We now know it
>> is the mca_pml_base_send_requests list.
>>
>>> -----Original Message-----
>>> From: Max Staufer [mailto:max.stau...@gmx.net]
>>> Sent: Friday, September 13, 2013 7:06 AM
>>> To: Rolf vandeVaart; de...@open-mpi.org
>>> Subject: Re: [OMPI devel] Nearly unlimited growth of pml free list
>>>
>>> Hi Rolf,
>>>
>>> I applied your patch. The full output is rather big (even gzipped it is
>>> more than 10 MB), which is not good for the mailing list, but the head
>>> and tail are below for a 7- and an 8-processor run. It seems the send
>>> requests are growing fast, about 4000-fold in just 10 minutes.
>>>
>>> Do you know of a method to bound the list so that it does not grow
>>> excessively?
>>>
>>> thanks
>>>
>>> Max
>>>
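For reference, the pattern described above (a Fortran 90 subroutine that is
marked RECURSIVE and performs one blocking MPI_ALLREDUCE per recursion level)
might look roughly like the minimal sketch below. The names and the recursion
depth are hypothetical, not taken from Max's code; each level goes through the
Fortran bindings into the ob1 PML and draws a send request from the same free
list that is growing in the logs below.

    program recurse_allreduce
      implicit none
      include 'mpif.h'
      double precision :: val
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      val = dble(rank + 1)

      ! Recurse a few hundred levels; each level does one blocking
      ! MPI_ALLREDUCE on a single DOUBLE PRECISION value.
      call accumulate(1, 500, val)

      call MPI_FINALIZE(ierr)

    contains

      ! Marked RECURSIVE so that the routine calling itself is legal
      ! Fortran 90, as Max describes for his application.
      recursive subroutine accumulate(level, maxlevel, x)
        integer, intent(in)             :: level, maxlevel
        double precision, intent(inout) :: x
        double precision                :: s
        integer                         :: ierr2

        call MPI_ALLREDUCE(x, s, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                           MPI_COMM_WORLD, ierr2)
        x = s
        if (level < maxlevel) call accumulate(level + 1, maxlevel, x)
      end subroutine accumulate

    end program recurse_allreduce

Max notes elsewhere in the thread that recursion alone has not been enough to
reproduce the growth, so a toy like this only pins down the code path; it is
not a confirmed reproducer.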
>>> 7 Processor run
>>> ------------------
>>> [gpu207.dev-env.lan:11236] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11236]
>>> [gpu207.dev-env.lan:11236] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11236]
>>> [gpu207.dev-env.lan:11236] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11236]
>>> [gpu207.dev-env.lan:11236] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11236]
>>> [gpu207.dev-env.lan:11236] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>>
>>> ......
>>>
>>> [gpu207.dev-env.lan:11243] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_requests, numAlloc=16324, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11243]
>>> [gpu207.dev-env.lan:11243] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11243] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_requests, numAlloc=16324, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11243]
>>> [gpu207.dev-env.lan:11243] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11243] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_requests, numAlloc=16324, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11243]
>>> [gpu207.dev-env.lan:11243] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11243] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_requests, numAlloc=16324, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11243]
>>> [gpu207.dev-env.lan:11243] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11243] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_requests, numAlloc=16324, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11243]
>>> [gpu207.dev-env.lan:11243] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11243] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=send_requests, numAlloc=16324, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11243] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>>
>>>
>>> 8 Processor run
>>> --------------------
>>>
>>> [gpu207.dev-env.lan:11315] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11315] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11315]
>>> [gpu207.dev-env.lan:11315] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11315] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11315]
>>> [gpu207.dev-env.lan:11315] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11315] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11315]
>>> [gpu207.dev-env.lan:11315] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11315] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11315]
>>> [gpu207.dev-env.lan:11315] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11315] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=send_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11315] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>>
>>>
>>> ...
>>>
>>> [gpu207.dev-env.lan:11322] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11322] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_requests, numAlloc=16708, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11322]
>>> [gpu207.dev-env.lan:11322] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11322] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_requests, numAlloc=16708, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11322]
>>> [gpu207.dev-env.lan:11322] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11322] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_requests, numAlloc=16708, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11322]
>>> [gpu207.dev-env.lan:11322] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11322] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_requests, numAlloc=16708, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>> [gpu207.dev-env.lan:11322]
>>> [gpu207.dev-env.lan:11322] Iteration = 0 sleeping
>>> [gpu207.dev-env.lan:11322] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=send_requests, numAlloc=16708, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] Freelist=recv_requests, numAlloc=68, maxAlloc=-1
>>> [gpu207.dev-env.lan:11322] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
>>>
>>>
>>> On 12.09.2013 17:04, Rolf vandeVaart wrote:
>>>> Can you apply this patch and try again? It will print out the sizes of
>>>> the free lists after every 100 calls into mca_pml_ob1_send. It would be
>>>> interesting to see which one is growing. This might give us some clues.
>>>>
>>>> Rolf
>>>>
>>>>> -----Original Message-----
>>>>> From: Max Staufer [mailto:max.stau...@gmx.net]
>>>>> Sent: Thursday, September 12, 2013 3:53 AM
>>>>> To: Rolf vandeVaart
>>>>> Subject: Re: [OMPI devel] Nearly unlimited growth of pml free list
>>>>>
>>>>> Hi Rolf,
>>>>>
>>>>> The heap snapshots I take tell me where and when the memory was
>>>>> allocated, and a simple source trace tells me that the calling routine
>>>>> was mca_pml_ob1_send and that all of the ~100000 single allocations
>>>>> during the run were made because of an MPI_ALLREDUCE call in exactly
>>>>> one place in the code. The tool I use for this is MemoryScape, but I
>>>>> think Valgrind can tell you the same thing. However, I have not been
>>>>> able to reproduce the problem in a simpler program yet. I suspect it
>>>>> has something to do with the locking mechanism of the list elements; I
>>>>> don't know enough about Open MPI to comment on that, but it looks like
>>>>> the list is growing because all of its elements are locked.
>>>>>
>>>>> Any help is really appreciated.
>>>>>
>>>>> Max
>>>>>
>>>>> PS: If I mimic ALLREDUCE with 2*Nproc SEND and RECV calls (aggregating
>>>>> on proc 0 and then sending the result back out to all procs), I get
>>>>> the same kind of behaviour.
>>>>>
>>>>> On 11.09.2013 17:12, Rolf vandeVaart wrote:
>>>>>> Hi Max:
>>>>>> You say that the function keeps "allocating memory in the pml free
>>>>>> list." How do you know that is happening? Do you know which free list
>>>>>> it is happening on? There are something like 8 free lists associated
>>>>>> with the ob1 PML, so it would be interesting to know which one you
>>>>>> observe growing.
>>>>>>
>>>>>> Rolf
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Max
>>>>>>> Staufer
>>>>>>> Sent: Wednesday, September 11, 2013 10:23 AM
>>>>>>> To: de...@open-mpi.org
>>>>>>> Subject: [OMPI devel] Nearly unlimited growth of pml free list
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> As I already asked on the users list (where I was told it is not the
>>>>>>> right place to ask), I came across a misbehaviour of Open MPI
>>>>>>> versions 1.4.5 and 1.6.5 alike: the mca_pml_ob1_send function keeps
>>>>>>> allocating memory in the PML free list. It does that indefinitely;
>>>>>>> in my case the list grew to about 100 GB.
>>>>>>>
>>>>>>> I can control the maximum using the pml_ob1_free_list_max parameter,
>>>>>>> but then the application just stops working when this number of
>>>>>>> entries in the list is reached.
>>>>>>>
>>>>>>> The interesting part is that the growth only happens in a single
>>>>>>> place in the code, which is a RECURSIVE SUBROUTINE, and the called
>>>>>>> function is an MPI_ALLREDUCE(... MPI_SUM).
>>>>>>>
>>>>>>> Apparently it is not easy to create a test program that shows the
>>>>>>> same behaviour; recursion alone is not enough.
>>>>>>>
>>>>>>> Is there an MCA parameter that allows limiting the total list size
>>>>>>> without making the application stop? Or is there a way to enforce
>>>>>>> the lock on the free list entries?
>>>>>>>
>>>>>>> Thanks for all the help
>>>>>>>
>>>>>>> Max
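For context, the hand-rolled reduction Max mentions in his PS above (aggregate
on proc 0, then send the result back out to every rank) might look roughly
like the sketch below; the routine name and argument layout are hypothetical.
Each call issues about 2*(Nproc-1) point-to-point messages, all of which go
through mca_pml_ob1_send and draw from the same send_requests free list, which
is consistent with it showing the same kind of growth.

    ! Hypothetical stand-in for MPI_ALLREDUCE(MPI_SUM) on one value:
    ! rank 0 receives a contribution from every other rank, sums them,
    ! and sends the total back out.
    subroutine poor_mans_allreduce(x, comm)
      implicit none
      include 'mpif.h'
      double precision, intent(inout) :: x
      integer, intent(in)             :: comm
      double precision :: tmp, total
      integer :: rank, nproc, i, ierr
      integer :: status(MPI_STATUS_SIZE)

      call MPI_COMM_RANK(comm, rank, ierr)
      call MPI_COMM_SIZE(comm, nproc, ierr)

      if (rank == 0) then
        ! Gather and sum the contributions, then broadcast the result
        ! with plain point-to-point sends.
        total = x
        do i = 1, nproc - 1
          call MPI_RECV(tmp, 1, MPI_DOUBLE_PRECISION, i, 0, comm, &
                        status, ierr)
          total = total + tmp
        end do
        do i = 1, nproc - 1
          call MPI_SEND(total, 1, MPI_DOUBLE_PRECISION, i, 1, comm, ierr)
        end do
        x = total
      else
        call MPI_SEND(x, 1, MPI_DOUBLE_PRECISION, 0, 0, comm, ierr)
        call MPI_RECV(x, 1, MPI_DOUBLE_PRECISION, 0, 1, comm, &
                      status, ierr)
      end if
    end subroutine poor_mans_allreduce

Dropped in place of the MPI_ALLREDUCE call in the earlier sketch, this would
allow comparing the free-list behaviour of the collective against plain sends
and receives.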
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel