I ran all of the Pallas tests and the same error happens: we try to malloc
0 bytes and we hang somewhere. Let me explain what I found. First of all,
most of the tests seem to work perfectly (at least with the PTLs/BTLs I was
able to run: sm, tcp, mx). The deadlock as well as the memory allocation
problem happens in the reduce_scatter operation.

Problem 1: allocating 0 bytes
- It's not a datatype problem. The datatype returns the correct extent,
  true_extent and lb. The problem is that we miss one case in the collective
  communications: what about the case where the user does a reduce_scatter
  with all the counts set to zero? We check that the counts are valid, and
  they are. Then we add them together and, as expected, a sum of zeros is
  zero. So at line 79 of coll_basic_reduce_scatter we will allocate zero
  bytes, because the extent and the true_extent of the MPI_FLOAT datatype
  are equal and (count - 1) is -1! There is a simple fix for this problem:
  if count == 0 then free_buffer should be set to NULL (as we don't send or
  receive anything in this buffer, it will just work fine at the PTL/PML
  level). A sketch of this guard follows this list.
- The same error can happen in the reduce function if the count is zero.
  I will protect this function too.
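For reference, here is a minimal sketch of the kind of guard I have in mind.
The variable names and the sizing formula mirror the usual coll_basic
conventions, but this is only an illustration, not the actual Open MPI
source:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical helper: allocate the temporary reduction buffer only
     * when count > 0; otherwise leave both pointers NULL. */
    static char *alloc_reduce_temp(int count, ptrdiff_t extent,
                                   ptrdiff_t true_extent, ptrdiff_t lb,
                                   char **pml_buffer)
    {
        char *free_buffer = NULL;
        *pml_buffer = NULL;

        if (count > 0) {
            /* With count == 0 this size would be <= 0, which is exactly
             * what produced the "Request for 0 bytes" warning. */
            free_buffer = (char *) malloc(true_extent + (count - 1) * extent);
            if (NULL != free_buffer) {
                *pml_buffer = free_buffer - lb;
            }
        }
        /* count == 0: nothing is sent or received through this buffer,
         * so the PML never dereferences the NULL pointer. */
        return free_buffer;
    }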

Problem 2: hanging
- Somehow a strange optimization got inside the scatterv function: in the
  case where the sender has to send zero bytes, it completely skips the send
  operation. But the receiver still expects to get a message. Anyway, this
  optimization is not correct; all messages have to be sent. I know that
  it can (slightly) increase the time for the collective, but it gives us a
  simple way of checking the correctness of the global communication (as
  the PML handles the truncate case). Patch is on the way; a sketch of the
  idea is below.
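As an illustration, a minimal linear scatterv root loop with the fix applied
could look like the following. The real coll_basic code goes through the PML
send interface and its own collective tags, so plain MPI_Send with tag 0 is
used here only to show the idea:

    #include <mpi.h>

    /* Sketch: the root posts a send to every other rank, including the
     * zero-count ones, so that the matching receive always completes. */
    static int scatterv_root_sketch(void *sendbuf, const int *counts,
                                    const int *displs, MPI_Datatype dtype,
                                    int root, MPI_Comm comm)
    {
        int i, size, err;
        MPI_Aint extent;

        MPI_Comm_size(comm, &size);
        MPI_Type_extent(dtype, &extent);

        for (i = 0; i < size; ++i) {
            if (i == root) {
                continue;  /* the root keeps its own piece locally */
            }
            /* Do NOT skip counts[i] == 0: the receiver has posted a
             * receive and the PML handles the zero-byte/truncate cases. */
            err = MPI_Send((char *) sendbuf + displs[i] * extent,
                           counts[i], dtype, i, 0, comm);
            if (MPI_SUCCESS != err) {
                return err;
            }
        }
        return MPI_SUCCESS;
    }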

Once these two problems are corrected, we pass all the Pallas MPI1 tests. I
ran the tests with the PMLs ob1, teg and uniq and the PTLs/BTLs sm, tcp,
gm (PTL) and mx (PTL), with 2 and 8 processes.

  george.

PS: the patches will be committed soon.

> On Aug 9, 2005, at 1:53 PM, Galen Shipman wrote:
>
>       Hi Sridhar,
>
> I have committed changes that allow you to set the debug verbosity,
>
> OMPI_MCA_btl_base_debug
> 0 - no debug output
> 1 - standard debug output
> 2 - very verbose debug output
>
> Also we have run the Pallas tests and are not able to reproduce your 
> failures. We do see a warning in
> the Reduce test but it does not hang and runs to completion. Attached is a 
> simple ping pong program,
> try running this and let us know the results.
>
> Thanks,
>
> Galen
>
>
>
> <mpi-ping.c>
>
> On Aug 9, 2005, at 8:15 AM, Sridhar Chirravuri wrote:
>
>
>       The same kind of output while running Pallas "pingpong" test.
>
> -Sridhar
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
> Behalf Of Sridhar Chirravuri
> Sent: Tuesday, August 09, 2005 7:44 PM
> To: Open MPI Developers
> Subject: Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI
>
>
> I have run sendrecv function in Pallas but it failed to run it. Here is
> the output
>
> [root@micrompi-2 SRC_PMB]# mpirun -np 2 PMB-MPI1 sendrecv
> Could not join a running, existing universe
> Establishing a new one named: default-universe-5097
> [0,1,1][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub
> [0,1,1][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub
>
>
> [0,1,0][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub
>
> [0,1,0][btl_mvapi.c:130:mca_btl_mvapi_del_procs] Stub
>
> [0,1,0][btl_mvapi_endpoint.c:542:mca_btl_mvapi_endpoint_send] Connection
> to endpoint closed ... connecting ...
> [0,1,0][btl_mvapi_endpoint.c:318:mca_btl_mvapi_endpoint_start_connect]
> Initialized High Priority QP num = 263177, Low Priority QP num = 263178,
> LID = 785
>
> [0,1,0][btl_mvapi_endpoint.c:190:mca_btl_mvapi_endpoint_send_connect_req
> ] Sending High Priority QP num = 263177, Low Priority QP num = 263178,
> LID = 785[0,1,0][btl_mvapi_endpoint.c:542:mca_btl_mvapi_endpoint_send]
> Connection to endpoint closed ... connecting ...
> [0,1,0][btl_mvapi_endpoint.c:318:mca_btl_mvapi_endpoint_start_connect]
> Initialized High Priority QP num = 263179, Low Priority QP num = 263180,
> LID = 786
>
> [0,1,0][btl_mvapi_endpoint.c:190:mca_btl_mvapi_endpoint_send_connect_req
> ] Sending High Priority QP num = 263179, Low Priority QP num = 263180,
> LID = 786#---------------------------------------------------
> #    PALLAS MPI Benchmark Suite V2.2, MPI-1 part
> #---------------------------------------------------
> # Date       : Tue Aug  9 07:11:25 2005
> # Machine    : x86_64# System     : Linux
> # Release    : 2.6.9-5.ELsmp
> # Version    : #1 SMP Wed Jan 5 19:29:47 EST 2005
>
> #
> # Minimum message length in bytes:   0
> # Maximum message length in bytes:   4194304
> #
> # MPI_Datatype                   :   MPI_BYTE
> # MPI_Datatype for reductions    :   MPI_FLOAT
> # MPI_Op                         :   MPI_SUM
> #
> #
>
> # List of Benchmarks to run:
>
> # Sendrecv
> [0,1,1][btl_mvapi_endpoint.c:368:mca_btl_mvapi_endpoint_reply_start_conn
> ect] Initialized High Priority QP num = 263177, Low Priority QP num =
> 263178,  LID = 777
>
> [0,1,1][btl_mvapi_endpoint.c:266:mca_btl_mvapi_endpoint_set_remote_info]
> Received High Priority QP num = 263177, Low Priority QP num 263178,  LID
> = 785
>
> [0,1,1][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to init..Qp
> 7080096[0,1,1][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_q
> uery] Modified to RTR..Qp
> 7080096[0,1,1][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_q
> uery] Modified to RTS..Qp 7080096
>
> [0,1,1][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to init..Qp 7240736
> [0,1,1][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to RTR..Qp
> 7240736[0,1,1][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_q
> uery] Modified to RTS..Qp 7240736
> [0,1,1][btl_mvapi_endpoint.c:190:mca_btl_mvapi_endpoint_send_connect_req
> ] Sending High Priority QP num = 263177, Low Priority QP num = 263178,
> LID = 777
> [0,1,0][btl_mvapi_endpoint.c:266:mca_btl_mvapi_endpoint_set_remote_info]
> Received High Priority QP num = 263177, Low Priority QP num 263178,  LID
> = 777
> [0,1,0][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to init..Qp 7081440
> [0,1,0][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to RTR..Qp 7081440
> [0,1,0][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to RTS..Qp 7081440
> [0,1,0][btl_mvapi_endpoint.c:756:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to init..Qp 7241888
> [0,1,0][btl_mvapi_endpoint.c:791:mca_btl_mvapi_endpoint_qp_init_query]
> Modified to RTR..Qp
> 7241888[0,1,0][btl_mvapi_endpoint.c:814:mca_btl_mvapi_endpoint_qp_init_q
> uery] Modified to RTS..Qp 7241888
> [0,1,1][btl_mvapi_component.c:523:mca_btl_mvapi_component_progress] Got
> a recv completion
>
>
> Thanks
> -Sridhar
>
>
>
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
> Behalf Of Brian Barrett
> Sent: Tuesday, August 09, 2005 7:35 PM
> To: Open MPI Developers
> Subject: Re: [O-MPI devel] Fwd: Regarding MVAPI Component in Open MPI
>
> On Aug 9, 2005, at 8:48 AM, Sridhar Chirravuri wrote:
>
>
>       Does r6774 have a lot of changes related to the 3rd generation
> point-to-point? I am trying to run some benchmark tests (e.g.
> Pallas) with the Open MPI stack and just want to compare the
> performance figures with MVAPICH 095 and MVAPICH 092.
>
> In order to use 3rd generation p2p communication, I have added the
> following line in the /openmpi/etc/openmpi-mca-params.conf
>
> pml=ob1
>
> I also exported (as double check) OMPI_MCA_pml=ob1.
>
> Then, I have tried running on the same machine. My machine has got
> 2 processors.
>
> mpirun -np 2 ./PMB-MPI1
>
> I still see the following lines
>
> Request for 0 bytes (coll_basic_reduce_scatter.c, 79)
> Request for 0 bytes (coll_basic_reduce.c, 193)
> Request for 0 bytes (coll_basic_reduce_scatter.c, 79)
> Request for 0 bytes (coll_basic_reduce.c, 193)
>
>
> These errors are coming from the collective routines, not the PML/BTL
> layers.  It looks like the reduction codes are trying to call malloc
> (0), which doesn't work so well.  We'll take a look as soon as we
> can.  In the mean time, can you just not run the tests that call the
> reduction collectives?
>
> Brian
>
>
> -- 
>    Brian Barrett
>    Open MPI developer
>    http://www.open-mpi.org/
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
>
>

"We must accept finite disappointment, but we must never lose infinite
hope."
                                  Martin Luther King

