Sorry,
I checked it without sm.

Please ignore this mail.

On Thu, Jun 19, 2008 at 4:32 PM, Lenny Verkhovsky <
lenny.verkhov...@gmail.com> wrote:

> Hi,
> I found what caused the problem in both cases.
>
> --- ompi/mca/btl/sm/btl_sm.c    (revision 18675)
> +++ ompi/mca/btl/sm/btl_sm.c    (working copy)
> @@ -812,7 +812,7 @@
>       */
>      MCA_BTL_SM_FIFO_WRITE(endpoint, endpoint->my_smp_rank,
>                            endpoint->peer_smp_rank, frag->hdr, false, rc);
> -    return (rc < 0 ? rc : 1);
> +   return OMPI_SUCCESS;
>  }
>
> I am just not sure if it's OK.
>
> Lenny.
>   On Wed, Jun 18, 2008 at 3:21 PM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> Hi,
>> I am not sure if it is related,
>> but I applied your patch (r18667) to r18656 (the one before NUMA),
>> together with disabling sendi.
>> The result is still the same (it hangs).
>>
>>
>>
>>
>>  On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosi...@eecs.utk.edu>
>> wrote:
>>
>>> Lenny,
>>>
>>> I guess you're running the latest version. If not, please update; Galen
>>> and I corrected some bugs last week. If you're using the latest (and
>>> greatest) then ... well, I imagine there is at least one bug left.
>>>
>>> There is a quick test you can do. In btl_sm.c, in the module structure
>>> at the beginning of the file, please replace the sendi function with NULL.
>>> If this fixes the problem, then at least we know that it's an sm send
>>> immediate problem.
>>>
>>>  Thanks,
>>>    george.
>>>
>>>
>>> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>>>
>>>> Hi, George,
>>>>
>>>> I have a problem running the BW benchmark on a 100-rank cluster after r18551.
>>>> The BW benchmark is mpi_p, which runs mpi_bandwidth with a 100K message
>>>> size between all pairs.
>>>>
>>>>
>>>> #mpirun -np 100 -hostfile hostfile_w  ./mpi_p_18549 -t bw -s 100000
>>>> BW (100) (size min max avg)  100000     576.734030      2001.882416     1062.698408
>>>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>>>> mpirun: killing job...
>>>> (it hangs even after 10 hours).
>>>>
>>>>
>>>> It doesn't happen if I run with --bynode or with only the openib,self BTLs.
>>>>
>>>>
>>>> Lenny.
>>>>
>>>
>>>
>>
>
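
For anyone following the thread: the sendi test George suggested above amounts to clearing one optional slot in a table of function pointers. Below is a minimal, self-contained sketch of that idea. It is not the Open MPI source. Every name in it (demo_btl_module, demo_send, demo_sendi, demo_dispatch) is hypothetical, and it assumes the caller falls back to the regular send path when the sendi slot is NULL, which is what the test relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a BTL module: a table of function pointers.
 * These names are illustrative only and are not the Open MPI API. */
typedef int (*demo_send_fn)(const void *buf, size_t len);

struct demo_btl_module {
    demo_send_fn send;   /* regular send path, always present */
    demo_send_fn sendi;  /* optional "send immediate" fast path, may be NULL */
};

/* Call counters so the path taken can be observed. */
int demo_send_calls = 0;
int demo_sendi_calls = 0;

int demo_send(const void *buf, size_t len)
{
    (void)buf; (void)len;
    demo_send_calls++;
    return 0;
}

int demo_sendi(const void *buf, size_t len)
{
    (void)buf; (void)len;
    demo_sendi_calls++;
    return 0;
}

/* Assumed caller behaviour: use the fast path only if the module provides
 * one, otherwise fall back to the regular send. */
int demo_dispatch(const struct demo_btl_module *m, const void *buf, size_t len)
{
    assert(m != NULL && m->send != NULL);
    if (m->sendi != NULL) {
        return m->sendi(buf, len);
    }
    return m->send(buf, len);
}

/* Module as shipped, and the same module with sendi replaced by NULL. */
const struct demo_btl_module demo_with_sendi    = { demo_send, demo_sendi };
const struct demo_btl_module demo_without_sendi = { demo_send, NULL };
```

Dispatching through demo_without_sendi never reaches demo_sendi, so if a hang disappears with the slot set to NULL, the fast path is the likely suspect; if it persists (as reported above), the cause is probably elsewhere.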