Re: [OMPI devel] IBCM error

2008-07-13 Thread Pavel Shamis (Pasha)

Fixed in https://svn.open-mpi.org/trac/ompi/changeset/18897

Are there any other known IBCM issues?

Regards,
Pasha

Jeff Squyres wrote:
I think you said opposite things: Lenny's command line did not 
specifically ask for ibcm, but it was used anyway.  Lenny -- did you 
explicitly request it somewhere else (e.g., env var or MCA param file)?


I suspect that you did not; I suspect (without looking at the code 
again) that ibcm tried to select itself and failed on the 
ibcm_listen() call, so it fell back to oob.  This might have to be 
another workaround in OMPI, perhaps something like this:


if (ibcm_listen() fails)
    if (ibcm explicitly requested)
        print_warning()
    fail to use ibcm

Has this been filed as a bug at openfabrics.org?  I don't think that I 
filed it when Brad and I were testing on RoadRunner -- it would 
probably be good if someone filed it.




On Jul 13, 2008, at 8:56 AM, Lenny Verkhovsky wrote:


Pasha is right, I didn't disable it.

On 7/13/08, Pavel Shamis (Pasha)  wrote: 
Jeff Squyres wrote:
Brad and I did some scale testing of IBCM and saw this error 
sometimes.  It seemed to happen with higher frequency when you 
increased the number of processes on a single node.


I talked to Sean Hefty about it, but we never figured out a 
definitive cause or solution.  My best guess is that there is 
something wonky about multiple processes simultaneously interacting 
with the IBCM kernel driver from userspace; but I don't know jack 
about kernel stuff, so that's a total SWAG.


Thanks for reminding me of this issue; I admit that I had forgotten 
about it.  :-(  Pasha -- should IBCM not be the default?

It is not the default. I guess Lenny configured it explicitly, didn't he?

Pasha.





On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:

Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] 
failed to ib_cm_listen 10 times: rc=-1, errno=22

Hello world! I'm 0 of 100 on witch2


Best Regards

Lenny.


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel










Re: [OMPI devel] IBCM error

2008-07-13 Thread Jeff Squyres
I think you said opposite things: Lenny's command line did not  
specifically ask for ibcm, but it was used anyway.  Lenny -- did you  
explicitly request it somewhere else (e.g., env var or MCA param file)?


I suspect that you did not; I suspect (without looking at the code  
again) that ibcm tried to select itself and failed on the  
ibcm_listen() call, so it fell back to oob.  This might have to be  
another workaround in OMPI, perhaps something like this:


if (ibcm_listen() fails)
    if (ibcm explicitly requested)
        print_warning()
    fail to use ibcm
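
The pseudocode above could be sketched in C roughly as follows. This is only an illustration of the proposed control flow, not the actual Open MPI code: the enum, the function name `decide_ibcm`, and its parameters are hypothetical, and the listen result is passed in as a plain return code rather than obtained from the real IBCM call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome of trying to select the ibcm connect method. */
typedef enum {
    CPC_USE_IBCM,   /* listen succeeded, ibcm can be used */
    CPC_FALL_BACK   /* ibcm is not usable, caller should try another method (e.g. oob) */
} cpc_decision_t;

/*
 * listen_rc:            return code of the listen attempt (0 means success).
 * explicitly_requested: true if the user asked for ibcm by name.
 * warned:               set to true if a warning was printed.
 *
 * All names here are assumptions made for illustration.
 */
cpc_decision_t decide_ibcm(int listen_rc, bool explicitly_requested, bool *warned)
{
    assert(warned != NULL);
    *warned = false;

    if (listen_rc != 0) {
        /* Only complain when the user actually asked for ibcm. */
        if (explicitly_requested) {
            fprintf(stderr,
                    "warning: ibcm was requested but listen failed (rc=%d)\n",
                    listen_rc);
            *warned = true;
        }
        /* Either way, ibcm is not used for this run. */
        return CPC_FALL_BACK;
    }
    return CPC_USE_IBCM;
}
```

With this shape, a run that never asked for ibcm falls back silently, while a run that did ask for it gets a single warning before falling back.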

Has this been filed as a bug at openfabrics.org?  I don't think that I  
filed it when Brad and I were testing on RoadRunner -- it would  
probably be good if someone filed it.




On Jul 13, 2008, at 8:56 AM, Lenny Verkhovsky wrote:


Pasha is right, I didn't disable it.

On 7/13/08, Pavel Shamis (Pasha)  wrote:  
Jeff Squyres wrote:
Brad and I did some scale testing of IBCM and saw this error  
sometimes.  It seemed to happen with higher frequency when you  
increased the number of processes on a single node.


I talked to Sean Hefty about it, but we never figured out a  
definitive cause or solution.  My best guess is that there is  
something wonky about multiple processes simultaneously interacting  
with the IBCM kernel driver from userspace; but I don't know jack  
about kernel stuff, so that's a total SWAG.


Thanks for reminding me of this issue; I admit that I had forgotten  
about it.  :-(  Pasha -- should IBCM not be the default?

It is not the default. I guess Lenny configured it explicitly, didn't he?

Pasha.





On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:

Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile
/home/USERS/lenny/TESTS/COMPILERS/hostfile
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query]
failed to ib_cm_listen 10 times: rc=-1, errno=22

Hello world! I'm 0 of 100 on witch2


Best Regards

Lenny.





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] IBCM error

2008-07-13 Thread Lenny Verkhovsky
Pasha is right, I didn't disable it.

On 7/13/08, Pavel Shamis (Pasha)  wrote:
>
> Jeff Squyres wrote:
>
>> Brad and I did some scale testing of IBCM and saw this error sometimes.
>>  It seemed to happen with higher frequency when you increased the number of
>> processes on a single node.
>>
>> I talked to Sean Hefty about it, but we never figured out a definitive
>> cause or solution.  My best guess is that there is something wonky about
>> multiple processes simultaneously interacting with the IBCM kernel driver
>> from userspace; but I don't know jack about kernel stuff, so that's a total
>> SWAG.
>>
>> Thanks for reminding me of this issue; I admit that I had forgotten about
>> it.  :-(  Pasha -- should IBCM not be the default?
>>
> It is not the default. I guess Lenny configured it explicitly, didn't he?
>
> Pasha.
>
>
>>
>>
>> On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:
>>
>>  Hi,
>>>
>>> I am getting this error sometimes.
>>>
>>> /home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile
>>> /home/USERS/lenny/TESTS/COMPILERS/hostfile
>>> /home/USERS/lenny/TESTS/COMPILERS/hello
>>> [witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query]
>>> failed to ib_cm_listen 10 times: rc=-1, errno=22
>>> Hello world! I'm 0 of 100 on witch2
>>>
>>>
>>> Best Regards
>>>
>>> Lenny.
>>>
>>>
>>>
>>
>>
>>
>


Re: [OMPI devel] IBCM error

2008-07-13 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:
Brad and I did some scale testing of IBCM and saw this error 
sometimes.  It seemed to happen with higher frequency when you 
increased the number of processes on a single node.


I talked to Sean Hefty about it, but we never figured out a definitive 
cause or solution.  My best guess is that there is something wonky 
about multiple processes simultaneously interacting with the IBCM 
kernel driver from userspace; but I don't know jack about kernel 
stuff, so that's a total SWAG.


Thanks for reminding me of this issue; I admit that I had forgotten 
about it.  :-(  Pasha -- should IBCM not be the default?

It is not the default. I guess Lenny configured it explicitly, didn't he?

Pasha.





On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:


Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] 
failed to ib_cm_listen 10 times: rc=-1, errno=22

Hello world! I'm 0 of 100 on witch2


Best Regards

Lenny.









Re: [OMPI devel] IBCM error

2008-07-13 Thread Jeff Squyres
Brad and I did some scale testing of IBCM and saw this error  
sometimes.  It seemed to happen with higher frequency when you  
increased the number of processes on a single node.


I talked to Sean Hefty about it, but we never figured out a definitive  
cause or solution.  My best guess is that there is something wonky  
about multiple processes simultaneously interacting with the IBCM  
kernel driver from userspace; but I don't know jack about kernel  
stuff, so that's a total SWAG.


Thanks for reminding me of this issue; I admit that I had forgotten  
about it.  :-(  Pasha -- should IBCM not be the default?




On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:


Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile
/home/USERS/lenny/TESTS/COMPILERS/hostfile
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query]
failed to ib_cm_listen 10 times: rc=-1, errno=22

Hello world! I'm 0 of 100 on witch2


Best Regards

Lenny.





--
Jeff Squyres
Cisco Systems



[OMPI devel] IBCM error

2008-07-13 Thread Lenny Verkhovsky
Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile
/home/USERS/lenny/TESTS/COMPILERS/hostfile
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query]
failed to ib_cm_listen 10 times: rc=-1, errno=22
Hello world! I'm 0 of 100 on witch2

Best Regards

Lenny.
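
For readers decoding the log line in this message: on Linux, errno 22 is EINVAL ("Invalid argument"), and the message says the listen call was attempted 10 times before giving up. A minimal sketch of such a bounded retry loop is shown below. It is an assumption about the general pattern only, not the code in btl_openib_connect_ibcm.c; `listen_fn`, `listen_with_retries`, and `always_einval` are hypothetical names, and the callback does not have the real ib_cm_listen signature.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a listen call: returns 0 on success,
 * -1 on failure with errno set. NOT the real ib_cm_listen signature. */
typedef int (*listen_fn)(void *ctx);

/*
 * Try the listen call up to max_tries times.
 * Returns 0 on the first success, otherwise -1, with the errno of the
 * last failed attempt stored in *saved_errno.
 */
int listen_with_retries(listen_fn try_listen, void *ctx, int max_tries,
                        int *saved_errno)
{
    assert(try_listen != NULL && saved_errno != NULL);
    *saved_errno = 0;

    for (int i = 0; i < max_tries; ++i) {
        errno = 0;
        if (try_listen(ctx) == 0) {
            *saved_errno = 0;
            return 0;
        }
        /* Remember why this attempt failed before trying again. */
        *saved_errno = errno;
    }
    return -1;
}

/* Example stub that always fails the way the log above reports. */
int always_einval(void *ctx)
{
    (void)ctx;
    errno = EINVAL;
    return -1;
}
```

Calling `listen_with_retries(always_einval, NULL, 10, &e)` returns -1 with `e` set to EINVAL, which corresponds to the "rc=-1, errno=22" pair in the log on a Linux system.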