Sorry to bring this back up.
We recently had an outage, updated the firmware on our GD4700, and installed a
new Mellanox-provided OFED stack, and the problem has returned.
Specifically, I am able to reproduce the problem with IMB on 4 12-core nodes when it
tries to go to 16 cores. I have verified that en
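For reference, a hedged sketch of the kind of IMB run described above; the
rank count, per-node layout, and binary name are illustrative assumptions
rather than details from the original report:

    # 4 nodes x 12 cores = 48 ranks; IMB steps through increasing process
    # counts, so a hang at 16 ranks would show up partway through the run
    mpirun -np 48 -npernode 12 ./IMB-MPI1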
Brock Palen writes:
> Well, I have a new wrench in this situation.
> We had a power failure at our datacenter that took down our entire system:
> nodes, switch, SM.
> Now I am unable to reproduce the error with oob, default ibflags, etc.
As far as I know, we could still reproduce it. Mail me if you ne
Well, I have a new wrench in this situation.
We had a power failure at our datacenter that took down our entire system:
nodes, switch, SM.
Now I am unable to reproduce the error with oob, default ibflags, etc.
Does this shed any light on the issue? It also makes it hard now to debug the
issue without b
Sorry, typo: 314, not 313.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On May 17, 2011, at 2:02 PM, Brock Palen wrote:
> Thanks, I thought of looking at ompi_info after I sent that note, sigh.
>
> SEND_INPLACE appears to help performance of large
Thanks, I thought of looking at ompi_info after I sent that note, sigh.
SEND_INPLACE appears to improve performance for larger messages over regular
SEND in my synthetic benchmarks. It also appears that SEND_INPLACE still allows
our code to run.
We are working on getting devs access to our system and co
Here is the output of the "ompi_info --param btl openib":
MCA btl: parameter "btl_openib_flags" (current value: <306>, data source: default value)
         BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEN
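For context, a hedged reading of the two flag values that come up in this
thread. SEND=1, PUT=2, and GET=4 are from the ompi_info output above; the
remaining bit names and values (ACK=16, CHECKSUM=32, HETEROGENEOUS_RDMA=256)
are assumptions based on Open MPI help text of this vintage:

    # default value:   306 = 256 + 32 + 16 + 2   (PUT set, SEND clear)
    # suggested value: 305 = 256 + 32 + 16 + 1   (SEND set, PUT clear)

If those assumptions hold, the workaround effectively trades the RDMA PUT
protocol for plain send/receive on the openib BTL.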
On May 16, 2011, at 8:53 AM, Brock Palen wrote:
>
>
>
> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> Just out of curiosity - what happens when you add the following MCA option
>> to your openib runs?
>>
>> -mca btl_openib_flags 305
>
> You, sir, found the magic co
On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
> Hi,
>
> Just out of curiosity - what happens when you add the following MCA option to
> your openib runs?
>
> -mca btl_openib_flags 305
You, sir, found the magic combination.
I verified this lets IMB and CRASH progress past their lock
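For anyone trying this, a hedged example of a full command line with the
workaround; the rank count, BTL list, and benchmark binary are illustrative
assumptions, not taken from the thread:

    mpirun -np 48 --mca btl openib,sm,self --mca btl_openib_flags 305 ./IMB-MPI1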
Hi,
Just out of curiosity - what happens when you add the following MCA option to
your openib runs?
-mca btl_openib_flags 305
Thanks,
Samuel Gutierrez
Los Alamos National Laboratory
On May 13, 2011, at 2:38 PM, Brock Palen wrote:
> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>
>> Jeff Squ
On May 13, 2011, at 4:09 PM, Dave Love wrote:
> Jeff Squyres writes:
>
>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>
>>> We can reproduce it with IMB. We could provide access, but we'd have to
>>> negotiate with the owners of the relevant nodes to give you interactive
>>> access to them.
Jeff Squyres writes:
> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>
>> We can reproduce it with IMB. We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them. Maybe Brock's would be more accessible? (If you
>> con
I am pretty sure MTLs and BTLs are very different, but just as a note:
this user's code (Crash) hangs at MPI_Allreduce() on
openib
but runs on:
tcp
psm (an MTL, different hardware)
Putting it out there in case it has any bearing. Otherwise ignore.
Brock Palen
www.umich.edu/~brockp
Center f
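As a hedged sketch of the A/B transport comparison described above; the rank
count and the ./crash binary name are stand-ins, and the last line assumes
PSM-capable (QLogic) hardware:

    # openib BTL (hangs in MPI_Allreduce per the report)
    mpirun -np 64 --mca btl openib,sm,self ./crash
    # tcp BTL (reported to run)
    mpirun -np 64 --mca btl tcp,sm,self ./crash
    # psm MTL, different hardware (reported to run)
    mpirun -np 64 --mca pml cm --mca mtl psm ./crash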
On May 12, 2011, at 10:13 AM, Jeff Squyres wrote:
> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>
>> We can reproduce it with IMB. We could provide access, but we'd have to
>> negotiate with the owners of the relevant nodes to give you interactive
>> access to them. Maybe Brock's would be mor
On May 11, 2011, at 3:21 PM, Dave Love wrote:
> We can reproduce it with IMB. We could provide access, but we'd have to
> negotiate with the owners of the relevant nodes to give you interactive
> access to them. Maybe Brock's would be more accessible? (If you
> contact me, I may not be able to
On May 11, 2011, at 4:27 PM, Dave Love wrote:
> Ralph Castain writes:
>
>> I'll go back to my earlier comments. Users always claim that their
>> code doesn't have the sync issue, but it has proved to help more often
>> than not, and costs nothing to try,
>
> Could you point to that post, or te
Ralph Castain writes:
> I'll go back to my earlier comments. Users always claim that their
> code doesn't have the sync issue, but it has proved to help more often
> than not, and costs nothing to try,
Could you point to that post, or tell us what to try exactly, given
we're running IMB? Thanks
Jeff Squyres writes:
> We had a user-reported issue of some hangs that the IB vendors have
> been unable to replicate in their respective labs. We *suspect* that
> it may be an issue with the oob openib CPC, but that code is pretty
> old and pretty mature, so all of us would be at least somewhat
Sent from my iPad
On May 11, 2011, at 2:05 PM, Brock Palen wrote:
> On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
>
>> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>>
We managed to have another user hit the bug that causes collectives (this
time MPI_Bcast()) to hang on IB that
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>
>>> We managed to have another user hit the bug that causes collectives (this
>>> time MPI_Bcast()) to hang on IB that was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could someo
On May 3, 2011, at 6:42 AM, Dave Love wrote:
>> We managed to have another user hit the bug that causes collectives (this
>> time MPI_Bcast()) to hang on IB that was fixed by setting:
>>
>> btl_openib_cpc_include rdmacm
>
> Could someone explain this? We also have problems with collective han
Sorry for the delay on this -- it looks like the problem is caused by messages
like this (from your first message):
[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port
RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID
where you want to use i
Yeah, we have run into more issues, with rdmacm not being available on all of
our hosts. So it would be nice to know what we can do to test whether a host
supports rdmacm.
Example:
--
No OpenFabrics connection schemes rep
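A hedged sketch of checks one could run on a suspect host to see whether
rdmacm has what it needs there; the interface name ib0 is an assumption:

    # rdmacm needs IPoIB, so the IPoIB interface should be up with an address
    ip addr show ib0
    # the RDMA CM kernel modules and user device node should be present
    lsmod | grep rdma_cm
    ls -l /dev/infiniband/rdma_cm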
Brock Palen writes:
> We managed to have another user hit the bug that causes collectives (this
> time MPI_Bcast()) to hang on IB that was fixed by setting:
>
> btl_openib_cpc_include rdmacm
Could someone explain this? We also have problems with collective hangs
with openib/mlx4 (specifically
Attached is the output of running with verbose 100, mpirun --mca
btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi
[nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl
components
[nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl
compo
On Apr 27, 2011, at 10:02 AM, Brock Palen wrote:
> Argh, our messed-up environment with three generations of InfiniBand bit us.
> Setting openib_cpc_include to rdmacm causes IB to not be used on our old DDR
> IB on some of our hosts. Note that jobs will never run across our old DDR IB
> and our
Argh, our messed-up environment with three generations of InfiniBand bit us.
Setting openib_cpc_include to rdmacm causes IB to not be used on our old DDR IB
on some of our hosts. Note that jobs will never run across our old DDR IB and
our new QDR stuff where rdmacm does work.
I am doing some te
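One hedged possibility for a mixed DDR/QDR setup like this, assuming
btl_openib_cpc_include accepts a comma-separated list of connection managers
(worth verifying) and that oob still behaves on the old DDR fabric:

    mpirun --mca btl_openib_cpc_include rdmacm,oob ...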
On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote:
>
> On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
>
>> Given that part of our cluster is TCP only, openib wouldn't even start up on
>> those hosts
>
> That is correct - it would have no impact on those hosts
>
>> and this would be ignored on
On Apr 21, 2011, at 4:41 PM, Brock Palen wrote:
> Given that part of our cluster is TCP only, openib wouldn't even start up on
> those hosts
That is correct - it would have no impact on those hosts
> and this would be ignored on hosts with IB adaptors?
Ummm...not sure I understand this one.
Given that part of our cluster is TCP only, openib wouldn't even start up on
those hosts, and this would be ignored on hosts with IB adaptors?
Just checking thanks!
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Apr 21, 2011, at 6:21 PM, Jeff
Over IB, I'm not sure there is much of a drawback. It might be slightly slower
to establish QPs, but I don't think that matters much.
Over iWARP, rdmacm can cause connection storms as you scale to thousands of MPI
processes.
On Apr 20, 2011, at 5:03 PM, Brock Palen wrote:
> We managed to ha