Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas

2017-02-24 Thread Sven Oehme
It's more likely that you are running out of verbsRdmasPerNode, which is the
top limit across all connections for a given node.
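
As a hedged illustration (standard mmlsconfig/mmchconfig usage; the node
list is a placeholder, and verbs settings generally need a GPFS restart on
the affected nodes to take effect), checking and raising that limit would
look something like:

   # show the configured and effective values
   mmlsconfig verbsRdmasPerNode
   mmdiag --config | grep -i verbsRdmasPerNode

   # raise the node-wide limit on the NSD servers (illustrative value,
   # hypothetical node names)
   mmchconfig verbsRdmasPerNode=3074 -N nsd01,nsd02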

Sven


On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote:

Interesting, thanks Sven!

Could the "resources" I'm running out of include NSD server queues?

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas

2017-02-24 Thread Aaron Knister

Interesting, thanks Sven!

Could the "resources" I'm running out of include NSD server queues?
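
If it helps, a hedged way to look at those queues directly (mmfsadm is an
unsupported diagnostic tool and its output format varies by release, so
treat this as a sketch):

   # on the NSD server: dump the NSD queue state and look for queues
   # with many pending requests
   mmfsadm dump nsd | less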




--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas

2017-02-23 Thread Sven Oehme
All this waiter shows is that you have more RDMA operations in flight than
the node or connection can currently serve. The reasons can be
misconfiguration, or simply running out of resources on the node, not the
connection. With the latest code you shouldn't see this anymore for node
limits, because the system automatically adjusts the maximum number of RDMAs
according to the node's capabilities:

You should see messages in your mmfslog like:

2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with verbsRdmaCm=no
verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes verbsRdmaUseCompVectors=yes
2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so (version
>= 1.1) loaded and initialized.
2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased
from 3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes.
2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1
transport IB link  IB NUMA node 16 pkey[0] 0x gid[0] subnet
0xFEC00013 id 0xE41D2D0300FDB9CD state ACTIVE
2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1
transport IB link  IB NUMA node 16 pkey[0] 0x gid[0] subnet
0xFEC00015 id 0xE41D2D0300FDB9CC state ACTIVE
2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1
transport IB link  IB NUMA node  1 pkey[0] 0x gid[0] subnet
0xFEC00013 id 0xE41D2D0300FDB751 state ACTIVE
2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1
transport IB link  IB NUMA node  1 pkey[0] 0x gid[0] subnet
0xFEC00015 id 0xE41D2D0300FDB750 state ACTIVE
2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1
transport IB link  IB NUMA node  0 pkey[0] 0x gid[0] subnet
0xFEC00013 id 0xE41D2D0300FDB78D state ACTIVE
2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1
transport IB link  IB NUMA node  0 pkey[0] 0x gid[0] subnet
0xFEC00015 id 0xE41D2D0300FDB78C state ACTIVE
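
A hedged way to confirm what the daemon actually settled on at runtime
(mmdiag --config shows the effective daemon settings):

   # effective verbs settings as seen by the running daemon; the
   # verbsRdmasPerNode shown here should match the optimizer message above
   mmdiag --config | grep -i verbs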

We want to eliminate all these configurable limits eventually, but that
takes time. As you can see above, we make progress with each release :-)

Sven




On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister wrote:

> On a particularly heavily loaded NSD server I'm seeing a lot of these
> messages:
>
> 0x708B63E0 (  15539) waiting 0.004139456 seconds, NSDThread: on
> ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason
> 'waiting for conn rdmas < conn maxrdmas'
> 0x708EED80 (  15584) waiting 0.004075718 seconds, NSDThread: on
> ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason
> 'waiting for conn rdmas < conn maxrdmas'
> 0x708FDF00 (  15596) waiting 0.003965504 seconds, NSDThread: on
> ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason
> 'waiting for conn rdmas < conn maxrdmas'
> 0x709185A0 (  15617) waiting 0.003916346 seconds, NSDThread: on
> ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason
> 'waiting for conn rdmas < conn maxrdmas'
> 0x7092B380 (  15632) waiting 0.003659610 seconds, NSDThread: on
> ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting
> for conn rdmas < conn maxrdmas'
>
> I've tried tweaking verbsRdmasPerConnection, but the issue persists. Has
> anyone encountered this, and if so, how did you fix it?
>
> -Aaron
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
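For reference, a hedged sketch of the knobs involved here (values are
illustrative only, node names are placeholders, and verbs settings
typically require restarting GPFS on the affected nodes):

   # watch the waiters while the server is under load
   mmdiag --waiters

   # per-connection in-flight RDMA limit (the 'conn maxrdmas' in the waiter)
   mmchconfig verbsRdmasPerConnection=64 -N nsd01,nsd02

   # node-wide limit across all connections (see the reply above)
   mmchconfig verbsRdmasPerNode=3074 -N nsd01,nsd02
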
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss