Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Jan-Frode Myklebust
> verbsRdmasPerConnection 256
>
> [gss_ppc64]
>
> verbsRdmasPerNode 3200
>
> [ems1-fdr,compute]
>
> verbsRdmasPerNode 1024
>
> [ems1-fdr,compute,gss_ppc64]
>
> verbsSendBufferMemoryMB 1024
>
> verbsRdmasPerNodeOptimize yes
>
> verbsRdmaUseMultiCqThreads yes
>
> [ems1-fdr,compute]
>
> ignorePrefetchLUNCount yes
>
> [gss_ppc64]
>
> scatterBufferSize 256K
>
> [ems1-fdr,compute]
>
> scatterBufferSize 256k
>
> syncIntervalStrict yes
>
> [ems1-fdr,compute,gss_ppc64]
>
> nsdClientCksumTypeLocal ck64
>
> nsdClientCksumTypeRemote ck64
>
> [gss_ppc64]
>
> pagepool 72856M
>
> [ems1-fdr]
>
> pagepool 17544M
>
> [compute]
>
> pagepool 4g
>
> [ems1-fdr,qsched03-ib0,quser10-fdr,compute,gss_ppc64]
>
> verbsRdma enable
>
> [gss_ppc64]
>
> verbsPorts mlx5_0/1 mlx5_0/2 mlx5_1/1 mlx5_1/2
>
> [ems1-fdr]
>
> verbsPorts mlx5_0/1 mlx5_0/2
>
> [qsched03-ib0,quser10-fdr,compute]
>
> verbsPorts mlx4_0/1
>
> [common]
>
> autoload no
>
> [ems1-fdr,compute,gss_ppc64]
>
> maxStatCache 0
>
> [common]
>
> envVar MLX4_USE_MUTEX=1 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1
>
> deadlockOverloadThreshold 0
>
> deadlockDetectionThreshold 0
>
> adminMode central
>
>
> File systems in cluster ess-qstorage.it.northwestern.edu:
>
> -
>
> /dev/home
>
> /dev/hpc
>
> /dev/projects
>
> /dev/tthome
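
As an aside, the bracketed names above ([gss_ppc64], [compute], ...) are
Spectrum Scale node classes, and per-class values like these are normally
inspected and changed with mmlsconfig/mmchconfig. A minimal sketch (the value
shown is purely illustrative, not a tuning recommendation):

   # show effective values, including per-node-class overrides
   mmlsconfig pagepool verbsRdma verbsPorts

   # change a value for one node class only; -i applies it immediately
   # where the attribute supports it
   mmchconfig pagepool=8G -i -N compute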
>
> On Wed, Jan 11, 2017 at 9:16 AM Luis Bolinches <luis.bolinc...@fi.ibm.com>
> wrote:
>
> In addition to what Olaf has said
>
> ESS upgrades include Mellanox module upgrades on the ESS nodes. In fact, on
> those nodes you should not update those modules on their own (unless support
> says so in your PMR), so if that has been the recommendation, I suggest you
> look at it.
>
> Changelog on ESS 4.0.4 (no idea what ESS level you are running)
>
>
>   c) Support of MLNX_OFED_LINUX-3.2-2.0.0.1
>  - Updated from MLNX_OFED_LINUX-3.1-1.0.6.1 (ESS 4.0, 4.0.1, 4.0.2)
>  - Updated from MLNX_OFED_LINUX-3.1-1.0.0.2 (ESS 3.5.x)
>  - Updated from MLNX_OFED_LINUX-2.4-1.0.2 (ESS 3.0.x)
>  - Support for PCIe3 LP 2-port 100 Gb EDR InfiniBand adapter x16 (FC EC3E)
>- Requires System FW level FW840.20 (SV840_104)
>  - No changes from ESS 4.0.3
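
(If it helps to compare against that list: the installed Mellanox OFED level
can be read on any node with the ofed_info utility that ships with MLNX_OFED,
for example.)

   # prints the installed MLNX_OFED release string
   ofed_info -s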
>
>
> --
> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
>
> Luis Bolinches
> Lab Services
> http://www-03.ibm.com/systems/services/labservices/
>
> IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
> Phone: +358 503112585
>
> "If you continually give you will continually have." Anonymous
>
>
>
> - Original message -
> From: "Olaf Weiser" <olaf.wei...@de.ibm.com>
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
>
> Cc:
> Subject: Re: [gpfsug-discuss] nodes being ejected out of the cluster
> Date: Wed, Jan 11, 2017 5:03 PM
>
> Most likely there is something wrong with your IB fabric ... you say you
> run ~700 nodes?
> Are you running with *verbsRdmaSend* enabled? If so, please consider
> disabling it - and discuss this within the PMR.
> Another thing you may check: are you running IPoIB in connected mode or
> datagram mode? But as I said, please discuss this within the PMR ... there
> are too many dependencies to discuss it here.
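
A minimal sketch of how one could check the two settings Olaf mentions,
assuming the standard Spectrum Scale CLI and an IPoIB interface named ib0
(adjust names and scope to your site, and confirm any change in the PMR first):

   # current value of verbsRdmaSend
   mmlsconfig verbsRdmaSend

   # disable it (cluster-wide here; use -N to limit it to a node class)
   mmchconfig verbsRdmaSend=no

   # IPoIB mode on this node: prints "connected" or "datagram"
   cat /sys/class/net/ib0/mode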
>
>
> cheers
>
>
> Mit freundlichen Grüßen / Kind regards
>
>
> Olaf Weiser
>
> EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage
> Platform
>
> ---
> IBM Deutschland
> IBM Allee 1
> 71139 Ehningen
> Phone: +49-170-579-44-66
> E-Mail: olaf.wei...@de.ibm.com
>
> ---
> IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
> Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert
> Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
> HRB 14562 / WEEE-Reg.-Nr. DE 99369940
>
>
>
> From: Damir Krstic <damir.krs...@gmail.com>
> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> Date: 01/11/2017 03:39 PM
> Subject: [gpfsug-discuss] nodes being ejected out of the cluster

Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
The RDMA errors, I think, are secondary to whatever is going on with either your
IPoIB or Ethernet fabric, which I assume is causing the IPoIB communication
breakdowns and expulsions. We've had entire IB fabrics go offline, and as long as
the nodes weren't depending on them for daemon communication, nobody got expelled.
Do you have a subnet defined for your IPoIB network, or are your nodes' daemon
interfaces already set to their IPoIB interfaces? Have you checked your SM logs?
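
For what it's worth, those three checks might look roughly like the sketch below,
assuming the standard Spectrum Scale CLI and an OpenSM subnet manager writing to
its default log location (adjust paths and names to your environment):

   # is a GPFS subnet defined for the IPoIB network?
   mmlsconfig subnets

   # which addresses/interfaces the daemons are actually using
   mmlscluster
   mmdiag --network

   # subnet manager log, if OpenSM runs on this host
   less /var/log/opensm.log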



From: Damir Krstic
Sent: 1/11/17, 9:39 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] nodes being ejected out of the cluster
We are running GPFS 4.2 on our cluster (around 700 compute nodes). Our storage
(ESS GL6) is also running GPFS 4.2. Compute nodes and storage are connected via
InfiniBand (FDR14). At the time of the ESS implementation we were instructed to
enable RDMA in addition to IPoIB. Previously we only ran IPoIB on our GPFS 3.5
cluster.

Ever since the implementation (sometime back in July of 2016) we have seen a lot
of compute nodes being ejected. What usually precedes the ejection are the
following messages:

Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR 
index 1
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 
vendor_err 135
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 
(gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR 
index 400

Even our ESS IO server sometimes ends up being ejected (case in point - 
yesterday morning):

Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum 0 
vendor_err 135
Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_1 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 3001
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum 0 
vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_1 port 2 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2671
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum 0 
vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_0 port 2 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 2495
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA rdma send error 
IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum 0 
vendor_err 135
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 
(gssio1-fdr) on mlx5_0 port 1 fabnum 0 due to send error 
IBV_WC_RNR_RETRY_EXC_ERR index 3077
Jan 10 11:24:11 gssio2 mmfs: [N] Node 172.41.2.1 (gssio1-fdr) lease renewal is 
overdue. Pinging to check if it is alive
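
(For context: IBV_WC_RNR_RETRY_EXC_ERR means the RDMA sender exhausted its
receiver-not-ready retries, i.e. the peer repeatedly had no receive buffer
posted in time, which tends to point at the receiving node or the fabric rather
than the sender. A rough first pass over the fabric, assuming the standard
infiniband-diags tools are installed, could be:)

   # local HCA/port state and counters
   ibstat
   perfquery

   # fabric-wide sweep for ports reporting link or retry errors
   ibqueryerrors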

I've had multiple PMRs open for this issue, and I am told that our ESS needs
code-level upgrades in order to fix it. Looking at the errors, I think the issue
is InfiniBand related, and I am wondering whether anyone on this list has seen
similar issues?

Thanks for your help in advance.

Damir
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] nodes being ejected out of the cluster

2017-01-11 Thread Olaf Weiser
Most likely there is something wrong with your IB fabric ... you say you run
~700 nodes? Are you running with verbsRdmaSend enabled? If so, please consider
disabling it - and discuss this within the PMR. Another thing you may check:
are you running IPoIB in connected mode or datagram mode? But as I said, please
discuss this within the PMR ... there are too many dependencies to discuss it
here.

cheers

Mit freundlichen Grüßen / Kind regards

Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform
---
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com
---
IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert
Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
HRB 14562 / WEEE-Reg.-Nr. DE 99369940

From: Damir Krstic
To: gpfsug main discussion list
Date: 01/11/2017 03:39 PM
Subject: [gpfsug-discuss] nodes being ejected out of the cluster
Sent by: gpfsug-discuss-boun...@spectrumscale.org