Do you use IPoIB for the RDMA nodes or regular ethernet? And what OS are
you on?
We had an issue with the EL7.1 kernel and IPoIB: there is packet loss with
IPoIB and MLNX_OFED (and Mellanox engineering told us it might affect the
stock OFED shipped with EL7.1 as well; the 7.0 kernels are fine). Client
expels were the first sign of it on our setup.
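If you want a quick way to see whether you are hitting it, watching the
IPoIB interface drop/error counters is a reasonable start. A minimal
sketch, assuming the interface is called ib0 (adjust for your nodes):

    #!/usr/bin/env python3
    # Print IPoIB interface packet/drop/error counters from sysfs.
    # "ib0" is an assumption -- substitute the IPoIB interface name on your nodes.
    import pathlib

    IFACE = "ib0"
    STATS = ["rx_packets", "rx_dropped", "rx_errors",
             "tx_packets", "tx_dropped", "tx_errors"]

    base = pathlib.Path("/sys/class/net") / IFACE / "statistics"
    for name in STATS:
        print(f"{IFACE} {name}: {int((base / name).read_text())}")

Drop counters that keep climbing between runs would be consistent with the
packet loss we saw.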
stijn
On 07/01/2015 05:58 PM, Vic Cornell wrote:
If it used to work, then it's probably not configuration. Most expels are the
result of network connectivity problems.
If your cluster is not too big, try pinging from every node to every other
node and look for large latencies.
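For example, a rough all-to-all sweep along these lines (the node names and
the 5 ms threshold are placeholders) can flag slow pairs:

    #!/usr/bin/env python3
    # Run on each node: ping every other node and flag high average RTTs.
    # NODES and THRESHOLD_MS are placeholders -- adjust for your cluster.
    import re
    import socket
    import subprocess

    NODES = ["node01", "node02", "node03"]   # hypothetical host names
    THRESHOLD_MS = 5.0

    me = socket.gethostname()
    for peer in NODES:
        if peer == me:
            continue
        out = subprocess.run(["ping", "-c", "3", "-q", peer],
                             capture_output=True, text=True).stdout
        # the summary line looks like: rtt min/avg/max/mdev = 0.10/0.12/0.15/0.02 ms
        m = re.search(r"= [\d.]+/([\d.]+)/", out)
        if m is None:
            print(f"{me} -> {peer}: no reply")
        elif float(m.group(1)) > THRESHOLD_MS:
            print(f"{me} -> {peer}: avg RTT {m.group(1)} ms (slow)")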
Also look to see who is expelling whom, i.e. whether your RDMA nodes are being
expelled by non-RDMA nodes. It may point to a weakness in your network which
GPFS, being a great finder of weaknesses, is having a problem with.
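A rough way to get that picture is to tally the expel messages in the GPFS
log on each node; a sketch, assuming the usual log location (the exact path
and message wording vary by GPFS release):

    #!/usr/bin/env python3
    # Tally expel-related messages from the GPFS log on this node.
    # The log path and exact wording vary by GPFS release -- treat this as
    # a rough sketch rather than a parser for a fixed format.
    import collections
    import re

    LOG = "/var/adm/ras/mmfs.log.latest"   # typical location; adjust if needed

    counts = collections.Counter()
    with open(LOG, errors="replace") as fh:
        for line in fh:
            if re.search(r"expel", line, re.IGNORECASE):
                counts[line.strip()] += 1

    for message, n in counts.most_common(20):
        print(f"{n:5d}  {message}")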
Also, more details (network config etc.) will elicit more detailed suggestions.
Cheers,
Vic
On 1 Jul 2015, at 16:52, Chris Hunter <[email protected]> wrote:
Hi UG list,
We have a large RDMA/TCP multi-cluster GPFS filesystem; about 2/3 of the
clients use RDMA. We see a large number of expels of the RDMA clients but
fewer of the TCP clients.
Most of the GPFS config is at defaults. We are unclear whether any of the
non-RDMA config items (e.g. idle socket timeout) would help our issue. Any
suggestions on GPFS config parameters we should investigate?
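For reference, a quick sketch we could use to dump the current values of the
parameters we suspect (the parameter names below are our guesses at the
relevant ones and may not all exist on every release):

    #!/usr/bin/env python3
    # Dump current values of a few lease/socket-related GPFS parameters.
    # The parameter names are guesses at the relevant ones -- they may not
    # all exist on every GPFS release.
    import subprocess

    PARAMS = ["idleSocketTimeout", "failureDetectionTime",
              "minMissedPingTimeout", "maxMissedPingTimeout",
              "leaseRecoveryWait"]

    out = subprocess.run(["/usr/lpp/mmfs/bin/mmlsconfig"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if any(line.strip().startswith(p) for p in PARAMS):
            print(line.strip())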
thank-you in advance,
chris hunter
yale hpc group
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss