As I mentioned, I am doing a test install to see what I want to run for
deployment. We have run a couple of Lustre installs, one 1.8.x based and
a current production one that is 2.3. The Lustre 2.3 server set has been
up for 750 days and has been very solid. This test replaces the old 1.8
setup, and I need to come up with a consistent set of servers and clients
that I can run on our clusters. The cluster (Rocks based) will most
likely get upgraded once we have a working set. I have a set of
compute nodes that will be set up to run either CentOS 6.6 or 6.7.
I started with 2.7 since that is what I was pointed to on the
lustre.org download page. The "Most Recent Release" link points me at the
2.7.0 tree. If I follow the path to download source on that page,
git clone git://git.hpdd.intel.com/fs/lustre-release.git
it is not even apparent from the downloaded tree which version I would
be building. The ChangeLog file mentions both 2.8 and 2.7. Everything on
the Lustre download page seems to indicate I should be downloading 2.7.
Since I started with a clean install of RHEL 6.6 on my server set, I
expected that the pre-compiled server binaries would give me
a working set to test. That is when the frustration started. I tried
searching for clues based on the errors I saw, but I did not find
much that duplicated what I was seeing. I did see some odd mentions
of IB having issues in the 2.6.32-504.8.1 kernel. This did not directly
correlate with my issues, but I figured I would try a later kernel. That
is why I pulled the nightly build off of build.hpdd.intel.com and found
I could at least establish a set of servers that would talk to each other.
That is where I am now. I am trying to wrap my head around where my
issues lie. Is the problem specific to my Qlogic InfiniPath_QLE7240
cards? Is it the underlying OS-provided IB drivers? I guess I am just
really surprised that the distribution pointed to on the download page
fails out of the box on a set of servers with a clean install of the
specified OS. I just figured I must be doing something wrong (which may
still be the case).
At this point, it looks like I should back out 2.7 and build this
with the current 2.5 release.
Before I do that, however, I would like to gain some understanding of
what I am seeing right now. I have the server set built with 2.7.0 and
the 2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64 kernel on RHEL 6.6 (SL 6.6).
I rebuilt the 2.7.0 Lustre client on a RHEL (CentOS) 6.6 client, and I
could not mount the file system. It will mount my production Lustre file
system from another server set (2.3.0) without a problem. I also tried
with a RHEL 6.7 install, with the 2.7 Lustre client rebuilt for that
kernel (2.6.32-573.8.1.el6.x86_64). The client will not mount the 2.7
Lustre file system, and I cannot even "lctl ping" the server from the client.
On the client:
[root@athena-head ~]# lctl ping 172.19.120.29@o2ib
failed to ping 172.19.120.29@o2ib: Input/output error
In dmesg:
LNetError: 1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected())
172.19.120.29@o2ib rejected: incompatible # of RDMA fragments 32, 256
On the Lustre MDS server:
Dec 3 18:14:08 lustre-mds kernel: LNet:
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32 wanted)
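The "incompatible # of RDMA fragments 32, 256" / "max_frags 256 too large (32 wanted)" pair suggests the two sides are advertising different o2iblnd fragment counts. One possible (unverified) workaround is to pin the ko2iblnd `map_on_demand` module parameter to the same value on every node; the fragment below is only a sketch, and the value 32 is an assumption taken from the "(32 wanted)" side of the logs:

```conf
# /etc/modprobe.d/ko2iblnd.conf -- hypothetical fragment, not a confirmed fix.
# Pin map_on_demand so both peers offer the same number of RDMA fragments;
# 32 matches what the server side says it wants in the logs above.
options ko2iblnd map_on_demand=32
```

The lnet/ko2iblnd modules would need to be reloaded (or the nodes rebooted) on both client and servers for the setting to take effect.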
Trying to mount on the client:
[root@athena-head ~]# uname -a
Linux athena-head.aem.umn.edu 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov
10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@athena-head ~]# mount -t lustre 172.19.120.29@o2ib:/ltest /ltest
mount.lustre: mount 172.19.120.29@o2ib:/ltest at /ltest failed:
Input/output error
Is the MGS running?
Dec 3 18:21:16 athena-head kernel: LNetError:
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
rejected: incompatible # of RDMA fragments 32, 256
Dec 3 18:21:16 athena-head kernel: Lustre:
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1449188476/real 1449188476]
req@ffff88002f810c80 x1519567173058612/t0(0)
o250->MGC172.19.120.29@[email protected]@o2ib:26/25 lens 400/544 e 0 to
1 dl 1449188481 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 3 18:21:41 athena-head kernel: LNetError:
1444:0:(o2iblnd_cb.c:2649:kiblnd_rejected()) 172.19.120.29@o2ib
rejected: incompatible # of RDMA fragments 32, 256
Dec 3 18:21:41 athena-head kernel: Lustre:
6091:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has
failed due to network error: [sent 1449188501/real 1449188501]
req@ffff88021e742c80 x1519567173058628/t0(0)
o250->MGC172.19.120.29@[email protected]@o2ib:26/25 lens 400/544 e 0 to
1 dl 1449188511 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Dec 3 18:21:53 athena-head kernel: LustreError: 15c-8:
MGC172.19.120.29@o2ib: The configuration from log 'ltest-client' failed
(-5). This may be the result of communication errors between this node
and the MGS, a bad configuration, or other errors. See the syslog for
more information.
Dec 3 18:21:53 athena-head kernel: Lustre: Unmounted ltest-client
Dec 3 18:21:53 athena-head kernel: LustreError:
7346:0:(obd_mount.c:1339:lustre_fill_super()) Unable to mount (-5)
On the server:
Dec 3 18:21:41 lustre-mds kernel: LNet:
1493:0:(o2iblnd_cb.c:2278:kiblnd_passive_connect()) Can't accept conn
from 172.19.120.2@o2ib (version 12): max_frags 256 too large (32 wanted)
On 12/04/2015 06:49 AM, [email protected] wrote:
Hi,
I honestly don't know if the compiled versions available here are meant
to be used by everyone, but they are publicly browsable on Intel's Jenkins:
https://build.hpdd.intel.com
Since the source is publicly available from the Whamcloud git, IMO there
should not be any problem.
If you are in production, stick to 2.5.
Regards
On 04-12-2015 12:18, Jon Tegner wrote:
Hi,
Where do you find the 2.7.x releases? I thought fixes were only
released for the Intel maintenance version?
Regards,
/jon
On 12/04/2015 11:43 AM, [email protected] wrote:
Hello Ray,
One consideration first: you are trying the 2.7 version, which is not
the production one (aka 2.5). From this perspective, whether you run
2.7.0 or 2.7.x won't make any big difference; it is the development release.
Then, if I understand correctly, the problem comes from the InfiniBand
driver module, which is buggy in the 2.6.32-504.8.1 kernel, meaning that
you have to update the kernel to fix it. Doing this may mean that the
2.7.0 version on the site, compiled against an older kernel version,
will then refuse to load (because kernel modules - i.e. the Lustre ones
here - rely on features that may change between kernel versions,
making them incompatible).
In any case, you can try to rebuild the 2.7.0 version from source
against your new kernel. The procedure is quite easy:
https://wiki.hpdd.intel.com/display/PUB/Rebuilding+the+Lustre-client+rpms+for+a+new+kernel
It will regenerate the 2.7.0 client on your newer kernel with the
working InfiniBand modules, but stability is not guaranteed, as the
2.7 branch is under development anyway.
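For reference, the procedure behind that link is roughly the following. This is a sketch only: the src.rpm file name is illustrative (use the one from the actual 2.7.0 download area), and `--without servers` restricts the build to the client packages:

```shell
# Sketch of rebuilding the Lustre client rpms against the running kernel.
# Prerequisites: kernel headers for the running kernel plus the rpm toolchain.
yum install kernel-devel-$(uname -r) rpm-build libtool

# File name is hypothetical; substitute the src.rpm you downloaded.
rpmbuild --rebuild --without servers lustre-2.7.0-*.src.rpm
```

The resulting client rpms should land under ~/rpmbuild/RPMS/x86_64/ and can then be installed on the client nodes.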
Or use a precompiled one from the build site if you can't rebuild (some
nasty bugs in the base 2.x.0 versions are fixed in the latest builds).
The only requirement is to stick to the very same version on the MDS and
OSS, and at least the same or a newer version on the clients.
Regards
On 03-12-2015 16:13, Ray Muno wrote:
I am trying to set up a test deployment of Lustre 2.7.
I pulled RPMs from http://lustre.org/download/ and installed them on a
set of servers running Scientific Linux 6.6, which seems to be a proper
OS for deployment. Everything installs, and I can format the
filesystems on the MDS (1) and OSS (2) servers. When I try to mount
the OST file systems, I get communication errors. I can "lctl ping"
the servers from each other, but cannot establish communication
between the MDS and OSS.
The installation is on servers connected over Infiniband (Qlogic DDR
4X).
In trying to diagnose the issues related to the error messages, I
found mention in some list discussions that o2ib is broken in the
2.6.32-504.8.1 kernel.
After much frustration, I pulled a nightly build from
build.hpdd.intel.com (kernel
2.6.32-573.8.1.el6_lustre.g8438f2a.x86_64) and tried the same setup.
Everything worked as I expected.
Am I missing something? Is the default release pointed to at
https://downloads.hpdd.intel.com/ for 2.7 broken in some way? Is it
just the hardware I am trying to deploy against?
I can provide specifics about the errors I see; I am just posting this
to make sure I am pulling the Lustre RPMs from the proper source.
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
--
Ray Muno
Computer Systems Administrator
e-mail: [email protected]
Phone: (612) 625-9531
FAX: (612) 626-1558
University of Minnesota
Aerospace Engineering and Mechanics    Mechanical Engineering
110 Union St. S.E.                     111 Church Street SE
Minneapolis, MN 55455                  Minneapolis, MN 55455