Ah, I understand. I performed the onsite Lustre installation of Alice and worked with JLG and his staff. Nice group of people!
This seems like a backend issue: ldiskfs or the LSI RAID devices. Do you see any read/write failures reported on the OSS for the sd block devices where the OSTs reside? Something is timing out; either disk I/O, or the OSS is running too high an iowait under load. How many OSS nodes are in the filesystem? Are these operations striped across all OSTs? Across multiple OSSs?

I still have an account on DGTIC's gateway, I could log in and look. :-)

--Jeff

On Thursday, October 17, 2013, Eduardo Murrieta wrote:
> Hello Jeff,
>
> No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
> UNAM. We are working on the installation for Alice at DGTIC too, but this
> problem is with our local filesystem.
>
> The OST is connected using an LSI SAS controller; we have 8 OSTs on the
> same server. There are nodes that lose the connection with all the OSTs that
> belong to this server, but the problem is not related to the OST-OSS
> communication, since I can access this OST and read files stored there
> from other Lustre clients.
>
> The problem is a deadlock condition in which the OSS and some clients
> refuse connections from each other, as I can see from dmesg:
>
> on the client:
> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
> 10.2.2.3@o2ib, operation ost_connect failed with -16.
>
> on the server:
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4@o2ib) reconnecting
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
>
> This only happens with clients that are reading a lot of small files
> (~100MB each) on the same OST.
>
> thank you,
>
> Eduardo
>
>
> 2013/10/17 Jeff Johnson <jeff.john...@aeoncomputing.com>
>
> Hola Eduardo,
>
> How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
> Are there any non-Lustre errors in the dmesg output of the OSS?
> Block device errors on the OSS (/dev/sd?)?
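For example, a quick first pass on the OSS would be something like the following. This is only a rough sketch: the dmesg lines in the sample file are illustrative, not output from Eduardo's machine, and device names will differ.

```shell
# Illustrative sample of what backend trouble looks like in OSS dmesg;
# on a real OSS these lines would come from `dmesg` itself.
cat > /tmp/oss_dmesg_sample.txt <<'EOF'
sd 0:0:10:0: [sdk] Unhandled sense code
end_request: I/O error, dev sdk, sector 123456
Lustre: lustre-OST000a: Client 0afb2e4c reconnecting
EOF

# Any hits here implicate the block devices backing the OSTs, not Lustre:
grep -E 'end_request: I/O error|Unhandled sense code' /tmp/oss_dmesg_sample.txt

# On a live OSS you would run the checks directly, e.g.:
#   dmesg | grep -E 'I/O error|sense code|sd [0-9]+:'
#   iostat -x 5    # sustained high await/%util on the OST devices
#   top            # high %wa (iowait) under load
```

If the sd devices are clean and iowait is low under load, the backend is probably not the culprit and the RPC deadlock is worth chasing on the Lustre side.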
>
> If you are losing [scsi,sas,fc,srp] connectivity you may see this sort
> of thing. If the OSTs are connected to the OSS node via IB SRP and your
> IB fabric gets busy, or you have subnet manager issues, you might see a
> condition like this.
>
> Is this the AliceFS at DGTIC?
>
> --Jeff
>
> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
> > Hello,
> >
> > This is my first post on this list. I hope someone can give me some
> > advice on how to resolve the following issue.
> >
> > I'm using the Lustre release 2.4.0 RC2 compiled from Whamcloud
> > sources; this is an upgrade from Lustre 2.2.22 from the same sources.
> >
> > The situation is:
> >
> > There are several clients reading files that belong mostly to the
> > same OST. After a period of time the clients start losing contact
> > with this OST, and processes stop due to this fault. Here is the state
> > for such an OST on one client:
> >
> > client# lfs check servers
> > ...
> > lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
> > unavailable
> > ...
> >
> > Checking dmesg on the client and the OSS server we have:
> >
> > client# dmesg
> > LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
> > with 10.2.2.3@o2ib, operation ost_connect failed with -16.
> > LustreError: Skipped 24 previous similar messages
> >
> > OSS-server# dmesg
> > ....
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4@o2ib) reconnecting
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
> > ....
> >
> > At this moment I can ping from client to server and vice versa, but
> > sometimes this call also hangs on the server and the client.
> >
> > client# lctl ping OSS-server@o2ib
> > 12345-0@lo
> > 12345-OSS-server@o2ib
> >
> > OSS-server# lctl ping 10.2.64.4@o2ib
> > 12345-0@lo
> > 12345-10.2.64.4@o2ib
> >
> > This situation happens very frequently, especially with jobs that
> > process a lot of files with an average size of 100MB.
> >
> > The only solution I have found to reestablish the communication
> > between the server and the client is restarting both machines.
> >
> > I hope someone has an idea what the reason for the problem is and how
> > I can reset the communication with the clients without restarting the
> > machines.
> >
> > thank you,
> >
> > Eduardo
> > UNAM@Mexico
> >
> > --
> > Eduardo Murrieta
> > Unidad de Cómputo
> > Instituto de Ciencias Nucleares, UNAM
> > Ph. +52-55-5622-4739 ext. 5103
>
> --
> Eduardo Murrieta

--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001  f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
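P.S. On the "without restarting the machines" question: Lustre can force-evict a stuck client from the OSS side, which drops its outstanding RPC state and lets it reconnect cleanly. A sketch below; the OST name and client NID are taken from the log excerpts above, and the exact parameter names should be checked against the manual for your 2.4 release before relying on them. The fallback echo just makes the sketch runnable on a box without lctl.

```shell
OST=lustre-OST000a    # the OST reporting "still busy with N active RPCs"
NID=10.2.64.4@o2ib    # NID of the stuck client, from the OSS dmesg

# On the OSS: force-evict the client; it should reconnect on its next RPC.
lctl set_param obdfilter.${OST}.evict_client=${NID} 2>/dev/null \
  || echo "would run: lctl set_param obdfilter.${OST}.evict_client=${NID}"

# On the client, bouncing the import for that OSC can also clear it:
#   lctl set_param osc.${OST}-osc-*.active=0
#   lctl set_param osc.${OST}-osc-*.active=1
```

Far less disruptive than a reboot if it works, and harmless to the other clients of that OST.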
_______________________________________________ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss