Ah, I understand. I performed the onsite Lustre installation of Alice and worked with JLG and his staff. Nice group of people!
This seems like a backend issue: ldiskfs or the LSI RAID devices. Do you see any read/write failures reported on the OSS for the sd block devices where the OSTs reside? Something is timing out; either disk I/O, or the OSS is running too high an iowait under load. How many OSS nodes are in the filesystem? Are these operations striped across all OSTs? Across multiple OSSs?

I still have an account on DGTIC's gateway, I could log in and look. :-)

--Jeff

On Thursday, October 17, 2013, Eduardo Murrieta wrote:
> Hello Jeff,
>
> No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
> UNAM. We are working on the installation for Alice at DGTIC too, but this
> problem is with our local filesystem.
>
> The OST is connected using an LSI SAS controller; we have 8 OSTs on the
> same server. There are nodes that lose the connection with all the OSTs that
> belong to this server, but the problem is not related to the OST-OSS
> communication, since I can access this OST and read files stored there
> from other Lustre clients.
>
> The problem is a deadlock condition in which the OSS and some clients
> refuse connections from each other, as I can see from dmesg:
>
> on the client:
> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
> 10.2.2.3@o2ib, operation ost_connect failed with -16.
>
> on the server:
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4@o2ib) reconnecting
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
>
> This only happens with clients that are reading a lot of small files
> (~100MB each) on the same OST.
>
> thank you,
>
> Eduardo
>
>
> 2013/10/17 Jeff Johnson <jeff.john...@aeoncomputing.com>
>
> Hola Eduardo,
>
> How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
> Are there any non-Lustre errors in the dmesg output of the OSS?
> Block device errors on the OSS (/dev/sd?)?
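For example, a quick first pass on the OSS would be something like the following. This is only a rough sketch: the dmesg lines in the sample file are illustrative, not output from Eduardo's machine, and device names will differ.

```shell
# Illustrative sample of what backend trouble looks like in OSS dmesg;
# on a real OSS these lines would come from `dmesg` itself.
cat > /tmp/oss_dmesg_sample.txt <<'EOF'
sd 0:0:10:0: [sdk] Unhandled sense code
end_request: I/O error, dev sdk, sector 123456
Lustre: lustre-OST000a: Client 0afb2e4c reconnecting
EOF

# Any hits here implicate the block devices backing the OSTs, not Lustre:
grep -E 'end_request: I/O error|Unhandled sense code' /tmp/oss_dmesg_sample.txt

# On a live OSS you would run the checks directly, e.g.:
#   dmesg | grep -E 'I/O error|sense code|sd [0-9]+:'
#   iostat -x 5    # sustained high await/%util on the OST devices
#   top            # high %wa (iowait) under load
```

If the sd devices are clean and iowait is low under load, the backend is probably not the culprit and the RPC deadlock is worth chasing on the Lustre side.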
>
> If you are losing [scsi,sas,fc,srp] connectivity you may see this sort
> of thing. If the OSTs are connected to the OSS node via IB SRP and your
> IB fabric gets busy, or you have subnet manager issues, you might see a
> condition like this.
>
> Is this the AliceFS at DGTIC?
>
> --Jeff
>
> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
> > Hello,
> >
> > This is my first post on this list. I hope someone can give me some
> > advice on how to resolve the following issue.
> >
> > I'm using the Lustre release 2.4.0 RC2 compiled from Whamcloud
> > sources; this is an upgrade from Lustre 2.2.22 from the same sources.
> >
> > The situation is:
> >
> > There are several clients reading files that belong mostly to the
> > same OST. After a period of time the clients start losing contact
> > with this OST, and processes stop due to this fault. Here is the state
> > for such an OST on one client:
> >
> > client# lfs check servers
> > ...
> > lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
> > unavailable
> > ...
> >
> > Checking dmesg on the client and the OSS server we have:
> >
> > client# dmesg
> > LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
> > with 10.2.2.3@o2ib, operation ost_connect failed with -16.
> > LustreError: Skipped 24 previous similar messages
> >
> > OSS-server# dmesg
> > ....
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4@o2ib) reconnecting
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
> > ....
> >
> > At this moment I can ping from client to server and vice versa, but
> > sometimes this call also hangs on the server and the client.
> >
> > client# lctl ping OSS-server@o2ib
> > 12345-0@lo
> > 12345-OSS-server@o2ib
> >
> > OSS-server# lctl ping 10.2.64.4@o2ib
> > 12345-0@lo
> > 12345-10.2.64.4@o2ib
> >
> > This situation happens very frequently, especially with jobs that
> > process a lot of files with an average size of 100MB.
> >
> > The only solution I have found to reestablish the communication
> > between the server and the client is restarting both machines.
> >
> > I hope someone has an idea what the reason for the problem is and how
> > I can reset the communication with the clients without restarting the
> > machines.
> >
> > thank you,
> >
> > Eduardo
> > UNAM@Mexico
> >
> > --
> > Eduardo Murrieta
> > Unidad de Cómputo
> > Instituto de Ciencias Nucleares, UNAM
> > Ph. +52-55-5622-4739 ext. 5103
>
> --
> Eduardo Murrieta

--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001  f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
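P.S. On the "without restarting the machines" question: Lustre can force-evict a stuck client from the OSS side, which drops its outstanding RPC state and lets it reconnect cleanly. A sketch below; the OST name and client NID are taken from the log excerpts above, and the exact parameter names should be checked against the manual for your 2.4 release before relying on them. The fallback echo just makes the sketch runnable on a box without lctl.

```shell
OST=lustre-OST000a    # the OST reporting "still busy with N active RPCs"
NID=10.2.64.4@o2ib    # NID of the stuck client, from the OSS dmesg

# On the OSS: force-evict the client; it should reconnect on its next RPC.
lctl set_param obdfilter.${OST}.evict_client=${NID} 2>/dev/null \
  || echo "would run: lctl set_param obdfilter.${OST}.evict_client=${NID}"

# On the client, bouncing the import for that OSC can also clear it:
#   lctl set_param osc.${OST}-osc-*.active=0
#   lctl set_param osc.${OST}-osc-*.active=1
```

Far less disruptive than a reboot if it works, and harmless to the other clients of that OST.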
_______________________________________________ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss