Eduardo,

One or two E5506 CPUs in the OSS? What is the specific LSI controller, and how many of them are in the OSS?
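If you're not sure of the exact model, something like this run on the OSS should show it (a rough sketch; the grep patterns and the mpt2sas driver name are assumptions that may need adjusting for your hardware):

  # controller model string(s) and count, from the PCI bus
  OSS-server# lspci | grep -i -e lsi -e sas

  # driver-level view (mpt2sas is the usual driver for LSI SAS2 HBAs)
  OSS-server# dmesg | grep -i mpt2sas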
I think the OSS is under-provisioned for 8 OSTs. I'm betting you see high iowait on those sd devices during your problematic run, and that the iowait grows until deadlock. Can you run the job while keeping an eye on top in a shell on the OSS? You're likely hitting 99% iowait.
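Something along these lines would capture it while the job runs (a rough sketch, assuming the sysstat package is installed on the OSS; adjust the interval and iteration count to taste). A second sketch at the end of this message shows how to inspect the OST service-thread counts mentioned in the watchdog message you quoted.

  # log per-device utilization and wait times every 5 seconds during the run
  OSS-server# iostat -x 5 > /tmp/iostat-during-run.log &

  # or batch-mode top; the 'wa' value in the Cpu(s) line is iowait
  OSS-server# top -b -d 5 -n 120 > /tmp/top-during-run.log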
--Jeff

On Thursday, October 17, 2013, Eduardo Murrieta wrote:

> I have this in the debug_file from my OSS:
>
> 00000010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read())
> lustre-OST0000: Bulk IO read error with 0afb2e4c-d
> 870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), client will retry: rc -107
>
> 00000400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time())
> Service thread pid 3099 completed after 227.00s. This indicates the system
> was overloaded (too many service threads, or there were not enough hardware
> resources).
>
> But I can read files stored on this OST from other clients without
> problems. For example:
>
> $ lfs find --obd lustre-OST0000 .
> ./src/BLAS/srot.f
> ...
>
> $ more ./src/BLAS/srot.f
>       SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)
> *     .. Scalar Arguments ..
>       REAL C,S
>       INTEGER INCX,INCY,N
> *     ..
> *     .. Array Arguments ..
>       REAL SX(*),SY(*)
> ...
> ...
>
> This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a quad-core Xeon
> E5506. Tomorrow I'll increase the memory, in case that is the missing
> resource.
>
> 2013/10/17 Joseph Landman <[email protected]>
>
> Are there device- or filesystem-level error messages on the server? This
> almost looks like a corrupted filesystem.
>
> Please pardon brevity and typos ... Sent from my iPhone
>
> On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta <[email protected]>
> wrote:
>
> Hello Jeff,
>
> No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares
> at UNAM. We are working on the installation for Alice at DGTIC too, but
> this problem is with our local filesystem.
>
> The OSTs are connected using an LSI SAS controller; we have 8 OSTs on the
> same server. There are nodes that lose connection with all the OSTs that
> belong to this server, but the problem is not related to the OST-OSS
> communication, since I can access this OST and read files stored there
> from other Lustre clients.
>
> The problem is a deadlock condition in which the OSS and some clients
> refuse connections from each other, as I can see from dmesg:
>
> on the client:
> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
> 10.2.2.3@o2ib, operation ost_connect failed with -16.
>
> on the server:
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4@o2ib) reconnecting
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
>
> This only happens with clients that are reading a lot of small files
> (~100 MB each) on the same OST.
>
> thank you,
>
> Eduardo
>
> 2013/10/17 Jeff Johnson <[email protected]>
>
> Hola Eduardo,
>
> How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
> Are there any non-Lustre errors in the dmesg output of the OSS?
> Block device errors on the OSS (/dev/sd?)?
>
> If you are losing [scsi,sas,fc,srp] connectivity you may see this sort
> of thing. If the OSTs are connected to the OSS node via IB SRP and your
> IB fabric gets busy or you have subnet manager issues, you might see a
> condition like this.
>
> Is this the AliceFS at DGTIC?
>
> --Jeff
>
> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
>> Hello,
>>
>> This is my first post on this list. I hope someone can give me some
>> advice on how to resolve the following issue.
>>
>> I'm using the Lustre 2.4.0 RC2 release compiled from Whamcloud
>> sources; this is an upgrade from Lustre 2.2.22 built from the same
>> sources.
>>
>> The situation is:
>>
>> There are several clients reading files that belong mostly to the
>> same OST. After a period of time the clients start losing contact
>> with this OST, and processes stop due to this fault. Here is the
>> state of that OST on one client:
>>
>> client# lfs check servers
>> ...
>> lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
>> unavailable
>> ...
>>
>> Checking dmesg on the client and on the OSS server we have:
>>
>> client# dmesg
>> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
>> with 10.2.2.3@o2ib, operation ost_connect failed with -16.
>> LustreError: Skipped 24 previous similar messages
>>
>> OSS-server# dmesg
>> ....
>> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
>> (at 10.2.64.4@o2ib) reconnecting
>> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
>> (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
>> ....
>>
>> At this moment I can ping from client to server and vice versa, but
>> sometimes this call also hangs on both the server and the client.
>>
>> client# lctl ping OSS-server@o2ib
>> 12345-0@lo
>> 12345-OSS-server@o2ib
>>
>> OSS-server# lctl ping 10.2.64.4@o2ib
>> 12345-0@lo
>> 1234
>
> --
> ------------------------------
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> [email protected]
> www.aeoncomputing.com
> t: 858-412-3810 x1001  f: 858-412-3845  m: 619-204-9061
>
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
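On the watchdog message quoted above ("too many service threads, or there were not enough hardware resources"): Lustre 2.x exposes the OST I/O service-thread counts through lctl, so you can see how many threads the OSS has spawned and cap them if they oversubscribe the disks. A minimal sketch (the parameter names follow the Lustre manual; the value 64 is only an example, not a recommendation for your hardware):

  # how many ost_io service threads are running, and the current ceiling
  OSS-server# lctl get_param ost.OSS.ost_io.threads_started
  OSS-server# lctl get_param ost.OSS.ost_io.threads_max

  # example: lower the ceiling so 8 busy OSTs don't thrash 12 GB of RAM
  OSS-server# lctl set_param ost.OSS.ost_io.threads_max=64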
