I have this in the debug file on my OSS:

00000010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read()) lustre-OST0000: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), client will retry: rc -107
00000400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time()) Service thread pid 3099 completed after 227.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
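For reference, rc -107 is ENOTCONN ("Transport endpoint is not
connected"), and the watchdog message suggests the ost_io service
threads were tied up. Before adding memory I want to check whether the
IO service is simply saturated. This is only a sketch using the Lustre
2.x procfs parameter names; please verify them on your own build first:

oss# lctl list_param -R ost.OSS.ost_io   # confirm the parameter names
oss# lctl get_param ost.OSS.ost_io.threads_min \
                    ost.OSS.ost_io.threads_max \
                    ost.OSS.ost_io.threads_started
# If threads_started sits at threads_max while the watchdog fires, the
# IO service is saturated; the limit can be raised at runtime, e.g.:
oss# lctl set_param ost.OSS.ost_io.threads_max=256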
But I can read files stored on this OST from other clients without
problems. For example:

$ lfs find --obd lustre-OST0000 .
./src/BLAS/srot.f
...
$ more ./src/BLAS/srot.f
      SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)
*     .. Scalar Arguments ..
      REAL C,S
      INTEGER INCX,INCY,N
*     ..
*     .. Array Arguments ..
      REAL SX(*),SY(*)
...

This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a Xeon Quad
Core E5506. Tomorrow I'll increase the memory, in case that is the
missing resource.

2013/10/17 Joseph Landman <land...@scalableinformatics.com>

> Are there device- or filesystem-level error messages on the server?
> This almost looks like a corrupted file system.
>
> Please pardon brevity and typos ... Sent from my iPhone
>
> On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta
> <emurri...@nucleares.unam.mx> wrote:
>
> Hello Jeff,
>
> No, this is a Lustre filesystem for the Instituto de Ciencias
> Nucleares at UNAM; we are working on the installation for Alice at
> DGTIC too, but this problem is with our local filesystem.
>
> The OSTs are connected through an LSI SAS controller. We have 8 OSTs
> on the same server, and there are nodes that lose the connection with
> all the OSTs that belong to this server, but the problem is not
> related to the OST-OSS communication, since I can access this OST and
> read files stored there from other Lustre clients.
>
> The problem is a deadlock condition in which the OSS and some clients
> refuse connections from each other, as I can see from dmesg:
>
> on the client:
> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
> with 10.2.2.3@o2ib, operation ost_connect failed with -16.
>
> on the server:
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> (at 10.2.64.4@o2ib) reconnecting
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active
> RPCs
>
> This only happens with clients that are reading a lot of small files
> (~100MB each) from the same OST.
>
> thank you,
>
> Eduardo
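One thing I want to try before rebooting is evicting the stuck export
on the OSS so the client can reconnect cleanly. This is only a sketch
based on the obdfilter.*.evict_client parameter that I believe the 2.x
servers expose; please confirm it exists on your build before relying
on it:

oss# lctl list_param obdfilter.lustre-OST000a.evict_client
oss# lctl set_param \
     obdfilter.lustre-OST000a.evict_client=0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
# it should also accept "nid:10.2.64.4@o2ib" to evict all exports
# from that NID
client# lfs check servers   # the OSC should reconnect after eviction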
> 2013/10/17 Jeff Johnson <jeff.john...@aeoncomputing.com>
>
>> Hola Eduardo,
>>
>> How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
>> Are there any non-Lustre errors in the dmesg output of the OSS?
>> Block device errors on the OSS (/dev/sd?)?
>>
>> If you are losing [scsi,sas,fc,srp] connectivity you may see this
>> sort of thing. If the OSTs are connected to the OSS node via IB SRP
>> and your IB fabric gets busy or you have subnet manager issues you
>> might see a condition like this.
>>
>> Is this the AliceFS at DGTIC?
>>
>> --Jeff
>>
>> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
>> > Hello,
>> >
>> > this is my first post on this list; I hope someone can give me
>> > some advice on how to resolve the following issue.
>> >
>> > I'm using the Lustre 2.4.0 RC2 release compiled from the Whamcloud
>> > sources; this is an upgrade from Lustre 2.2.22 built from the same
>> > sources.
>> >
>> > The situation is:
>> >
>> > There are several clients reading files that belong mostly to the
>> > same OST. After a period of time the clients start losing contact
>> > with this OST, and processes stop due to this fault. Here is the
>> > state of that OST on one client:
>> >
>> > client# lfs check servers
>> > ...
>> > lustre-OST000a-osc-ffff8801bc548000: check error: Resource
>> > temporarily unavailable
>> > ...
>> >
>> > Checking dmesg on the client and on the OSS server we have:
>> >
>> > client# dmesg
>> > LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000:
>> > Communicating with 10.2.2.3@o2ib, operation ost_connect failed
>> > with -16.
>> > LustreError: Skipped 24 previous similar messages
>> >
>> > OSS-server# dmesg
>> > ...
>> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
>> > (at 10.2.64.4@o2ib) reconnecting
>> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
>> > (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active
>> > RPCs
>> > ...
>> >
>> > At this moment I can ping from the client to the server and vice
>> > versa, but sometimes this call also hangs on both the server and
>> > the client.
>> >
>> > client# lctl ping OSS-server@o2ib
>> > 12345-0@lo
>> > 12345-OSS-server@o2ib
>> >
>> > OSS-server# lctl ping 10.2.64.4@o2ib
>> > 12345-0@lo
>> > 12345-10.2.64.4@o2ib
>> >
>> > This situation happens very frequently, especially with jobs that
>> > process a lot of files with an average size of 100MB.
>> >
>> > The only solution I have found to reestablish communication
>> > between the server and the client is to restart both machines.
>> >
>> > I hope someone has an idea of what is causing the problem and how
>> > I can reset the communication with the clients without restarting
>> > the machines.
>> >
>> > thank you,
>> >
>> > Eduardo
>> > UNAM@Mexico
>> >
>> > --
>> > Eduardo Murrieta
>> > Unidad de Cómputo
>> > Instituto de Ciencias Nucleares, UNAM
>> > Ph. +52-55-5622-4739 ext. 5103
>>
>> --
>> Jeff Johnson
>> Co-Founder
>> Aeon Computing
>>
>> jeff.john...@aeoncomputing.com
>> www.aeoncomputing.com
>> t: 858-412-3810 x1001  f: 858-412-3845
>> m: 619-204-9061
>>
>> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>>
>> High-performance Computing / Lustre Filesystems / Scale-out Storage
>
> --
> Eduardo Murrieta
> Unidad de Cómputo
> Instituto de Ciencias Nucleares, UNAM
> Ph. +52-55-5622-4739 ext. 5103
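On the client side, instead of rebooting, I will also try bouncing
only the affected OSC. Again just a sketch, assuming the osc.*.active
tunable behaves on 2.4 as documented for the 2.x series:

client# lctl get_param osc.lustre-OST000a-osc-*.active    # 1 = active
client# lctl set_param osc.lustre-OST000a-osc-*.active=0  # drop the stuck OSC
client# lctl set_param osc.lustre-OST000a-osc-*.active=1  # re-enable it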
--
Eduardo Murrieta
Unidad de Cómputo
Instituto de Ciencias Nucleares, UNAM
Ph. +52-55-5622-4739 ext. 5103

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss