Maybe the clients should mount the file system with the "localflock" option? Please check the manual for details, but I think this is the same problem we had a while back, where a dynamic link was failing.
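
As a rough sketch only, a client /etc/fstab entry with that option might look like the line below. The MGS NID (mgs@tcp) and the mount point (/mnt/lustre) are placeholders, not values from this thread, and the file system name "lustre" is only a guess based on the lustre-OST0000 target name in the server logs.

```
# hypothetical MGS NID and mount point; adjust to the real site values
mgs@tcp:/lustre  /mnt/lustre  lustre  defaults,_netdev,localflock  0 0
```

Note that localflock only provides flock locking local to each client, so it is not coherent across nodes; the "flock" option is the cluster-wide variant.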
bob

On 2/9/2011 7:24 PM, James Robnett wrote:
>> Normally I've had no problems but recently I have multiple clients
>> reporting the following error:
>>
>> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo
>> for recoverable error req@ffff8101ae084000 x1358858531428366/t60136289752
>> o4->[email protected]@o2ib:6/4 lens 448/608 e 0 to 1 dl
>> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
>>
>> which in turn appears to generate a premature EOF on our user software.
>>
>> There are no corresponding errors on the servers.
> The above is not true. There are apparently corresponding errors of
> the form:
>
> Feb 9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 964d53e2
> Feb 9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) Skipped 43 previous similar
> messages
> Feb 9 17:05:21 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
> BAD WRITE CHECKSUM: changed in transit before arrival at OST from
> 12345-10.64.1.212@tcp inum 2981338/1802650709 object 8183950/0 extent
> [2384461824-2385510399]
> Feb 9 17:05:21 lustre-oss-1 kernel: LustreError: Skipped 43 previous
> similar messages
> Feb 9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
> server csum 964d53e2, server csum now 964d53e2
> Feb 9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1100:ost_brw_write()) Skipped 43 previous similar
> messages
> Feb 9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 180cd9bd
> Feb 9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1038:ost_brw_write()) Skipped 63 previous similar
> messages
> Feb 9 17:10:22 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
> BAD WRITE CHECKSUM: changed in transit before arrival at OST from
> 12345-10.64.1.212@tcp inum 2981338/1802650709 object 8183950/0 extent
> [4355784704-4356833279]
> Feb 9 17:10:22 lustre-oss-1 kernel: LustreError: Skipped 63 previous
> similar messages
> Feb 9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
> server csum 180cd9bd, server csum now 180cd9bd
> Feb 9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1100:ost_brw_write()) Skipped 63 previous similar
> messages
>
> The other OSS shows similar errors. We are doing mmap I/O and a
> search implies those errors are related to mmap I/O.
>
> I'm open to suggestions, in the meantime the userspace code can be
> switched from mmap to regular file I/O via an rc file so we'll try that
> and see if it at least makes the errors go away.
>
> James
>
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
