I would check the Lustre manual in this case:
http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idp554480

For example, it says:

  If the reported error is anything else (such as -5, "I/O error"), it likely indicates a storage failure. The low-level file system returns this error if it is unable to read from the storage device.
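To see which servers the -5 errors point at, a small script along these lines could tally the client-side messages. This is only a sketch: the regular expression is written against the "11-0" message format quoted in this thread, the function name summarize is made up for illustration, and it assumes each syslog message sits on a single line (as it would in the client's syslog file, not as wrapped by the mail client).

```python
import errno
import re
from collections import Counter

# Matches the client-side "11-0" console message quoted in this thread, e.g.
# "... while communicating with 192.168.1.46@o2ib. The ost_write operation failed with -5"
FAILED_OP = re.compile(
    r"communicating with (?P<nid>\S+?)\.\s+"
    r"The (?P<op>\w+) operation failed with (?P<rc>-?\d+)"
)


def summarize(lines):
    """Count failed operations per (server NID, operation, errno name).

    Hypothetical helper for illustration; not part of any Lustre tool.
    """
    counts = Counter()
    for line in lines:
        match = FAILED_OP.search(line)
        if match is None:
            continue  # not a "failed with" message, e.g. "Skipped N previous ..."
        rc = abs(int(match.group("rc")))
        # errno.errorcode maps 5 -> "EIO"; fall back to the raw number if unknown.
        name = errno.errorcode.get(rc, "errno %d" % rc)
        counts[(match.group("nid"), match.group("op"), name)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): python summarize_lustre_errors.py < syslog-file
    for (nid, op, name), n in summarize(sys.stdin).most_common():
        print("%-24s %-12s %-8s %d" % (nid, op, name, n))
```

Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the counts are a lower bound, but they should still show whether one OSS NID stands out.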
lctl ping <NID> will tell us whether anything is wrong with the network communication. As I said, look into the server logs; that would help.

One more question I have in mind: why Lustre 2.3 here?

HTH

On 11 April 2014 17:13, Vijay Amirtharaj A <[email protected]> wrote:

> Hi Parinay Kondekar,
>
> Thanks for your reply.
>
> I am new to lustre, please explain me how to gather the information.
>
> Regards,
>
> Vijay Amirtharaj A
>
>
> On Fri, Apr 11, 2014 at 2:55 PM, Parinay Kondekar <
> [email protected]> wrote:
>
>> Apr 11 04:31:19 node16 kernel: LustreError:
>> 3185:0:(osc_request.c:1689:osc_brw_redo_request())
>> @@@ redo for recoverable error -5
>> req@ffff8802d1826400 x1464726686245296/t0(0) o4->lustre-OST0002-osc-
>> [email protected]@o2ib:6/4 lens 488/416 e 0 to 0 dl
>> 1397170923 ref 2 fl Interpret:R/0/0 rc -5/-5
>>
>> The ost_write operation failed with -5 . o4 = OST_WRITE
>>
>> The ost_read operation failed with -5 . o3 = OST_READ
>>
>> #define EIO 5 /* I/O error */
>>
>> IMO, check the n/w, esp between clients and OSS.
>> It would be good to know, whats happening on the servers.
>>
>>
>> HTH
>>
>>
>> On 11 April 2014 14:12, Vijay Amirtharaj A
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>>
>>> We have 50 TB storage on lustre, we are using lustre
>>> 2.3.0-2.6.32_279.5.1.el6.x86_64.x86_64 OS: Centos 6.3
>>>
>>> We have 31 compute nodes.
>>>
>>> My issue is:
>>>
>>> When we are restarting storage my jobs are running fine, that is writing
>>> with out any issue.
>>>
>>> After some time, my jobs coming out with this error message:
>>>
>>> /var/spool/torque/mom_priv/jobs/8321.taavare.tuecms.com.SC: line 10: :
>>> No such file or directory
>>> -bash: /lustre/home/bala/.bash_profile: Cannot send after transport
>>> endpoint shutdown
>>> -bash: mpdallexit: command not found
>>>
>>> Following lustre errors are repeating in computing nodes.
>>>
>>> Apr 11 04:31:19 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.46@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 04:31:19 node16 kernel: LustreError: Skipped 1 previous similar
>>> message
>>> Apr 11 04:31:19 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff8802d1826400 x1464726686245296/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397170923 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 04:31:19 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 1 previous
>>> similar message
>>> Apr 11 05:34:07 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.44@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 05:34:07 node16 kernel: LustreError: Skipped 7 previous similar
>>> messages
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff88081a33b400 x1464726686360348/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397174691 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 6 previous
>>> similar messages
>>> Apr 11 05:34:07 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.46@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 05:34:07 node16 kernel: LustreError: Skipped 2 previous similar
>>> messages
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3199:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff880818b19800 x1464726686360319/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397174691 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3199:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 2 previous
>>> similar messages
>>> Apr 11 05:54:13 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.44@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 05:54:13 node16 kernel: LustreError: Skipped 5 previous similar
>>> messages
>>> Apr 11 05:54:13 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff88081a33cc00 x1464726686397633/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397175897 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 05:54:13 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 5 previous
>>> similar messages
>>> Apr 11 06:29:25 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 06:29:25 node16 kernel: LustreError:
>>> 3192:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff88081a249400 x1464726686461600/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397177972 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 06:29:26 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 06:29:26 node16 kernel: LustreError: Skipped 4 previous similar
>>> messages
>>> Apr 11 06:29:26 node16 kernel: LustreError:
>>> 3184:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff8807814bac00 x1464726686461778/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397177973 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 06:29:26 node16 kernel: LustreError:
>>> 3184:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 4 previous
>>> similar messages
>>> Apr 11 06:29:28 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 06:29:28 node16 kernel: LustreError: Skipped 4 previous similar
>>> messages
>>> Apr 11 06:29:28 node16 kernel: LustreError:
>>> 3192:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff88104a184c00 x1464726686461931/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397177975 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 06:29:28 node16 kernel: LustreError:
>>> 3192:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 4 previous
>>> similar messages
>>> Apr 11 07:10:05 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.44@o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 07:10:05 node16 kernel: LustreError: Skipped 4 previous similar
>>> messages
>>> Apr 11 07:10:05 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff88081a33c800 x1464726686536452/t0(0)
>>> o4->[email protected]@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397180449 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 07:10:05 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 3 previous
>>> similar messages
>>>
>>> Apr 11 08:34:31 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45@o2ib. The ost_read operation
>>> failed with -5
>>> Apr 11 08:34:31 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff880bb45a4000 x1464726686700285/t0(0)
>>> o3->[email protected]@o2ib:6/4 lens
>>> 488/400 e 0 to 0 dl 1397185515 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 08:34:57 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45@o2ib. The ost_read operation
>>> failed with -5
>>> Apr 11 08:34:57 node16 kernel: LustreError: Skipped 17 previous similar
>>> messages
>>> Apr 11 08:34:57 node16 kernel: LustreError:
>>> 3196:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5 req@ffff881052fe2800 x1464726686701760/t0(0)
>>> o3->[email protected]@o2ib:6/4 lens
>>> 488/400 e 0 to 0 dl 1397185541 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 08:34:57 node16 kernel: LustreError:
>>> 3196:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 17 previous
>>> similar messages
>>> Apr 11 08:37:35 node16 mpd: mpd ending mpdid=node16_50196 (inside
>>> cleanup)
>>>
>>> Please help me to solve this issue.
>>>
>>> Regards,
>>> Vijay Amirtharaj A
>>>
>>
>
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
