On Aug 13, 2010, at 20:49, Adrian Ulrich wrote:

> Hi,
>
> Since a few hours we have a problem with one of our OSTs:
>
> One (and only one) ll_ost_create_ process on one of the OSTs
> seems to go crazy and uses 100% CPU.
>
> Rebooting the OST + MDS didn't help and there isn't much
> going on on the filesystem itself:
>
>  - /proc/fs/lustre/ost/OSS/ost_create/stats is almost 'static'
>  - iostat shows almost no usage
>  - ib traffic is < 100 kb/s
>
>
> The MDS logs this each ~3 minutes:
> Aug 13 19:11:14 mds1 kernel: LustreError: 11-0: an error occurred while
> communicating with 10.201.62...@o2ib. The ost_connect operation failed with
> -16
> ..and later:
> Aug 13 19:17:16 mds1 kernel: LustreError:
> 10253:0:(osc_create.c:390:osc_create()) lustre1-OST0005-osc: oscc recovery
> failed: -110
> Aug 13 19:17:16 mds1 kernel: LustreError:
> 10253:0:(lov_obd.c:1129:lov_clear_orphans()) error in orphan recovery on OST
> idx 5/32: rc = -110
> Aug 13 19:17:16 mds1 kernel: LustreError:
> 10253:0:(mds_lov.c:1022:__mds_lov_synchronize()) lustre1-OST0005_UUID failed
> at mds_lov_clear_orphans: -110
> Aug 13 19:17:16 mds1 kernel: LustreError:
> 10253:0:(mds_lov.c:1031:__mds_lov_synchronize()) lustre1-OST0005_UUID sync
> failed -110, deactivating
> Aug 13 19:17:54 mds1 kernel: Lustre:
> 6544:0:(import.c:508:import_select_connection()) lustre1-OST0005-osc: tried
> all connections, increasing latency to 51s

-110 = -ETIMEDOUT: the operation did not finish before its deadline, or there is a network problem.
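
For reference, the negative numbers in these log lines are Linux errno values: -16 is -EBUSY and -110 is -ETIMEDOUT. A minimal sketch for decoding them follows. The table and the decode_rc helper are made up for illustration and are not part of Lustre; the values are hard-coded from the Linux errno headers so the result does not depend on the machine running the script.

```python
# Linux errno values for the two codes seen in the MDS log.
# Hard-coded (assumption: Linux numbering) so the lookup is platform-independent.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_rc(rc):
    """Translate a negative kernel-style return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc} = -{name} ({text})"


if __name__ == "__main__":
    # The return codes reported by ost_connect and the orphan recovery path.
    for rc in (-16, -110):
        print(decode_rc(rc))
```

Any other code can be looked up the same way by extending the table.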
> oops! (lustre1-OST0005 is hosted on the OSS with the crazy ll_ost_create
> process)

ll_ost_create works on destroying previously created objects, I think.

> On the affected OSS we get
> Lustre: 11764:0:(ldlm_lib.c:835:target_handle_connect()) lustre1-OST0005:
> refuse reconnection from [email protected]@o2ib to
> 0xffff8102164d0200; still busy with 2 active RPCs
>
>
> $ llog_reader lustre-log.1281718692.11833 shows:

llog_reader is a tool for reading configuration llogs. If you want to decode a debug log, you should use
lctl df $file > $output

> And we get tons of soft-cpu lockups :-/
>
> Any ideas?

Please post the soft-lockup report. One possibility is that the MDS is asking for too many objects to be created on that OST, or that the OST has too many reconnects.

> Regards,
> Adrian
>
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
