Hello!

On Sep 17, 2009, at 7:28 AM, Lukas Hejtmanek wrote:
> LustreError: 11-0: an error occurred while communicating with x.x....@tcp.
> The mds_connect operation failed with -16
> Lustre: Request x112815827 sent from stable-OST0001-osc-ffff8802855b7800 to
> NID x.x....@tcp 100s ago has timed out (limit 100s).
This looks like your OSTs are overloaded (do you see any "slow ..." messages or watchdog triggers in the logs there?) and are dragging the MDS down with them. The MDS tries to do, e.g., creates on the OSTs, which are slow, so the client times out on the MDS as well. You did not show that in your log, but we can see the MDS refusing the client connection because it thinks it is still processing a request from this client.

The spurious eviction is addressed by adaptive timeouts (enabled by default in 1.8). Reducing the load on the OSTs should also help; read this list, where several methods were discussed recently, such as reducing the number of service threads.

> LustreError: 166-1: mgcx.x....@tcp: Connection to service MGS via nid
> x.x....@tcp was lost; in progress operations using this service will fail.
> Lustre: mgcx.x....@tcp: Reactivating import

Now this is unexpected, and I do not see a timeout, so I do not know what actually happened there.

Bye,
    Oleg
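
As a rough aid for the log check suggested above, here is a minimal Python sketch that counts "slow" messages and request timeouts in a set of log lines and translates the numeric error from "operation failed with -N" into its errno name. The function name `summarize` and the regular expressions are assumptions made for illustration; they are modeled only on the messages quoted in this thread and are not an official Lustre log format or tool.

```python
import errno
import re

# Assumed patterns, modeled on the messages quoted in this thread only.
SLOW_RE = re.compile(r"\bslow\b", re.IGNORECASE)
TIMEOUT_RE = re.compile(r"has timed out \(limit (\d+)s\)")
FAILED_RE = re.compile(r"The (\w+) operation failed with -(\d+)")


def summarize(lines):
    """Count overload symptoms in an iterable of log lines (hypothetical helper)."""
    summary = {"slow": 0, "timeouts": 0, "failures": []}
    for line in lines:
        if SLOW_RE.search(line):
            summary["slow"] += 1
        if TIMEOUT_RE.search(line):
            summary["timeouts"] += 1
        match = FAILED_RE.search(line)
        if match:
            code = int(match.group(2))
            # Map the numeric code to a symbolic errno name where one exists.
            name = errno.errorcode.get(code, str(code))
            summary["failures"].append((match.group(1), name))
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < /path/to/saved/log
    print(summarize(sys.stdin))
```

On Linux, error 16 is EBUSY, which fits the picture of the MDS refusing the connection because it still considers a request from that client to be in progress. A high "slow" count on the OSS nodes around the same time would support the overload explanation.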