CliffW: This helps out a lot!
We still have problems identifying devices. We don't know what their numbers are (I have been using lctl dl), and I don't know how to activate or deactivate them. Do you have an example?

TIA

On Thu, Aug 7, 2008 at 10:59 AM, Cliff White <[EMAIL PROTECTED]> wrote:
> Mag Gam wrote:
>>
>> We do a lot of fluid simulations at my university, but on a similar
>> note I would like to know what the Lustre experts will do in
>> particular simulated scenarios...
>>
>> The environment is this:
>> 30 Servers (All Linux)
>> 1000+ Clients (All Linux)
>>
>> 30 Servers
>> 1 MDS
>> 30 OSTs each with 2TB of storage
>>
>> No fail over capabilities.
>>
>>
>> Scenario 1:
>> Your client is trying to mount lustre filesystem using lustre module,
>> and it hung. Do what?
>
> Answer 0 to all questions:
> "Read the Lustre Manual. File doc bugs in Lustre Bugzilla if there's a part
> you don't understand, or a part missing"
>
> Answer 1 for all your questions.
> "Check syslogs/consoles on the impacted clients.
> Check syslogs/consoles on _all_ lustre servers.
> Pay careful attention to timestamps.
> Work backwards to the first error."
>
> Is the problem restricted to one client or seen by multiple clients?
> If multiple clients, start with the network, use lctl ping to check lustre
> connectivity.
> If a single client, it's generally a client config/network config issue.
>>
>> Scenario 2:
>> Your MDS won't mount up. It's saying, "The server is already running".
>> You try to mount it up a couple of times and still it's not
>
> Be certain the server is not already running.
> Be certain no hung mount processes exist.
> Unload all lustre modules (lustre_rmmod script will do this)
> Retry and -> answer 1
>
>>
>> Scenario 3:
>> OST/OSS reboots due to a power outage. Some files are striped on this,
>> and some aren't. What happens? What to do for minimal outage?
>
> - Clients can be mounted with a dead OST using the exclude options to the
> mount command.
> lfs getstripe can be run from clients to find files
> on the bad OST. See answer 0 for detailed process.
>>
>> Scenario 4:
>> lctl dl shows some devices in "ST" state. What does that mean, and how
>> do I clear it?
>
> ST = stopped.
> Clear this by cleaning up all devices (answer 0)
> or restarting the stopped devices.
> Usually indicates an error/issue with the stopped device, so see
> answer 1.
>>
>>
>> I know some of these scenarios may be ambiguous, but please let me
>> know which so I can further elaborate. I am eventually planning to
>> wiki this for future reference and other lustre newbies.
>
> Please contribute to wiki.lustre.org - there is considerable information
> there already, and a decent existing structure.
>>
>> If anyone else has any other scenarios, please don't be shy and ask
>> away. We can create a good trouble shooting doc similar to the
>> operations manual.
>
> Again, please file doc bugs at bugzilla.lustre.org and contribute to
> wiki.lustre.org, hope this helps!
> cliffw
>
>>
>>
>> TIA
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
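
On the device-number question at the top of this message, here is a rough sketch of how the lctl dl listing could be read to get device numbers and spot "ST" entries. It assumes the usual six-column layout (index, state, type, name, UUID, refcount). The sample text, the device names in it, and the helper names parse_device_list, stopped_devices, and lctl_command are all made up for illustration. The script only builds an "lctl --device N activate/deactivate" argument list and never runs it; which device to act on, and whether activating is the right fix for a stopped device, still needs checking against the manual.

```python
from typing import List, NamedTuple

# Hypothetical sample of `lctl dl` output; names and UUIDs are invented.
SAMPLE_DL = """\
  0 UP mgc MGC192.168.0.20@tcp example-uuid-0000 5
  1 UP lov testfs-mdtlov testfs-mdtlov_UUID 4
  2 UP osc testfs-OST0000-osc testfs-mdtlov_UUID 5
  3 ST osc testfs-OST0001-osc testfs-mdtlov_UUID 5
"""


class Device(NamedTuple):
    index: int      # device number, as passed to `lctl --device`
    state: str      # e.g. "UP" or "ST"
    dev_type: str   # e.g. "osc", "lov", "mgc"
    name: str
    uuid: str
    refcount: int


def parse_device_list(text: str) -> List[Device]:
    """Parse the assumed six-column listing, skipping lines that do not fit."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].isdigit() or not fields[5].isdigit():
            continue
        index, state, dev_type, name, uuid, refcount = fields
        devices.append(Device(int(index), state, dev_type, name, uuid, int(refcount)))
    return devices


def stopped_devices(devices: List[Device]) -> List[Device]:
    """Return the devices reported in the "ST" (stopped) state."""
    return [d for d in devices if d.state == "ST"]


def lctl_command(device: Device, action: str) -> List[str]:
    """Build (but do not run) an lctl activate/deactivate argument list."""
    if action not in ("activate", "deactivate"):
        raise ValueError("action must be 'activate' or 'deactivate'")
    return ["lctl", "--device", str(device.index), action]


if __name__ == "__main__":
    devs = parse_device_list(SAMPLE_DL)
    for d in devs:
        print(f"{d.index:3d} {d.state} {d.dev_type:4s} {d.name}")
    for d in stopped_devices(devs):
        print("stopped:", d.name, "(device", str(d.index) + ")")
    print(" ".join(lctl_command(devs[2], "deactivate")))
```

On a real node the sample text would be replaced by the captured output of lctl dl, and the printed command reviewed by hand before running it.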
