Hello all,
We are running Lustre 1.6.6 with a shared MGS/MDT and 3 OSTs. We run a
set of tests that write heavily, then we review the results and delete
the data. Usually the load is spread evenly across all 3 OSTs, but I
noticed this afternoon that it is not being distributed:

  OST0000 has a load of 50+ with iowait of around 10%
  OST0001 has a load of <1 with >99% idle
  OST0002 has a load of <1 with >99% idle

From a client, all 3 OSTs appear online:

[aever...@englogin01 ~]$ lctl device_list
  0 UP mgc mgc172.16.14...@tcp 19dde65d-8eba-22b0-b618-f59bfbd36cde 5
  1 UP lov fortefs-clilov-f7cc4800 c86e2947-f2bf-5e47-541f-6ff3f13af9a0 4
  2 UP mdc fortefs-MDT0000-mdc-f7cc4800 c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5
  3 UP osc fortefs-OST0000-osc-f7cc4800 c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5
  4 UP osc fortefs-OST0001-osc-f7cc4800 c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5
  5 UP osc fortefs-OST0002-osc-f7cc4800 c86e2947-f2bf-5e47-541f-6ff3f13af9a0 5
[aever...@englogin01 ~]$

The MGS/MDT claims Lustre is healthy:

[aever...@lustrefs ~]$ cat /proc/fs/lustre/health_check
healthy
[aever...@lustrefs ~]$

df confirms the lopsided writes:

OST0000:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.2T  602G  544G  53% /mnt/fortefs/ost0

OST0001:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.2T  317G  828G  28% /mnt/fortefs/ost0

OST0002:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1             1.2T  315G  831G  28% /mnt/fortefs/ost0

What else should I be checking? Has the MGS/MDT lost track of OST0001
and OST0002 somehow? Clients can still read data that is on OST0001
and OST0002; I confirmed this with lfs getstripe and by cat'ing files
on those devices. If I edit one of those files, though, the new data
is written to OST0000.

Regards,
Aaron
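
P.S. A few more checks I plan to run, in case they help anyone spot
the problem. First, per-OST usage as a client sees it, via lfs df
rather than running df on each OSS (the mount point below is just a
placeholder for our actual client mount point):

  lfs df -h /mnt/lustre

If I understand it correctly, lfs df should also flag any OST that the
client is treating as inactive.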
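
Second, on the MGS/MDT, whether the MDS-side LOV still lists all three
OSTs as active, since that is where objects for new writes get
allocated. This assumes the usual 1.6 /proc layout (the exact lov
device name on our MDS may differ, hence the wildcard):

  cat /proc/fs/lustre/lov/*/target_obd

If I remember the format correctly, each OST should appear there with
its UUID followed by ACTIVE; an INACTIVE entry would explain new
objects landing only on OST0000.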
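
Finally, re-checking exactly where a freshly written file's objects
land (the path below is only an example):

  lfs getstripe /path/to/newly/written/file

The obdidx column in the output shows which OST index holds each of
the file's objects, so a few test writes should make any bias toward
OST0000 obvious.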
