Also, it looks like the client is reporting a different %used compared to the
OSS server itself:
client:

reshpc101:~ # lfs df -h | grep -i 0007
reshpcfs-OST0007_UUID    2.0T    1.7T  202.7G   84% /reshpcfs[OST:7]

oss:

/dev/mapper/mpath7       2.0T    1.9T     40G   98% /gnet/lustre/oss02/mpath7

Here is how the data seems to be distributed on one of the OSSs:

--
/dev/mapper/mpath5       2.0T    1.2T    688G   65% /gnet/lustre/oss02/mpath5
/dev/mapper/mpath6       2.0T    1.7T    224G   89% /gnet/lustre/oss02/mpath6
/dev/mapper/mpath7       2.0T    1.9T     41G   98% /gnet/lustre/oss02/mpath7
/dev/mapper/mpath8       2.0T    1.3T    671G   65% /gnet/lustre/oss02/mpath8
/dev/mapper/mpath9       2.0T    1.3T    634G   67% /gnet/lustre/oss02/mpath9
--

-J

On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <[email protected]> wrote:

> I did deactivate this OST on the MDS server. So how would I deal with an
> OST filling up? The OSTs don't seem to be filling up evenly either. How
> does Lustre handle an OST that is at 100%? Would it not use this specific
> OST for writes if there are other OSTs available with capacity?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <[email protected]> wrote:
>
>> On 2011-02-15, at 12:20, Cliff White wrote:
>> > Client situation depends on where you deactivated the OST - if you
>> > deactivate on the MDS only, clients should be able to read.
>> >
>> > What is best to do when an OST fills up really depends on what else
>> > you are doing at the time, and how much control you have over what the
>> > clients are doing, among other things. If you can solve the space
>> > issue with a quick rm -rf, it is best to leave it online; likewise, if
>> > all your clients are trying to bang on it and failing, it is best to
>> > turn things off. YMMV
>>
>> In theory, with 1.8 the full OST should be skipped for new object
>> allocations, but this is not robust in the face of e.g. a single very
>> large file being written to the OST that takes it from "average" usage
>> to being full.
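For reference, the deactivate/reactivate cycle discussed in this thread can be sketched roughly as below. This is only a sketch under assumptions: the device index (11) is a placeholder you would read from "lctl dl" on the MDS, and a DRY_RUN guard keeps it safe to run on a machine without Lustre:

```shell
# Sketch: temporarily stop new object allocations on a full OST from the
# MDS, without taking existing data offline. Deactivating the OSC on the
# MDS only prevents new allocations; clients can still read the OST.
# ASSUMPTION: device index 11 is hypothetical; get the real index for the
# reshpcfs-OST0007-osc device from "lctl dl" output on your MDS.
DRY_RUN=1

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run lctl dl                        # list devices; note the OSC index for OST0007
run lctl --device 11 deactivate    # stop new allocations on the full OST
# ... free up or migrate data, then re-enable allocations:
run lctl --device 11 activate
```

Note this is different from deactivating on the clients, which would block reads as well.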
>>
>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <[email protected]> wrote:
>> > Hi Guys,
>> >
>> > One of my clients got a hung Lustre mount this morning and I saw the
>> > following errors in my logs:
>> >
>> > --
>> > ..snip..
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous similar messages
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous similar messages
>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous similar messages
>> > Feb 15 10:16:55 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar messages
>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar messages
>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar messages
>> > Feb 15 10:31:43 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
>> > --
>> >
>> > Due to disk space issues on my Lustre filesystem, one of the OSTs was
>> > full and I deactivated that OST this morning. I thought that operation
>> > just puts it in a read-only state and that clients can still access
>> > the data from that OST. After activating this OST again, the client
>> > reconnected and was okay. How else would you deal with an OST that is
>> > close to 100% full? Is it okay to leave the OST active, and will the
>> > clients know not to write data to that OST?
>> >
>> > Thanks,
>> > -J
>> >
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > [email protected]
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Principal Engineer
>> Whamcloud, Inc.
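Beyond deactivating the OST, the usual way to relieve a nearly full OST is to migrate files off it so the data is re-striped across emptier OSTs. A minimal sketch, assuming the /reshpcfs client mount from this thread and a hypothetical file path; the copy-and-rename works because the new copy is allocated fresh objects, which the allocator should place on OSTs with free space:

```shell
# Sketch: move data off the full OST so the rewritten copies land on
# OSTs with free space. ASSUMPTIONS: /reshpcfs is the client mount point
# from this thread, and the example path below is hypothetical.
# DRY_RUN makes this safe to execute on a machine without Lustre.
DRY_RUN=1

migrate_one() {
    f="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would migrate: $f"
        return 0
    fi
    # Copy creates new objects striped onto non-full OSTs, then the rename
    # replaces the original. (Not safe for files open for writing.)
    cp -a "$f" "$f.migrate.tmp" && mv "$f.migrate.tmp" "$f"
}

# Select regular files with objects on the full OST, then migrate them, e.g.:
#   lfs find /reshpcfs --obd reshpcfs-OST0007_UUID -type f |
#       while read -r f; do migrate_one "$f"; done
migrate_one "/reshpcfs/scratch/bigfile.dat"
```

With the full OST deactivated on the MDS first, the migrated copies cannot land back on it.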
