This OST is now at 100% with only 12GB remaining, and something is actively writing to this volume. What would be the appropriate thing to do in this scenario? If I set it to read-only on the MDS, some of my clients start hanging.
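As a rough aid for spotting which volumes are close to full, here is a minimal Python sketch that parses df-style lines and flags anything at or above a usage threshold. The sample lines are copied from the OSS `df` output quoted further down in this thread; the helper names `parse_usage` and `nearly_full` and the 90% default threshold are illustrative assumptions, not part of Lustre.

```python
# Sample lines copied from the OSS df output quoted in this thread.
SAMPLE = """\
/dev/mapper/mpath5 2.0T 1.2T 688G 65% /gnet/lustre/oss02/mpath5
/dev/mapper/mpath6 2.0T 1.7T 224G 89% /gnet/lustre/oss02/mpath6
/dev/mapper/mpath7 2.0T 1.9T 41G 98% /gnet/lustre/oss02/mpath7
/dev/mapper/mpath8 2.0T 1.3T 671G 65% /gnet/lustre/oss02/mpath8
/dev/mapper/mpath9 2.0T 1.3T 634G 67% /gnet/lustre/oss02/mpath9
"""


def parse_usage(text):
    """Return (name, percent_used) pairs from df-style lines.

    Hypothetical helper: expects six whitespace-separated fields per line
    (name, size, used, avail, use%, mount point) and skips anything else.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[4].endswith("%"):
            rows.append((fields[0], int(fields[4].rstrip("%"))))
    return rows


def nearly_full(rows, threshold=90):
    """Names of volumes whose usage is at or above threshold percent."""
    return [name for name, pct in rows if pct >= threshold]


if __name__ == "__main__":
    rows = parse_usage(SAMPLE)
    print("nearly full:", nearly_full(rows))
    # Spread between the fullest and emptiest volume, in percentage points.
    print("spread:", max(p for _, p in rows) - min(p for _, p in rows))
```

The same parsing should also accept the six-column `lfs df -h` line shown in the thread, since it only relies on the position of the `Use%` field.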
Should I be running "lfs find -O OST_UID /lustre" and then move the files out of this filesystem and re-add them? But then there is no guarantee that they will not be written to this specific OST again.

Any help would be greatly appreciated.

Thanks,
-J

On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma <[email protected]> wrote:

> I might be looking at the wrong OST. What is the best way to map the
> actual /dev/mapper/mpath[X] to what OST ID is used for that volume?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <[email protected]> wrote:
>
>> Also, it looks like the client is reporting a different %used compared to
>> the oss server itself:
>>
>> client:
>> reshpc101:~ # lfs df -h | grep -i 0007
>> reshpcfs-OST0007_UUID 2.0T 1.7T 202.7G 84% /reshpcfs[OST:7]
>>
>> oss:
>> /dev/mapper/mpath7 2.0T 1.9T 40G 98% /gnet/lustre/oss02/mpath7
>>
>> Here is how the data seems to be distributed on one of the OSS's:
>> --
>> /dev/mapper/mpath5 2.0T 1.2T 688G 65% /gnet/lustre/oss02/mpath5
>> /dev/mapper/mpath6 2.0T 1.7T 224G 89% /gnet/lustre/oss02/mpath6
>> /dev/mapper/mpath7 2.0T 1.9T  41G 98% /gnet/lustre/oss02/mpath7
>> /dev/mapper/mpath8 2.0T 1.3T 671G 65% /gnet/lustre/oss02/mpath8
>> /dev/mapper/mpath9 2.0T 1.3T 634G 67% /gnet/lustre/oss02/mpath9
>> --
>>
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <[email protected]> wrote:
>>
>>> I did deactivate this OST on the MDS server. So how would I deal with a
>>> OST filling up? The OST's don't seem to be filling up evenly either. How
>>> does lustre handle a OST that is at 100%? Would it not use this specific
>>> OST for writes if there are other OST available with capacity?
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <[email protected]> wrote:
>>>
>>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>>> > Client situation depends on where you deactivated the OST - if you
>>>> > deactivate on the MDS only, clients should be able to read.
>>>> >
>>>> > What is best to do when an OST fills up really depends on what else
>>>> > you are doing at the time, and how much control you have over what the
>>>> > clients are doing and other things. If you can solve the space issue with a
>>>> > quick rm -rf, best to leave it online, likewise if all your clients are
>>>> > trying to bang on it and failing, best to turn things off. YMMV
>>>>
>>>> In theory, with 1.8 the full OST should be skipped for new object
>>>> allocations, but this is not robust in the face of e.g. a single very large
>>>> file being written to the OST that takes it from "average" usage to being
>>>> full.
>>>>
>>>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <[email protected]> wrote:
>>>> > Hi Guys,
>>>> >
>>>> > One of my clients got a hung lustre mount this morning and I saw the
>>>> > following errors in my logs:
>>>> >
>>>> > --
>>>> > ..snip..
>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous similar messages
>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous similar messages
>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous similar messages
>>>> > Feb 15 10:16:55 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar messages
>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar messages
>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar messages
>>>> > Feb 15 10:31:43 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
>>>> > --
>>>> >
>>>> > Due to disk space issues on my lustre filesystem one of the OST's were
>>>> > full and I deactivated that OST this morning. I thought that operation just
>>>> > puts it in a read only state and that clients can still access the data from
>>>> > that OST. After activating this OST again the client connected again and
>>>> > was okay after this. How else would you deal with a OST that is close to
>>>> > 100% full? Is it okay to leave the OST active and the clients will know not
>>>> > to write data to that OST?
>>>> >
>>>> > Thanks,
>>>> > -J
>>>> >
>>>> > _______________________________________________
>>>> > Lustre-discuss mailing list
>>>> > [email protected]
>>>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > Lustre-discuss mailing list
>>>> > [email protected]
>>>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>
>>>>
>>>> Cheers, Andreas
>>>> --
>>>> Andreas Dilger
>>>> Principal Engineer
>>>> Whamcloud, Inc.
>>>>
>>>
>>
>
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
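The "find the files on the full OST and move them" idea from the message above can be sketched in Python as a copy-then-rename step. This is only an illustration under stated assumptions: `build_find_cmd` and `rewrite_in_place` are hypothetical helper names, the `-O` flag is taken from the command quoted in the message (check it against `lfs help find` on your version), and the rewritten copy only lands elsewhere if the full OST is being skipped for new object allocations, as discussed in the thread. Files that are open for writing should not be rewritten this way.

```python
import os
import shutil
import tempfile


def build_find_cmd(ost_uuid, mount_point):
    """Build (but do not run) the lfs find command quoted in the message.

    The -O flag is copied from the poster's command line; it is not
    verified here against any particular Lustre release.
    """
    return ["lfs", "find", "-O", ost_uuid, mount_point]


def rewrite_in_place(path):
    """Copy path to a temp file in the same directory, then rename over it.

    The new copy gets freshly allocated storage; the rename swaps it into
    place atomically on the same filesystem. File ownership is NOT
    preserved by shutil.copy2, so restore it separately if that matters.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".migrate-")
    os.close(fd)
    try:
        shutil.copy2(path, tmp)  # copies data plus mode and timestamps
        os.replace(tmp, path)    # atomic rename within one filesystem
    except BaseException:
        # Clean up the temporary copy if anything went wrong.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


if __name__ == "__main__":
    # UUID and mount point as they appear in the thread's lfs df output.
    print(" ".join(build_find_cmd("reshpcfs-OST0007_UUID", "/reshpcfs")))
```

In practice each path printed by the find command would be passed to `rewrite_in_place`, ideally starting with the largest files so space is recovered quickly.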
