Also, it looks like the client is reporting a different %used compared to the
OSS server itself:
client:

reshpc101:~ # lfs df -h | grep -i 0007
reshpcfs-OST0007_UUID    2.0T    1.7T  202.7G   84% /reshpcfs[OST:7]

oss:

/dev/mapper/mpath7       2.0T    1.9T     40G   98% /gnet/lustre/oss02/mpath7

Here is how the data seems to be distributed on one of the OSSs:

--
/dev/mapper/mpath5       2.0T    1.2T    688G   65% /gnet/lustre/oss02/mpath5
/dev/mapper/mpath6       2.0T    1.7T    224G   89% /gnet/lustre/oss02/mpath6
/dev/mapper/mpath7       2.0T    1.9T     41G   98% /gnet/lustre/oss02/mpath7
/dev/mapper/mpath8       2.0T    1.3T    671G   65% /gnet/lustre/oss02/mpath8
/dev/mapper/mpath9       2.0T    1.3T    634G   67% /gnet/lustre/oss02/mpath9
--

-J

On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <[email protected]> wrote:

> I did deactivate this OST on the MDS server. So how would I deal with an
> OST filling up? The OSTs don't seem to be filling up evenly either. How
> does Lustre handle an OST that is at 100%? Would it not use this specific
> OST for writes if there are other OSTs available with capacity?
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <[email protected]> wrote:
>
>> On 2011-02-15, at 12:20, Cliff White wrote:
>> > Client situation depends on where you deactivated the OST - if you
>> > deactivate on the MDS only, clients should be able to read.
>> >
>> > What is best to do when an OST fills up really depends on what else
>> > you are doing at the time, and how much control you have over what the
>> > clients are doing, among other things. If you can solve the space
>> > issue with a quick rm -rf, it is best to leave it online; likewise, if
>> > all your clients are trying to bang on it and failing, it is best to
>> > turn things off. YMMV
>>
>> In theory, with 1.8 the full OST should be skipped for new object
>> allocations, but this is not robust in the face of e.g. a single very
>> large file being written to the OST that takes it from "average" usage
>> to being full.
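For reference, the deactivate/reactivate cycle discussed in this thread can be sketched roughly as below. This is only a sketch under assumptions: the device index (11) is a placeholder you would read from "lctl dl" on the MDS, and a DRY_RUN guard keeps it safe to run on a machine without Lustre:

```shell
# Sketch: temporarily stop new object allocations on a full OST from the
# MDS, without taking existing data offline. Deactivating the OSC on the
# MDS only prevents new allocations; clients can still read the OST.
# ASSUMPTION: device index 11 is hypothetical; get the real index for the
# reshpcfs-OST0007-osc device from "lctl dl" output on your MDS.
DRY_RUN=1

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run lctl dl                        # list devices; note the OSC index for OST0007
run lctl --device 11 deactivate    # stop new allocations on the full OST
# ... free up or migrate data, then re-enable allocations:
run lctl --device 11 activate
```

Note this is different from deactivating on the clients, which would block reads as well.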
>>
>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <[email protected]> wrote:
>> > Hi Guys,
>> >
>> > One of my clients got a hung Lustre mount this morning and I saw the
>> > following errors in my logs:
>> >
>> > --
>> > ..snip..
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous similar messages
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous similar messages
>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous similar messages
>> > Feb 15 10:16:55 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar messages
>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar messages
>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar messages
>> > Feb 15 10:31:43 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
>> > --
>> >
>> > Due to disk space issues on my Lustre filesystem, one of the OSTs was
>> > full and I deactivated that OST this morning. I thought that operation
>> > just puts it in a read-only state and that clients can still access
>> > the data from that OST. After activating this OST again, the client
>> > reconnected and was okay. How else would you deal with an OST that is
>> > close to 100% full? Is it okay to leave the OST active, and will the
>> > clients know not to write data to that OST?
>> >
>> > Thanks,
>> > -J
>> >
>> > _______________________________________________
>> > Lustre-discuss mailing list
>> > [email protected]
>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Principal Engineer
>> Whamcloud, Inc.
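Beyond deactivating the OST, the usual way to relieve a nearly full OST is to migrate files off it so the data is re-striped across emptier OSTs. A minimal sketch, assuming the /reshpcfs client mount from this thread and a hypothetical file path; the copy-and-rename works because the new copy is allocated fresh objects, which the allocator should place on OSTs with free space:

```shell
# Sketch: move data off the full OST so the rewritten copies land on
# OSTs with free space. ASSUMPTIONS: /reshpcfs is the client mount point
# from this thread, and the example path below is hypothetical.
# DRY_RUN makes this safe to execute on a machine without Lustre.
DRY_RUN=1

migrate_one() {
    f="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would migrate: $f"
        return 0
    fi
    # Copy creates new objects striped onto non-full OSTs, then the rename
    # replaces the original. (Not safe for files open for writing.)
    cp -a "$f" "$f.migrate.tmp" && mv "$f.migrate.tmp" "$f"
}

# Select regular files with objects on the full OST, then migrate them, e.g.:
#   lfs find /reshpcfs --obd reshpcfs-OST0007_UUID -type f |
#       while read -r f; do migrate_one "$f"; done
migrate_one "/reshpcfs/scratch/bigfile.dat"
```

With the full OST deactivated on the MDS first, the migrated copies cannot land back on it.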
